> I thought entropy (in the Shannon sense) was a property of discrete and finite probability distributions. It's essentially a measure of how random a sample from such a probability distribution is. Notably, continuous probability distributions don't have meaningful entropy (or in some sense, their entropy is always infinite).
True, but for continuous distributions you can use the KL divergence against a uniform distribution :)
One of the properties of entropy H(X) of a random variable X is that if f is a bijective function then H(f(X)) = H(X).
For relative entropy (or "KL divergence" as some people call it), we have that H(X||Y) = H(f(X)||f(Y)). But if you fix Y to have a continuous uniform distribution, then you lose this critical property because f(Y) may no longer have a continuous uniform distribution.
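A minimal numerical sketch of the discrete case (helper names `entropy` and `kl` are mine, not from the thread): a bijection on a discrete outcome space merely permutes the probability values, so both H(X) and the relative entropy are unchanged.

```python
# Discrete entropy and KL divergence are invariant under a bijective
# relabeling of outcomes: the probability values are only permuted.
import math

def entropy(p):
    """Shannon entropy in nats of a dict {outcome: probability}."""
    return -sum(q * math.log(q) for q in p.values() if q > 0)

def kl(p, q):
    """Relative entropy D(p || q) in nats; assumes support(p) is in support(q)."""
    return sum(pv * math.log(pv / q[x]) for x, pv in p.items() if pv > 0)

p = {0: 0.5, 1: 0.3, 2: 0.2}
q = {0: 0.2, 1: 0.5, 2: 0.3}

f = lambda x: (x + 1) % 3                 # a bijection on the outcome space
fp = {f(x): pv for x, pv in p.items()}
fq = {f(x): qv for x, qv in q.items()}

print(abs(entropy(p) - entropy(fp)) < 1e-12)   # H(X) == H(f(X)) -> True
print(abs(kl(p, q) - kl(fp, fq)) < 1e-12)      # D(X||Y) == D(f(X)||f(Y)) -> True
```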
Apparently this "critical property" is not so important to all the people who use relative entropy as a generalization to a continuous distribution defined on a space with an underlying measure.
Why would they care about arbitrary transformations mapping points in the space to other points in the space?
What I think it means is that if you take two different parametrizations of the same physical phenomenon, then you get two different entropy values.
E.g. suppose you have a bunch of particles with fixed mass. You could look at the distribution of speeds and get one entropy, then at the distribution of kinetic energy (basically speed squared) and get another. A uniform distribution of speeds means a non-uniform distribution of squared speeds, so the entropies would disagree.
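A small Monte Carlo sketch of this point (numbers and setup are mine, for illustration): take speed v uniform on (0, 10) and E = v^2, and the differential entropies h(v) and h(E) come out different, even though the map is bijective here.

```python
# Differential entropy is not invariant under reparametrization:
# v ~ Uniform(0, 10) vs. E = v^2 give different values.
import math, random

random.seed(0)
N = 200_000
vs = [random.uniform(0.0, 10.0) for _ in range(N)]

# h(v) for Uniform(0, 10) is exactly log(10).
h_v = math.log(10.0)

# Change of variables: if v ~ Uniform(0, 10) and E = v^2, then
# p_E(e) = 1 / (20 * sqrt(e)) on (0, 100).  Monte Carlo estimate of
# h(E) = E[-log p_E(E)]; at e = v^2 this is log(20 * v).
h_E = sum(math.log(20.0 * v) for v in vs) / N

print(round(h_v, 3))   # 2.303
print(h_E)             # analytic value is log(200) - 1 ~ 4.298
```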
Physical entropy is defined from the probability distribution over states. Velocities or squared velocities are not states; they are derived quantities. Points in a phase space would describe states. Physical states are discrete anyway when you consider quantum physics :-)
As for the entropy of probability distributions in general, I think relative entropy is invariant under reparametrizations because both the probability of interest and the reference probability transform in the same way [1]. But I don't remember what that means exactly. [And I am not sure if that makes ogogmad wrong; I may not have understood his comment well.]
([Edit: forget this aside. You probably were talking about speeds as positive magnitudes.] By the way, using an example analogous to yours, discrete entropy wouldn't be invariant either: if you have a distribution on {-1, 1} and square it, it collapses to a zero-entropy singleton {1}.)
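A quick check of this aside (the helper `entropy_bits` is mine): squaring is not bijective on {-1, 1}, and the uniform distribution there collapses to a point mass, dropping the discrete entropy from 1 bit to 0.

```python
# Squaring a uniform distribution on {-1, 1} collapses it to {1},
# so the discrete entropy falls from 1 bit to 0 bits.
import math
from collections import Counter

def entropy_bits(p):
    """Shannon entropy in bits of a dict {outcome: probability}."""
    return sum(-q * math.log2(q) for q in p.values() if q > 0)

p = {-1: 0.5, 1: 0.5}
sq = Counter()
for x, prob in p.items():
    sq[x * x] += prob          # -1 and 1 both map to 1

print(entropy_bits(p))         # 1.0
print(entropy_bits(sq))        # 0.0 (all mass on the single outcome 1)
```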
+1. The commenter above also cared about bijective mappings, and squaring a random variable in [-1, 1] is not bijective. Squaring a random variable defined over positive real numbers would be a bijective mapping and the distribution would still remain uniform.
Actually, I find it hard to come up with a bijective mapping that leads to a non-uniform distribution that's useful for anything practical.
Ok, so first, to have a uniform distribution we have to have a bounded set. Maybe you can do something clever with limits, but let's not overcomplicate things. Let's say we have 0 <= v < 10 and define E = v^2. Then 0 <= E < 100.
Uniformity of v would mean that p(0 <= v < 1) = 1 / 10
Uniformity of E would mean that p(0 <= E < 1) = 1 / 100
But by construction p(0 <= v < 1) = p(0 <= E < 1). So it's not possible for both to be uniform.
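The argument above can be checked numerically (a throwaway sketch, variable names mine): sampling v uniformly on [0, 10), the events {0 <= v < 1} and {0 <= E < 1} coincide exactly, so both have probability 1/10, whereas uniformity of E on [0, 100) would require 1/100.

```python
# Monte Carlo check: v uniform on [0, 10), E = v^2.  The events
# {0 <= v < 1} and {0 <= E < 1} are the same event, with probability
# 1/10 -- incompatible with E being uniform on [0, 100).
import random

random.seed(1)
N = 1_000_000
hits_v = hits_E = 0
for _ in range(N):
    v = random.uniform(0.0, 10.0)
    E = v * v
    hits_v += (0.0 <= v < 1.0)
    hits_E += (0.0 <= E < 1.0)

print(hits_v == hits_E)        # True: same event by construction
print(round(hits_v / N, 2))    # 0.1, not the 0.01 uniformity of E would need
```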
It's not necessary to have p(0 <= v < 1) = p(0 <= E < 1), only that f(X) be uniformly distributed.
But this does bring up a good point. H(X||Y) = H(f(X)||f(Y)) for any bijective f if the distributions are discrete. When they are continuous this is not true, even with a bijective f. For example f(x) = x^2 doesn't work even though it is a bijection (on the nonnegative reals). Interestingly, however, affine transformations do work.
Yeah, you also have to transform the "reference" function, and then the entropy stays the same. I prefer to think of it as the "density of states" -- it's necessary to make the argument of the logarithm dimensionless, after all.
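This cancellation can be demonstrated with a small Monte Carlo sketch (the densities here are mine, chosen for illustration): transform *both* densities with the Jacobian of y = x^2 and the relative entropy is unchanged, because the |dy/dx| factors cancel inside the logarithm.

```python
# KL divergence between two continuous densities is preserved under a
# smooth bijection if the reference density is transformed too: the
# Jacobian factors cancel inside the log, term by term.
import math, random

random.seed(2)
N = 100_000

p = lambda x: 2.0 * x                # density of X on (0, 1)
q = lambda x: 3.0 * x * x            # reference density on (0, 1)

# Under y = x^2 (bijective on (0, 1)), dx/dy = 1 / (2*sqrt(y)), so:
pY = lambda y: 1.0                   # p transforms to Uniform(0, 1)
qY = lambda y: 1.5 * math.sqrt(y)    # q transforms to (3/2)*sqrt(y)

xs = [math.sqrt(random.random()) for _ in range(N)]   # inverse-CDF samples of p

kl_x = sum(math.log(p(x) / q(x)) for x in xs) / N
kl_y = sum(math.log(pY(x * x) / qY(x * x)) for x in xs) / N

print(abs(kl_x - kl_y) < 1e-9)       # True: identical up to rounding
print(kl_x)                          # analytic value: log(2/3) + 1/2 ~ 0.095
```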