> Yeah sure it's a projection, but so what? Why and when should an L2 norm projection be useful?
We want the answer, and more generally a predictive model.

We use the data we have to define a vector subspace. We use that subspace to creep up on the answer we want. Finally, to get the answer, we project onto the subspace we have settled on. We CAN do the projection even though we don't have the answer. We take that projection onto the subspace, that point in that subspace, as our approximation to the answer.
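For concreteness, a minimal sketch of that projection (assuming numpy, with made-up data); the fitted values are the orthogonal projection of y onto the column space of X, and the residual is orthogonal to that subspace:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 200, 3
    X = rng.normal(size=(n, p))          # columns span the subspace we settle on
    y = rng.normal(size=n)               # the "answer" we only observe noisily

    # Projection of y onto the column space of X (least squares).
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta                     # the point in the subspace we take as our approximation

    # The residual is orthogonal to every column of X (up to round-off).
    print(X.T @ (y - y_hat))             # ~ [0, 0, 0]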
We use L^2 because that projection, with the Pythagorean theorem, is the closest in L^2.
Being close in L^2 is a good thing: L^2 is a Hilbert space, so a Cauchy sequence converges in L^2, and then there is a subsequence that converges almost surely, with probability 1, exactly, etc., and in practice it's exactly. A sequence that converges in L^2 but not exactly is pathological.
Then as we use more variables, the vector subspace gets closer to the answer and we get a better projection, although maybe not a better-fitting model.
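A small illustration of that point (a sketch assuming numpy, purely synthetic data): adding even pure-noise columns always shrinks the in-sample residual, so the projection gets "better", but it does nothing for prediction on fresh data:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100
    x = rng.normal(size=(n, 1))
    y = 2.0 * x[:, 0] + rng.normal(size=n)            # the truth uses one variable
    x_new = rng.normal(size=(n, 1))
    y_new = 2.0 * x_new[:, 0] + rng.normal(size=n)

    for extra in (0, 10, 50):
        junk = rng.normal(size=(n, extra))            # irrelevant extra columns
        X = np.hstack([x, junk])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        in_sample = np.sum((y - X @ beta) ** 2)
        X_new = np.hstack([x_new, rng.normal(size=(n, extra))])
        out_sample = np.sum((y_new - X_new @ beta) ** 2)
        print(extra, round(in_sample, 1), round(out_sample, 1))
    # in-sample SSE always goes down; out-of-sample SSE does not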
To me, tails of distributions have next to nothing to do with it or the use of L^2.
In particular, the basic regression math, the normal equations, the Pythagorean theorem

total sum of squares = regression sum of squares + error sum of squares

makes no assumptions at all about distributions (one might want to argue that the variables need to be in L^2, that is, for each random variable X have E[X^2] finite, or maybe at least E[|X|] finite, that is, L^1) and certainly does not assume a Gaussian.
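A quick numerical check of that decomposition (a sketch assuming numpy, with deliberately non-Gaussian made-up data; an intercept column is included so the usual centered identity applies):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 500
    x = rng.exponential(size=(n, 2))                  # deliberately non-Gaussian regressors
    y = 1.0 + x @ np.array([0.5, -2.0]) + rng.standard_t(df=5, size=n)   # non-Gaussian errors

    X = np.column_stack([np.ones(n), x])              # design matrix with intercept
    beta = np.linalg.solve(X.T @ X, X.T @ y)          # the normal equations
    y_hat = X @ beta

    tss = np.sum((y - y.mean()) ** 2)                 # total sum of squares
    rss = np.sum((y_hat - y.mean()) ** 2)             # regression sum of squares
    ess = np.sum((y - y_hat) ** 2)                    # error sum of squares
    print(np.isclose(tss, rss + ess))                 # True, by the Pythagorean theorem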
Moreover the matrix in the normal equations has variances and covariances, but these are close to just the inner product in the Hilbert space L^2 and do NOT assume a Gaussian or anything about a distribution except merely that the expectations exist, which in practice is a meager assumption, essentially always the case.
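For instance (a sketch assuming numpy): after centering, the matrix in the normal equations is just a matrix of inner products, i.e. the sample covariance matrix up to scale, and no distributional assumption enters:

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.uniform(size=(1000, 4))        # any distribution with finite second moments
    Xc = X - X.mean(axis=0)                # center the columns

    gram = Xc.T @ Xc / (len(X) - 1)        # (scaled) inner products from the normal equations
    print(np.allclose(gram, np.cov(X, rowvar=False)))   # True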
Just because we are working with variances and covariances does NOT mean that we are assuming a Gaussian.
Also notice that nowhere do we need to assume that our data is not discrete. Discrete data is still fine.
If in addition we do have an appropriate Gaussian assumption, then at the end of the usual normal equations stuff, we can get an F ratio and some t-tests that usually can help evaluate the quality of the fitting and the importance of the coefficients.
And for the model, we can get confidence intervals on the predicted values -- sure, we will need a Gaussian assumption here, but there might be other approaches and otherwise we can hope or look for robust estimators.
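As a concrete sketch of both points (assuming numpy and statsmodels, with made-up data): under the Gaussian model the fitted OLS results carry the F ratio, the per-coefficient t-tests, and confidence intervals for the predicted values:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    x = rng.normal(size=(200, 2))
    y = 1.0 + x @ np.array([2.0, 0.0]) + rng.normal(size=200)   # second slope is truly 0

    X = sm.add_constant(x)
    res = sm.OLS(y, X).fit()
    print(res.fvalue, res.f_pvalue)          # overall F ratio for the regression
    print(res.tvalues, res.pvalues)          # t-tests on the individual coefficients

    # confidence intervals for the predicted (mean) values at new design points
    X_new = sm.add_constant(rng.normal(size=(5, 2)))
    print(res.get_prediction(X_new).conf_int(alpha=0.05))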
L^2's fine with me.
More can be done, but the above is the first-cut reason for using subspaces and L^2 projections, although this explanation is rarely taught to students.
When ordinary regression doesn't work very well, then I'm reluctant to strain with much more. I might be willing in some circumstances to use what L. Breiman (I respect Breiman, a Loève student) has in his Classification and Regression Trees.
No no no, you are thinking about this one a little wrong.
Yes, the Pythagorean identity, or Pythagorean decomposition, whatever you want to call it, and the orthogonality properties etc. make L2 super convenient to apply and reason about, but that does not mean it is a suitable performance metric to use. This is again the phenomenon of searching for the answer where it's well lit vs. where the answer lies.
The problem really lies in the tails of the errors, although it might not be immediately apparent. You say variances, covariances, etc., but it does not take much for RVs not to possess them. Think about it: Chebyshev's inequality will make it clear what tail decay you need for those to exist.
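To see how little it takes, here is a sketch (assuming numpy; the Pareto tail index 1.5 is an arbitrary choice) of a distribution whose mean exists but whose tail falls slower than 1/t^2, so the second moment is infinite and the sample variance never settles down:

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.pareto(1.5, size=10**6)                # tail ~ t^(-1.5): E|X| finite, E[X^2] infinite
    for n in (10**3, 10**4, 10**5, 10**6):
        print(n, x[:n].var())                      # keeps jumping around instead of converging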
Yes, you are right that the choice of L2 by itself makes no assumptions. But once you start analyzing the performance of the estimator it will show up, particularly when the Fisher information is degenerate (the log-likelihood is no longer strongly concave); Cramér-Rao becomes a vacuous bound in that case. The other problem is that L2 balls become infinitely larger than, say, the L1 ball as the dimensionality increases, so L2 error stops being very good at localizing a vector.
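The volume point is easy to check numerically (a sketch using only Python's standard library; the dimensions listed are arbitrary). The unit L2 ball has volume pi^(d/2)/Gamma(d/2 + 1) and the unit L1 ball has volume 2^d/d!, and their ratio blows up with the dimension d:

    import math

    for d in (2, 5, 10, 20, 50):
        vol_l2 = math.pi ** (d / 2) / math.gamma(d / 2 + 1)   # unit Euclidean ball
        vol_l1 = 2 ** d / math.factorial(d)                   # unit cross-polytope (L1 ball)
        print(d, vol_l2 / vol_l1)                             # grows without bound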
The problem is that L2 is not a good metric to measure inaccuracy with; it is super convenient to use, though.
Let me demonstrate one of its well-known problems. Say the error residual has tails that do not decay as fast as an exponential (the MGF does not exist in a neighborhood of 0). You can then show that the accuracy can be arbitrarily bad. No need for super ugly densities: something as benign as a mixture of two Gaussians with the same expectation will break it.
I am just restating Huber's robustness argument here in a different way.
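A small simulation in that spirit (a sketch assuming numpy; the contamination fraction and scale are arbitrary): the noise is a two-component Gaussian mixture with the same zero mean, and as the rare component's scale grows, the L2-optimal location estimate (the sample mean) degrades while the L1-optimal one (the sample median) barely notices:

    import numpy as np

    rng = np.random.default_rng(6)
    n, trials = 100, 2000
    for scale in (1.0, 10.0, 100.0):
        # mixture: 90% N(0,1), 10% N(0, scale^2) -- same expectation in both components
        wide = rng.random((trials, n)) < 0.1
        noise = np.where(wide, rng.normal(0, scale, (trials, n)), rng.normal(0, 1, (trials, n)))
        mse_mean = np.mean(noise.mean(axis=1) ** 2)            # L2 estimate of the (zero) location
        mse_median = np.mean(np.median(noise, axis=1) ** 2)    # L1 estimate
        print(scale, mse_mean, mse_median)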
Gauss himself was well aware of the problem and chose L2 for convenience. Even his peers were well aware of situations where L1 made more sense than L2. Unfortunately, optimization theory was not developed enough at that time to make L1-based methods practical. But now we can.
BTW you might find it interesting that the Pythagorean property is not limited to squared L2. The set of all 'divergences' for which it holds is the Bregman divergence family; it's an if-and-only-if condition. This is a result that came from the ML community, though one side of the implication was already known. Bregman divergences correspond to the log-likelihoods of exponential family densities (also called the Darmois-Koopman family in older literature), hence they have strong connections with the NP lemma as well. What I find mind-boggling is that Bregman divergences had their origin in the convex optimization literature! Their origins had absolutely nothing, nothing to do with probability and stats. It's amazing when two separate fields of math make contact in these ways.
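For concreteness, a tiny sketch (assuming numpy, made-up points) of the Bregman divergence D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>: with phi the squared norm it reduces to squared Euclidean distance, and with the negative entropy it gives the (generalized) KL divergence:

    import numpy as np

    def bregman(phi, grad_phi, x, y):
        # D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>
        return phi(x) - phi(y) - grad_phi(y) @ (x - y)

    x = np.array([0.2, 0.5, 0.3])
    y = np.array([0.4, 0.4, 0.2])

    # phi = squared Euclidean norm  ->  squared Euclidean distance
    sq = bregman(lambda v: v @ v, lambda v: 2 * v, x, y)
    print(np.isclose(sq, np.sum((x - y) ** 2)))                      # True

    # phi = negative entropy  ->  generalized KL divergence
    kl = bregman(lambda v: np.sum(v * np.log(v)),
                 lambda v: np.log(v) + 1, x, y)
    print(np.isclose(kl, np.sum(x * np.log(x / y) - x + y)))         # True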
BTW if you don't mind sharing your suitably anonymized mail, I am srean.list at gmail. It will be a pleasure to discuss math with you. I find it very helpful when I can stress-test an idea against a human oracle. HN might not be a forum well suited for this.
Yes, now we can also do best L^1 approximations. IIRC -- I'm in a hurry this morning -- it's a linear programming problem or some such, but I haven't thought about that in decades.
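That recollection is right: minimizing the sum of absolute residuals can be set up as a linear program with one slack variable per residual. A minimal sketch (assuming numpy and scipy, made-up data):

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(7)
    n, p = 100, 3
    X = rng.normal(size=(n, p))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_cauchy(size=n)  # heavy-tailed noise

    # minimize sum(t) subject to -t <= y - X b <= t, with variables z = (b, t)
    c = np.concatenate([np.zeros(p), np.ones(n)])
    A_ub = np.block([[-X, -np.eye(n)],
                     [ X, -np.eye(n)]])
    b_ub = np.concatenate([-y, y])
    bounds = [(None, None)] * p + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    print(res.x[:p])          # the L^1 (least absolute deviations) coefficients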
You bring up a lot of stuff I've never reviewed.
> but that does not mean it is a suitable performance metric to use.
If the plane we do the L^2 projection onto is really close, then the error is really small, and what's not to like? Really close is good enough for me.
Right, there are three biggie choices: L^1 (minimize the sum of absolute values of the errors), L^2 (minimize the sum of squared errors), and L^infinity (minimize the worst error). L^2's fine with me, good enough for government work, okay first cut, day in and day out.
If in some particular case L^2 has some problems, then maybe something like regression is not the right tool for the problem.
Gee, we do a lot of L^2: We get orthogonal components so that we can get L^2 cafeteria style, just pick the components we want.
E.g., in filtering stochastic processes, we take a Fourier transform -- that is, finding the coefficients of the sample path on the sine-cosine orthogonal components. Or we do a convolution, which is the same thing in the end. The approximation we get is an L^2 approximation.
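A rough sketch of that (assuming numpy, synthetic signal): keeping only some of the Fourier coefficients and transforming back is an orthogonal projection, so by Parseval it is the best L^2 approximation among signals built from the kept frequencies:

    import numpy as np

    rng = np.random.default_rng(8)
    n = 1024
    t = np.arange(n) / n
    signal = np.sin(2 * np.pi * 5 * t) + 0.5 * rng.normal(size=n)   # slow sine plus noise

    coef = np.fft.rfft(signal)
    coef[20:] = 0.0                          # keep only the low-frequency components
    smoothed = np.fft.irfft(coef, n)         # orthogonal projection onto that span

    # Pythagorean check: ||signal||^2 = ||smoothed||^2 + ||signal - smoothed||^2
    print(np.isclose(np.sum(signal**2),
                     np.sum(smoothed**2) + np.sum((signal - smoothed)**2)))   # True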
So, with JPG -- sure, it's L^2. Right, JPG does funny stuff near lines. Life's not perfect!
For some things, sure, we want L^infinity: e.g., if we want to use a quotient of polynomials to approximate the usual special functions, then we want to minimize the worst error and do what is called Chebyshev approximation, but this is very specialized.
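A quick sketch of that flavor (assuming numpy only; the function and degree are arbitrary, and interpolation at Chebyshev points stands in for a true minimax/Remez fit, which it approximates closely):

    import numpy as np
    from numpy.polynomial import Chebyshev

    # degree-8 polynomial approximation to exp on [-1, 1], interpolated at Chebyshev points
    approx = Chebyshev.interpolate(np.exp, 8, domain=[-1, 1])

    grid = np.linspace(-1, 1, 10001)
    worst = np.max(np.abs(np.exp(grid) - approx(grid)))
    print(worst)              # the (near-)minimax error over the interval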
Indeed it is. Let's catch up more sometime. Here's my email again: srean.list on gmail.
Totally agreed there is a lot to like about L2, but there are plenty of situations where it is terrible (in fact some of them are on one topic of your interest: monitoring server farms). In some of those situations L1 or a combination of L1 and L2 helps a lot.
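One common such combination is the Huber loss, which is quadratic (L2) for small residuals and linear (L1) for large ones. A minimal sketch assuming numpy (the threshold delta is arbitrary; scikit-learn's HuberRegressor packages the same idea for regression):

    import numpy as np

    def huber(r, delta=1.0):
        # quadratic near zero (L2-like), linear in the tails (L1-like)
        small = np.abs(r) <= delta
        return np.where(small, 0.5 * r**2, delta * (np.abs(r) - 0.5 * delta))

    r = np.linspace(-5, 5, 11)
    print(huber(r))           # grows like r^2 near 0 and like |r| far out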
Unfortunately it's not true in many cases I have seen, essentially because the tail of the error does not fall fast enough. Technically, for all bounded RVs all moments are bounded, but in some of these situations the variance is so high that it's infinite for practical purposes.
All you need is the tail to fall slower than a quadratic.