
> Yeah sure its a projection but so what. Why and when should a L2 norm projection be useful.

We want the answer, and more generally a predictive model.

We use the data we have to define a vector subspace. We use that subspace to creep up on the answer we want. Finally, to get the answer, we project onto the subspace we have settled on. We CAN do the projection even though we don't have the answer. We take that projection onto the subspace, that is, that point in the subspace, as our approximation to the answer.

We use L^2 because, by the Pythagorean theorem, that projection is the closest point of the subspace in the L^2 norm.

Being close in L^2 is a good thing: L^2 is a Hilbert space, so every Cauchy sequence converges in L^2, and a sequence that converges in L^2 has a subsequence that converges almost surely, that is, with probability 1, exactly, etc., and in practice it's exactly. A sequence that converges in L^2 but not exactly is pathological.

Then as we use more variables, the vector subspace gets closer to the answer and we get a better projection, although maybe not a better-fitting model.
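The projection story above can be made concrete with a few lines of numpy (assumed available); the data here are made-up toy numbers. The fitted values are exactly the orthogonal projection of the response onto the column space of the design matrix, which we can check by verifying the residual is perpendicular to every column:

```python
import numpy as np

# Toy design matrix: intercept plus two explanatory variables; discrete data is fine.
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 3.0],
              [1.0, 4.0, 2.0]])
y = np.array([1.0, 2.0, 2.0, 5.0, 4.0])

# Normal equations: X'X b = X'y.  The fitted values X b are the orthogonal
# (L^2) projection of y onto the column space of X.
b = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ b

# Orthogonality: the residual is perpendicular to every column of X.
print(X.T @ (y - y_hat))
```

Nothing in this computation uses a distribution at all; it is pure linear algebra.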

To me, tails of distributions have next to nothing to do with it or the use of L^2.

In particular, the basic regression math, the normal equations, the Pythagorean theorem

total sum of squares = regression sum of squares + error sum of squares

makes no assumptions at all about distributions (one might argue that we need each random variable X to be in L^2, that is, to have E[X^2] finite, or maybe at least in L^1, that is, E[|X|] finite) and certainly does not assume a Gaussian.
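That sum-of-squares identity can be verified numerically on made-up toy data with a small numpy sketch (the identity needs an intercept column in the design matrix):

```python
import numpy as np

# Fit by the normal equations on toy data with an intercept column.
X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.], [1., 4.]])
y = np.array([1., 2., 2., 5., 4.])
b = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ b

tss = np.sum((y - y.mean())**2)          # total sum of squares
rss = np.sum((y_hat - y.mean())**2)      # regression sum of squares
ess = np.sum((y - y_hat)**2)             # error (residual) sum of squares
print(np.isclose(tss, rss + ess))        # the Pythagorean decomposition
```

No distributional assumption enters anywhere; the identity is just orthogonality of the residual to the fitted values.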

Moreover, the matrix in the normal equations holds variances and covariances, but these are close to just the inner product in the Hilbert space L^2 and do NOT assume a Gaussian or anything about a distribution except merely that the expectations exist, which in practice is a meager assumption, essentially always the case.

Just because we are working with variances and covariances does NOT mean that we are assuming a Gaussian.

Also notice that nowhere do we need to assume that our data is not discrete. Discrete data is still fine.

If in addition we do have an appropriate Gaussian assumption, then at the end of the usual normal-equations work we can get an F ratio and some t-tests that usually help evaluate the quality of the fit and the importance of the coefficients.

And for the model, we can get confidence intervals on the predicted values -- sure, we will need a Gaussian assumption here, but there might be other approaches, and otherwise we can hope for or look for robust estimators.

L^2's fine with me.

More can be done, but the above is the first cut reason for using subspaces and L^2 projections although this explanation is rarely taught to students.

When ordinary regression doesn't work very well, then I'm reluctant to strain with much more. I might be willing in some circumstances to use what L. Breiman (I respect Breiman, a Loeve student) has in his Classification and Regression Trees.



No no no, you are thinking about this one a little wrong.

Yes, the Pythagorean identity, or Pythagorean decomposition, whatever you want to call it, the orthogonality properties, etc., make L2 super convenient to apply and reason about, but that does not mean it is a suitable performance metric to use. This is again the phenomenon of searching for the answer where it's well lit versus where the answer actually lies.

The problem really lies in the tails of the errors, although it might not be immediately apparent. You say variance, covariance, etc., but it does not take much for RVs not to possess them. Think about it: Chebyshev's inequality makes clear what tail decay you need for those to exist.

Yes, you are right that the choice of L2 by itself makes no assumptions. But once you start analyzing the performance of the estimator it will show up, particularly when the Fisher information is no longer strongly convex; Cramer-Rao becomes a vacuous bound in that case. The other problem is that L2 balls become infinitely larger than, say, the L1 ball as dimensionality increases, so L2 error stops being very good at localizing a vector.
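The ball-volume point can be checked directly from the standard closed-form volumes of the unit L^1 and L^2 balls in R^d; a small Python sketch:

```python
import math

# Volumes of the unit balls in R^d:
#   L^1 ball: 2^d / d!        L^2 ball: pi^(d/2) / Gamma(d/2 + 1)
def l1_volume(d):
    return 2.0**d / math.factorial(d)

def l2_volume(d):
    return math.pi**(d / 2) / math.gamma(d / 2 + 1)

for d in (2, 10, 50):
    print(d, l2_volume(d) / l1_volume(d))  # ratio explodes with dimension
```

The ratio is about 1.57 at d = 2 and already in the thousands by d = 10, so an L2 ball of a given radius contains vastly more volume than the L1 ball as d grows.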

The problem is that L2 is not a good metric to measure inaccuracy with; it is super convenient to use, though.

Let me demonstrate one of its well-known problems. Say the error residual has tails that do not decay as fast as an exponential (the MGF does not exist around 0). You can then show that the accuracy can be arbitrarily bad. No need for super ugly densities; something as benign as just a mixture of two Gaussians with the same expectation will break it. I am just restating Huber's robustness argument here in a different way.
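A quick simulation in the spirit of Huber's contamination argument (numpy assumed available; the mixture weights and scales are illustrative choices, not anything canonical): contaminate a standard Gaussian with 10% of a wide Gaussian that has the same mean, then compare how well the L^2-optimal sample mean and the L^1-flavored sample median recover the true center 0:

```python
import numpy as np

rng = np.random.default_rng(0)

def contaminated_sample(n):
    # 90% N(0, 1), 10% N(0, 10^2): same expectation, benign density, fat spread.
    flip = rng.random(n) < 0.1
    return np.where(flip, rng.normal(0, 10, n), rng.normal(0, 1, n))

# Spread of the sample mean vs the sample median as estimators of 0.
means, medians = [], []
for _ in range(2000):
    x = contaminated_sample(200)
    means.append(x.mean())
    medians.append(np.median(x))

print(np.std(means), np.std(medians))
```

The mean's spread is several times the median's: a modest contamination with the same expectation is enough to degrade the L^2-optimal estimator badly.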

Gauss himself was well aware of the problem and chose L2 for convenience. Even his peers were well aware of situations where L1 made more sense than L2. Unfortunately, optimization theory was not developed enough at the time to use L1-based methods. But now we can.

BTW, you might find it interesting that the Pythagorean property is not limited to squared L2. The set of all 'divergences' for which it holds is the Bregman divergence family; it's an if-and-only-if condition. This is a result that came from the ML community, though one side of the implication was already known. Bregman divergences are again the likelihood ratios of exponential family densities (also called the Darmois-Koopman family in older literature), hence the strong connections with the NP lemma as well. What I find mind-boggling is that Bregman divergences had their origin in the convex optimization literature! Their origins had absolutely nothing, nothing to do with probability and stats. It's amazing when two separate fields of math make contact in these ways.

BTW, if you don't mind sharing your suitably anonymized mail, I am srean.list at gmail. It will be a pleasure to discuss math with you. I find it very helpful when I can stress-test an idea against a human oracle. HN might not be a forum well suited for this.


Yes, now we can also do best L^1 approximations. IIRC -- I'm in a hurry this morning -- it's a linear programming problem or some such, but I haven't thought about it in decades.
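It is indeed a linear program: minimize the sum of absolute residuals by introducing one slack variable per observation. A minimal sketch with scipy (assumed available) on made-up data with a deliberate outlier:

```python
import numpy as np
from scipy.optimize import linprog

# L^1 (least absolute deviations) regression as an LP:
#   minimize sum_i t_i  subject to  -t_i <= (X b - y)_i <= t_i
X = np.column_stack([np.ones(6), np.arange(6.0)])
y = np.array([0.0, 1.1, 1.9, 3.2, 4.0, 20.0])    # last point is an outlier

n, p = X.shape
c = np.concatenate([np.zeros(p), np.ones(n)])    # objective: sum of slacks t
A_ub = np.block([[ X, -np.eye(n)],
                 [-X, -np.eye(n)]])
b_ub = np.concatenate([y, -y])
bounds = [(None, None)] * p + [(0, None)] * n    # b free, t >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
b_l1 = res.x[:p]
print(b_l1)
```

The L^1 fit recovers a slope near 1 from the five well-behaved points, largely ignoring the outlier, whereas an L^2 fit would be dragged toward it.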

You bring up a lot of stuff I've never reviewed.

> but that does not mean it is a suitable performance metric to use.

If the plane we do the L^2 projection onto is really close, then the error is really small, and what's not to like? Really close is good enough for me.

Right, there are three biggie choices: L^1 (minimize the sum of absolute values of the errors), L^2 (minimize the sum of squared errors), and L^infinity (minimize the worst error).

L^2's fine with me, good enough for government work, okay first cut, day in and day out.

If in some particular case L^2 has some problems, then maybe something like regression is not the right tool for the problem.

Gee, we do a lot of L^2: we get orthogonal components so that we can take L^2 cafeteria style, just picking the components we want. E.g., in filtering stochastic processes, we take a Fourier transform -- that is, finding the coefficients of the sample path on the orthogonal sine-cosine components. Or we do a convolution, which is the same thing in the end. The approximation we get is an L^2 approximation.
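Low-pass filtering really is an L^2 projection in this sense, and the energies add by Pythagoras/Parseval. A minimal numpy sketch on a made-up two-tone signal:

```python
import numpy as np

# Low-pass filtering as an L^2 projection onto the low-frequency
# sine-cosine components (an orthogonal basis, up to scaling).
t = np.arange(256) / 256.0
x = np.sin(2 * np.pi * 3 * t) + 0.3 * np.sin(2 * np.pi * 60 * t)

c = np.fft.rfft(x)
c[10:] = 0.0                       # keep only the first 10 frequency bins
x_low = np.fft.irfft(c, n=len(x))

# The residual lives entirely in the discarded frequencies, so it is
# orthogonal to x_low and the energies add (Pythagoras / Parseval):
lhs = np.sum(x**2)
rhs = np.sum(x_low**2) + np.sum((x - x_low)**2)
print(np.isclose(lhs, rhs))
```

The cafeteria-style picking is literal here: zeroing Fourier coefficients is choosing which orthogonal components to keep.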

So, with JPG -- sure, it's L^2. Right, JPG does funny stuff near lines. Life's not perfect!

For some things, sure we want L^infinity: E.g., if we want to use a quotient of polynomials to approximate the usual special functions, then we want to minimize the worst error and do what is called Chebyshev approximation, but this is very specialized.
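For a flavor of the L^infinity world: interpolation at Chebyshev points gives a near-minimax polynomial approximation (the true minimax polynomial needs the Remez algorithm, which numpy does not ship). A small sketch using numpy's Chebyshev helpers, with exp on [-1, 1] as the example function:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

# Degree-8 interpolation of exp at the Chebyshev points of [-1, 1];
# near-minimax, i.e., close to the best L^infinity approximation.
coef = C.chebinterpolate(np.exp, 8)

# Estimate the worst-case error on a dense grid.
xs = np.linspace(-1, 1, 10001)
err = np.max(np.abs(C.chebval(xs, coef) - np.exp(xs)))
print(err)
```

The worst-case error for degree 8 is already far below 1e-6, and the error curve nearly equioscillates, which is the signature of a (near-)minimax fit.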


> it's a linear programming problem

Indeed it is. Let's catch up more sometime. Here's my email again: srean.list on gmail.

Totally agreed there is a lot to like about L2, but there are plenty of situations where it is terrible (in fact, some of them are on one topic of your interest: monitoring server farms). In some of those situations L1, or a combination of L1 and L2, helps a lot.


> You say variance covariance etc etc, but it does not take much for RVs not to possess them

In practice, essentially always, for real random variables X, Y, all of the following exist and are finite:

E[X], E[|X|], E[X^2], E[XY]

Var(X) = E[(X - E[X])^2]

Std(X) = Var(X)^(1/2)

Cov(X,Y) = E[(X - E[X]) (Y - E[Y])]

Cor(X,Y) = Cov(X,Y)/(Std(X) Std(Y))

E[Y|X]

E[X] and Std(X) are useful quite broadly, and the fact that we find/estimate them does not mean that we are working with a Gaussian distribution.

If the above is not true, then there is something bizarre and/or pathological, the ultimate edge case, and we need to review what we are doing.
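All the quantities listed above exist and are computable for plainly discrete, non-Gaussian data; a tiny numpy sketch with a fair die and its square:

```python
import numpy as np

# A fair die: discrete, bounded, nowhere near Gaussian; every listed
# quantity exists.  Outcomes 1..6, equally likely.
x = np.arange(1.0, 7.0)
y = x**2

ex, ey = x.mean(), y.mean()                       # E[X] = 3.5, E[Y] = 91/6
var_x = np.mean((x - ex)**2)                      # Var(X) = 35/12
cov_xy = np.mean((x - ex) * (y - ey))             # Cov(X, Y)
cor_xy = cov_xy / np.sqrt(var_x * np.mean((y - ey)**2))
print(ex, var_x, cov_xy, cor_xy)
```

Nothing here appeals to any distributional family; the moments are just averages against the (uniform, discrete) distribution.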


Unfortunately, it's not true in many cases I have seen, essentially because the tail of the error does not fall fast enough. Technically, all moments of a bounded RV are finite, but in some of these situations the variance is so high that it is infinite for practical purposes.

All you need is for the tail to fall slower than a quadratic.
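The quadratic-tail threshold can be made concrete with a Pareto tail P(X > t) = t^(-alpha) for t >= 1, where the second moment truncated at T has the closed form alpha/(2 - alpha) * (T^(2 - alpha) - 1) for alpha != 2; for alpha <= 2 it grows without bound as T increases. A minimal sketch, with alpha = 1.5 as an illustrative choice:

```python
# Pareto tail P(X > t) = t**(-alpha) for t >= 1, so the pdf is
# alpha * x**(-alpha - 1), and the truncated second moment is
#   E[X^2; X <= T] = alpha / (2 - alpha) * (T**(2 - alpha) - 1)   (alpha != 2)
def truncated_second_moment(alpha, T):
    return alpha / (2 - alpha) * (T**(2 - alpha) - 1)

for T in (1e2, 1e4, 1e6):
    print(T, truncated_second_moment(1.5, T))   # keeps growing: no finite variance
```

With alpha = 1.5 the truncated second moment grows like 3 * sqrt(T): the variance is infinite even though every finite sample has a finite sample variance, which is exactly the "infinite for practical purposes" situation.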



