> Yeah sure it's a projection, but so what? Why and when should an L2 norm projection be useful?
We want the answer, and more generally a predictive model.

We use the data we have to define a vector subspace. We use that subspace to creep up on the answer we want. Finally, to get the answer, we project onto the subspace we have settled on. We CAN do the projection even though we don't have the answer. We take that projection onto the subspace, that point in that subspace, as our approximation to the answer.
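For concreteness, a minimal sketch of that projection (assuming numpy, with made-up data); the fitted values are the orthogonal projection of y onto the column space of X, and the residual is orthogonal to that subspace:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 200, 3
    X = rng.normal(size=(n, p))          # columns span the subspace we settle on
    y = rng.normal(size=n)               # the "answer" we only observe noisily

    # Projection of y onto the column space of X (least squares).
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta                     # the point in the subspace we take as our approximation

    # The residual is orthogonal to every column of X (up to round-off).
    print(X.T @ (y - y_hat))             # ~ [0, 0, 0]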
We use L^2 because that projection, with the Pythagorean theorem, is the closest in L^2.
Being close in L^2 is a good thing: L^2 is a Hilbert space, so a Cauchy sequence converges in L^2, and then there is a subsequence that converges almost surely, with probability 1, exactly, etc., and in practice it's exactly. A sequence that converges in L^2 but not exactly is pathological.
Then as we use more variables, the vector subspace gets closer to the answer and we get a better projection, although maybe not a better-fitting model.
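A small illustration of that point (a sketch assuming numpy, purely synthetic data): adding even pure-noise columns always shrinks the in-sample residual, so the projection gets "better", but it does nothing for prediction on fresh data:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100
    x = rng.normal(size=(n, 1))
    y = 2.0 * x[:, 0] + rng.normal(size=n)            # the truth uses one variable
    x_new = rng.normal(size=(n, 1))
    y_new = 2.0 * x_new[:, 0] + rng.normal(size=n)

    for extra in (0, 10, 50):
        junk = rng.normal(size=(n, extra))            # irrelevant extra columns
        X = np.hstack([x, junk])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        in_sample = np.sum((y - X @ beta) ** 2)
        X_new = np.hstack([x_new, rng.normal(size=(n, extra))])
        out_sample = np.sum((y_new - X_new @ beta) ** 2)
        print(extra, round(in_sample, 1), round(out_sample, 1))
    # in-sample SSE always goes down; out-of-sample SSE does not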
To me, tails of distributions have next to nothing to do with it or the use of L^2.
In particular, the basic regression math, the normal equations, the Pythagorean theorem

total sum of squares = regression sum of squares + error sum of squares

makes no assumptions at all about distributions (one might want to argue that the variables need to be in L^2, that is, for each random variable X have E[X^2] finite, or maybe at least E[|X|] finite, that is, L^1) and certainly does not assume a Gaussian.
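A quick numerical check of that decomposition (a sketch assuming numpy, with deliberately non-Gaussian made-up data; an intercept column is included so the usual centered identity applies):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 500
    x = rng.exponential(size=(n, 2))                  # deliberately non-Gaussian regressors
    y = 1.0 + x @ np.array([0.5, -2.0]) + rng.standard_t(df=5, size=n)   # non-Gaussian errors

    X = np.column_stack([np.ones(n), x])              # design matrix with intercept
    beta = np.linalg.solve(X.T @ X, X.T @ y)          # the normal equations
    y_hat = X @ beta

    tss = np.sum((y - y.mean()) ** 2)                 # total sum of squares
    rss = np.sum((y_hat - y.mean()) ** 2)             # regression sum of squares
    ess = np.sum((y - y_hat) ** 2)                    # error sum of squares
    print(np.isclose(tss, rss + ess))                 # True, by the Pythagorean theorem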
Moreover the matrix in the normal equations has variances and covariances, but these are close to just the inner product in the Hilbert space L^2 and do NOT assume a Gaussian or anything about a distribution except merely that the expectations exist, which in practice is a meager assumption, essentially always the case.
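For instance (a sketch assuming numpy): after centering, the matrix in the normal equations is just a matrix of inner products, i.e. the sample covariance matrix up to scale, and no distributional assumption enters:

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.uniform(size=(1000, 4))        # any distribution with finite second moments
    Xc = X - X.mean(axis=0)                # center the columns

    gram = Xc.T @ Xc / (len(X) - 1)        # (scaled) inner products from the normal equations
    print(np.allclose(gram, np.cov(X, rowvar=False)))   # True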
Just because we are working with variances and covariances does NOT mean that we are assuming a Gaussian.
Also notice that nowhere do we need to assume that our data is not discrete. Discrete data is still fine.
If in addition we do have an appropriate Gaussian assumption, then at the end of the usual normal equations stuff, we can get an F ratio and some t-tests that usually can help evaluate the quality of the fitting and the importance of the coefficients.
And for the model, we can get confidence intervals on the predicted values -- sure, we will need a Gaussian assumption here, but there might be other approaches and otherwise we can hope or look for robust estimators.
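As a concrete sketch of both points (assuming numpy and statsmodels, with made-up data): under the Gaussian model the fitted OLS results carry the F ratio, the per-coefficient t-tests, and confidence intervals for the predicted values:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    x = rng.normal(size=(200, 2))
    y = 1.0 + x @ np.array([2.0, 0.0]) + rng.normal(size=200)   # second slope is truly 0

    X = sm.add_constant(x)
    res = sm.OLS(y, X).fit()
    print(res.fvalue, res.f_pvalue)          # overall F ratio for the regression
    print(res.tvalues, res.pvalues)          # t-tests on the individual coefficients

    # confidence intervals for the predicted (mean) values at new design points
    X_new = sm.add_constant(rng.normal(size=(5, 2)))
    print(res.get_prediction(X_new).conf_int(alpha=0.05))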
L^2's fine with me.
More can be done, but the above is the first-cut reason for using subspaces and L^2 projections, although this explanation is rarely taught to students.
When ordinary regression doesn't work very well, then I'm reluctant to strain with much more. I might be willing in some circumstances to use what L. Breiman (I respect Breiman, a Loève student) has in his Classification and Regression Trees.
No no no, you are thinking about this one a little wrong.
Yes, the Pythagorean identity, or Pythagorean decomposition, whatever you want to call it, and the orthogonality properties etc. make L2 super convenient to apply and reason about, but that does not mean it is a suitable performance metric to use. This is again the phenomenon of searching for the answer where it's well lit vs. where the answer lies.
The problem really lies in the tails of the errors, although it might not be immediately apparent. You say variances, covariances, etc., but it does not take much for RVs not to possess them. Think about it: Chebyshev's inequality will make it clear what tail decay you need for those to exist.
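To see how little it takes, here is a sketch (assuming numpy; the Pareto tail index 1.5 is an arbitrary choice) of a distribution whose mean exists but whose tail falls slower than 1/t^2, so the second moment is infinite and the sample variance never settles down:

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.pareto(1.5, size=10**6)                # tail ~ t^(-1.5): E|X| finite, E[X^2] infinite
    for n in (10**3, 10**4, 10**5, 10**6):
        print(n, x[:n].var())                      # keeps jumping around instead of converging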
Yes, you are right that the choice of L2 by itself makes no assumptions. But once you start analyzing the performance of the estimator it will show up, particularly when the Fisher information is degenerate (the log-likelihood is no longer strongly concave); Cramér-Rao becomes a vacuous bound in that case. The other problem is that L2 balls become infinitely larger than, say, the L1 ball as the dimensionality increases, so L2 error stops being very good at localizing a vector.
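The volume point is easy to check numerically (a sketch using only Python's standard library; the dimensions listed are arbitrary). The unit L2 ball has volume pi^(d/2)/Gamma(d/2 + 1) and the unit L1 ball has volume 2^d/d!, and their ratio blows up with the dimension d:

    import math

    for d in (2, 5, 10, 20, 50):
        vol_l2 = math.pi ** (d / 2) / math.gamma(d / 2 + 1)   # unit Euclidean ball
        vol_l1 = 2 ** d / math.factorial(d)                   # unit cross-polytope (L1 ball)
        print(d, vol_l2 / vol_l1)                             # grows without bound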
The problem is that L2 is not a good metric to measure inaccuracy with; it is super convenient to use, though.
Let me demonstrate one of its well-known problems. Say the error residual has tails that do not decay as fast as an exponential (the MGF does not exist in a neighborhood of 0). You can then show that the accuracy can be arbitrarily bad. No need for super ugly densities: something as benign as a mixture of two Gaussians with the same expectation will break it.
I am just restating Huber's robustness argument here in a different way.
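A small simulation in that spirit (a sketch assuming numpy; the contamination fraction and scale are arbitrary): the noise is a two-component Gaussian mixture with the same zero mean, and as the rare component's scale grows, the L2-optimal location estimate (the sample mean) degrades while the L1-optimal one (the sample median) barely notices:

    import numpy as np

    rng = np.random.default_rng(6)
    n, trials = 100, 2000
    for scale in (1.0, 10.0, 100.0):
        # mixture: 90% N(0,1), 10% N(0, scale^2) -- same expectation in both components
        wide = rng.random((trials, n)) < 0.1
        noise = np.where(wide, rng.normal(0, scale, (trials, n)), rng.normal(0, 1, (trials, n)))
        mse_mean = np.mean(noise.mean(axis=1) ** 2)            # L2 estimate of the (zero) location
        mse_median = np.mean(np.median(noise, axis=1) ** 2)    # L1 estimate
        print(scale, mse_mean, mse_median)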
Gauss himself was well aware of the problem and chose L2 for convenience. Even his peers were well aware of situations where L1 made more sense than L2. Unfortunately, optimization theory was not developed enough at that time to make L1-based methods practical. But now we can.
BTW you might find it interesting that the Pythagorean property is not limited to squared L2. The set of all 'divergences' for which it holds is the Bregman divergence family; it's an if-and-only-if condition. This is a result that came from the ML community, though one side of the implication was already known. Bregman divergences correspond to the log-likelihoods of exponential family densities (also called the Darmois-Koopman family in older literature), hence they have strong connections with the NP lemma as well. What I find mind-boggling is that Bregman divergences had their origin in the convex optimization literature! Their origins had absolutely nothing, nothing to do with probability and stats. It's amazing when two separate fields of math make contact in these ways.
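For concreteness, a tiny sketch (assuming numpy, made-up points) of the Bregman divergence D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>: with phi the squared norm it reduces to squared Euclidean distance, and with the negative entropy it gives the (generalized) KL divergence:

    import numpy as np

    def bregman(phi, grad_phi, x, y):
        # D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>
        return phi(x) - phi(y) - grad_phi(y) @ (x - y)

    x = np.array([0.2, 0.5, 0.3])
    y = np.array([0.4, 0.4, 0.2])

    # phi = squared Euclidean norm  ->  squared Euclidean distance
    sq = bregman(lambda v: v @ v, lambda v: 2 * v, x, y)
    print(np.isclose(sq, np.sum((x - y) ** 2)))                      # True

    # phi = negative entropy  ->  generalized KL divergence
    kl = bregman(lambda v: np.sum(v * np.log(v)),
                 lambda v: np.log(v) + 1, x, y)
    print(np.isclose(kl, np.sum(x * np.log(x / y) - x + y)))         # True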
BTW if you don't mind sharing your suitably anonymized mail, I am srean.list at gmail. It will be a pleasure to discuss math with you. I find it very helpful when I can stress-test an idea against a human oracle. HN might not be a forum well suited for this.
Yes, now we can also do best L^1 approximations. IIRC -- I'm in a hurry this morning -- it's a linear programming problem or some such, but I haven't thought about that in decades.
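That recollection is right: minimizing the sum of absolute residuals can be set up as a linear program with one slack variable per residual. A minimal sketch (assuming numpy and scipy, made-up data):

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(7)
    n, p = 100, 3
    X = rng.normal(size=(n, p))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_cauchy(size=n)  # heavy-tailed noise

    # minimize sum(t) subject to -t <= y - X b <= t, with variables z = (b, t)
    c = np.concatenate([np.zeros(p), np.ones(n)])
    A_ub = np.block([[-X, -np.eye(n)],
                     [ X, -np.eye(n)]])
    b_ub = np.concatenate([-y, y])
    bounds = [(None, None)] * p + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    print(res.x[:p])          # the L^1 (least absolute deviations) coefficients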
You bring up a lot of stuff I've never reviewed.
> but that does not mean it is a suitable performance metric to use.
If the plane we do the L^2 projection onto is really close, then the error is really small, and what's not to like? Really close is good enough for me.
Right, there are three biggie choices: L^1 (minimize the sum of absolute values of the errors), L^2 (minimize the sum of squared errors), and L^infinity (minimize the worst error). L^2's fine with me, good enough for government work, okay first cut, day in and day out.
If in some particular case L^2 has some problems, then maybe something like regression is not the right tool for the problem.
Gee, we do a lot of L^2: We get orthogonal components so that we can get L^2 cafeteria style, just pick the components we want.
E.g., in filtering stochastic processes, we take a Fourier transform -- that is, finding the coefficients of the sample path on the sine-cosine orthogonal components. Or we do a convolution, which is the same thing in the end. The approximation we get is an L^2 approximation.
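A rough sketch of that (assuming numpy, synthetic signal): keeping only some of the Fourier coefficients and transforming back is an orthogonal projection, so by Parseval it is the best L^2 approximation among signals built from the kept frequencies:

    import numpy as np

    rng = np.random.default_rng(8)
    n = 1024
    t = np.arange(n) / n
    signal = np.sin(2 * np.pi * 5 * t) + 0.5 * rng.normal(size=n)   # slow sine plus noise

    coef = np.fft.rfft(signal)
    coef[20:] = 0.0                          # keep only the low-frequency components
    smoothed = np.fft.irfft(coef, n)         # orthogonal projection onto that span

    # Pythagorean check: ||signal||^2 = ||smoothed||^2 + ||signal - smoothed||^2
    print(np.isclose(np.sum(signal**2),
                     np.sum(smoothed**2) + np.sum((signal - smoothed)**2)))   # True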
So, with JPG -- sure, it's L^2. Right, JPG does funny stuff near lines. Life's not perfect!
For some things, sure, we want L^infinity: e.g., if we want to use a quotient of polynomials to approximate the usual special functions, then we want to minimize the worst error and do what is called Chebyshev approximation, but this is very specialized.
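A quick sketch of that flavor (assuming numpy only; the function and degree are arbitrary, and interpolation at Chebyshev points stands in for a true minimax/Remez fit, which it approximates closely):

    import numpy as np
    from numpy.polynomial import Chebyshev

    # degree-8 polynomial approximation to exp on [-1, 1], interpolated at Chebyshev points
    approx = Chebyshev.interpolate(np.exp, 8, domain=[-1, 1])

    grid = np.linspace(-1, 1, 10001)
    worst = np.max(np.abs(np.exp(grid) - approx(grid)))
    print(worst)              # the (near-)minimax error over the interval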
Indeed it is. Let's catch up more sometime. Here's my email again: srean.list on gmail.
Totally agreed there is a lot to like about L2, but there are plenty of situations where it is terrible (in fact some of them are on one topic of your interest: monitoring server farms). In some of those situations L1 or a combination of L1 and L2 helps a lot.
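One common such combination is the Huber loss, which is quadratic (L2) for small residuals and linear (L1) for large ones. A minimal sketch assuming numpy (the threshold delta is arbitrary; scikit-learn's HuberRegressor packages the same idea for regression):

    import numpy as np

    def huber(r, delta=1.0):
        # quadratic near zero (L2-like), linear in the tails (L1-like)
        small = np.abs(r) <= delta
        return np.where(small, 0.5 * r**2, delta * (np.abs(r) - 0.5 * delta))

    r = np.linspace(-5, 5, 11)
    print(huber(r))           # grows like r^2 near 0 and like |r| far out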
Unfortunately it's not true in many cases I have seen, essentially because the tail of the error does not fall fast enough. Technically, for all bounded RVs all moments are bounded, but in some of these situations the variance is so high that it's infinite for practical purposes.
All you need is the tail to fall slower than a quadratic.