Hacker News | justk's comments

From (1): "On the other hand, if the variation between the group means and the grand mean is small, and the variation within groups is large, this suggests there are no real differences in the group means, i.e. the variation we observe is just sampling variation."

The above is in the context of analysis of variance. In our example the means in the two states are 0.55 and 0.45 and the grand mean is 0.50, so the between-group term is small, while the variances within the red and blue states are both 0.2475, so the within-group term is large. Hence the variation we observe is just sampling variation, the state factor is not important, and that explains the low R^2 value. Note that in each state the model's predicted value is that group's mean. So analysis of variance explains that the OP's result is not a paradox or anything strange.

https://saestatsteaching.tech/analysis-of-variance
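For concreteness, here is a small Python sketch (not from the article; the 11/9 and 9/11 splits per state are the ones used in the thread's R examples) checking this variance decomposition:

```python
# The 40-voter example: state 0 splits 11/9 against, state 1 splits 9/11.
pref = [0] * 11 + [1] * 9 + [0] * 9 + [1] * 11
groups = [pref[:20], pref[20:]]

def mean(xs):
    return sum(xs) / len(xs)

grand_mean = mean(pref)                    # 0.5
group_means = [mean(g) for g in groups]    # [0.45, 0.55]

# Population variances (sums of squares divided by n = 40).
n = len(pref)
between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups) / n
within = sum(sum((y - mean(g)) ** 2 for y in g) for g in groups) / n
total = sum((y - grand_mean) ** 2 for y in pref) / n

# between = 0.0025, within = 0.2475, total = 0.25
r_squared = between / total                # share of variance explained: 0.01
```

The between-group part (0.0025) is tiny next to the within-group part (0.2475), which is exactly why R^2 = 0.0025/0.25 = 0.01.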


The math is correct, but I think the model used is not the right one, since it doesn't reflect that the variable s is dichotomous; a mixed model should rather be used. If we insist on treating s as continuous, consider this example: s = state is encoded as a continuous variable between -1 and 1, where people change state frequently; s = -1 means the person will vote in the blue state with probability 1, s = 1 that the person will vote in the red state with probability 1, and s = 0 that the person is equally likely to vote in either state. When s is near zero the model cannot predict the voter's preference, and that is the reason for the low predictive power of this model with a continuous s. The extreme cases s = -1 and s = 1 could be rare in populations that move between states frequently, so the initial intuition is misled into this paradox.


This.

R2 is not the correct measure to use.

This article is a perfect example of the principle that simply doing math and getting results is not necessarily meaningful.


R² is a measure like any other. In this case it measures the relative reduction in MSE - which is low because the prediction of individual votes remains quite bad even if the state is taken into account.

Does another measure give substantially different results?


I think that you are using a different definition of R^2 here; for example, the way you are thinking of R^2 leaves no room for the constant term used in the linear model, which the formula for R^2 assumes. What you have in mind is R^2 = 1 - mean(variance in each state)/(total variance), but that is not the definition of R^2 for a linear model.

As the user fskfsk.... says in another comment, here the constant term explains a lot of the variance, so the slope term carries less information; that is not captured by your definition or idea of R^2.


> I think that you are using here a different definition of R^2

Different from what?

According to wikipedia:

The most general definition of the coefficient of determination is R^2 = 1 - SS_res / SS_tot ( = 1 - 0.2475 / 0.25 = 0.01 in this case)

Edit to clarify the definition above:

SS_res is the sum of squares of residuals (also called the residual sum of squares) ∑( y_i - predicted_i )^2

SS_tot is the total sum of squares (proportional to the variance of the data): ∑( y_i - ȳ )^2 with ȳ = ∑ y_i / N


The most general definition of R^2 can produce a negative result, and we are talking about a paradox related to the values of R^2 one should expect, so it is common to restrict attention to linear models and linear regression. I don't know whether the variance of the total population can be computed as the sum of the variances in each state, and state is not a continuous variable.

The population variance is the sum of the Between Group Variance and the Within Group Variance weighted by the number of elements in each group.
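This decomposition (the law of total variance, with population variances and groups weighted by size) can be checked numerically. A sketch with a small hypothetical dataset of unequal group sizes, just for illustration:

```python
# Var(Y) = sum_i w_i * Var_i + sum_i w_i * (m_i - m)^2, with w_i = n_i / n.
groups = [[0, 0, 1], [1, 1, 1, 0, 0]]   # hypothetical groups of unequal size
all_y = [y for g in groups for y in g]
n = len(all_y)

def pvar(xs):
    """Population variance: sum of squares divided by n (no n/(n-1) factor)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

grand = sum(all_y) / n
within = sum(len(g) / n * pvar(g) for g in groups)
between = sum(len(g) / n * (sum(g) / len(g) - grand) ** 2 for g in groups)

# The size-weighted within and between parts add up to the total variance.
assert abs(pvar(all_y) - (within + between)) < 1e-12
```

Note this identity holds for population variances (divide by n); it breaks if each piece gets its own n/(n-1) correction, which is relevant further down the thread.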


I don't understand what you mean. I'll just note that the value of R² in this case is 1% as the blog post explains and the code below confirms.

  > data <- data.frame(state = rep(c(0, 1), each=20), pref = c(rep(0, 11), rep(1, 9), rep(0, 9), rep(1, 11)))
  > summary(lm(pref ~ state, data = data))$r.squared
  [1] 0.01


The math is correct, I am referring to your comment: >> R² is a measure like any other. In this case it measures the relative reduction in MSE - which is low because the prediction of individual votes remains quite bad even if the state is taken into account.

I may be reading too much into your comment, but it seems you relate R^2 to the reduction in the prediction error in each state, i.e. that you are computing R^2 from (average variance in each state)/(total variance). I think that is not correct in general, since it would at least require the total variance to be the sum of the variances in each state. If you based your intuition on that formula, then your intuition is not correct; that is my point. When I apply R^2 I am thinking of a multivariable linear model with continuous variables, and this is not such a case. I would measure this problem by how the entropy changes when we use the information about the state, something like the cross entropy between the total distribution and the distribution by states.


When I wrote "In this case it measures the relative reduction in MSE" I meant exactly that.

The mean squared error of the baseline model which doesn't include the state as a regressor is 0.25 (it predicts always 0.5 - it's off by 0.5 in every case).

The mean squared error of the model which includes the state as a regressor is 0.2475 (it predicts 0.45 or 0.55 depending on the state - in both cases it's off by 0.45 with 55% probability and it's off by 0.55 with 45% probability).

The mean squared error is directly related to the variance when the predictor is unbiased. The ratio of the sums of squares is the same as the ratio of the mean squared errors.
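For what it's worth, a quick Python check of that relation, using exactly the error mix described above:

```python
# R^2 as relative reduction in MSE: R^2 = 1 - MSE(model) / MSE(baseline).
mse_baseline = 0.5 ** 2    # always predict 0.5 -> off by 0.5 in every case

# Knowing the state, we predict 0.45 or 0.55: off by 0.45 with probability
# 0.55 and off by 0.55 with probability 0.45 (same in both states).
mse_model = 0.55 * 0.45 ** 2 + 0.45 * 0.55 ** 2    # 0.2475

r_squared = 1 - mse_model / mse_baseline           # 0.01
```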

Edit: http://brenocon.com/rsquared_is_mse_rescaled.pdf

"R2 can be thought of as a rescaling of MSE, comparing it to the variance of the outcome response."

https://dabruro.medium.com/you-mention-the-average-squared-e...

"Also it is worth mentioning that R-squared (coeff. of determination) is a rescaled version of MSE such that 100% is perfection and 0% implies the same MSE that you would get by simply always predicting the overall mean of the dataset."


Let d1 = data[data$state==0,] and d2 = data[data$state==1,]; then var(d1$pref) = 0.26, var(d2$pref) = 0.26 and var(data$pref) = 0.256 (using R and one of your dataframes). So the intuition is that knowing the state gives little information about the preferences of the voters, which suggests that any model based on state should give poor results, and so having R^2 = 0.01 is not a big paradox in this case.

There must be a formula to compute R^2 from the variances both between states and within states, but in any case, when the variances inside any state are bigger than the total variance, that should imply that the feature dividing the population into groups is of little value for prediction, so it should have a small R^2 value.


That seems more or less what I said in the comment you replied to: the prediction of individual votes remains quite bad even if the state is taken into account. That's why the relative reduction in MSE is low. That's why the R² is low. I don't think there is any paradox.

I was replying to someone who claimed that "R2 is not the correct measure to use. This article is a perfect example of the principle that simply doing math and getting results is not necessarily meaningful." I've not seen any comment from anyone getting "different results" with a different measure.

Edit: You used var(...) which includes a factor N/(N-1) and doesn't give exactly the total sum of squares divided by N.

The example dataframe contains 40 observations (20 per state) and you get a higher variance estimate for the subsamples than for the aggregate sample, but if you put together a few copies of the data (for example doing "data <- rbind(data, data, data, data, data)") even the adjusted (unbiased) estimator of the variance is lower for the states.

You can calculate the "exact" values yourself with mean((x - mean(x))^2) or by undoing the adjustment:

  > var(data$pref)*39/40
  [1] 0.25
  > var(data[data$state==0, "pref"])*19/20
  [1] 0.2475
  > var(data[data$state==1, "pref"])*19/20
  [1] 0.2475

> when the variances inside any state are bigger than the total variance

They are not. But you're right in that a small difference shows that dividing the population in groups is of little value for prediction and that's why the R^2 value is small.


The correcting factor n/(n-1) in R is what explains my paradox about the law of total variance Var(Y) = E(Var(Y|X)) + Var(E(Y|X)). I was obtaining results that didn't match this formula because I corrected all the variances with the factor 20/19, but the total variance should have the factor 40/39, just as you pointed out. Thanks for the comments and the correction.
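In Python terms (a sketch mirroring the R computation; pvar/svar are just hypothetical helper names for the two estimators):

```python
pref = [0] * 11 + [1] * 9 + [0] * 9 + [1] * 11
d1, d2 = pref[:20], pref[20:]    # the two states, 20 voters each

def pvar(xs):
    """Population variance: divide by n."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def svar(xs):
    """Sample variance, like R's var(): divide by n - 1."""
    return pvar(xs) * len(xs) / (len(xs) - 1)

# With population variances the law of total variance holds exactly:
# 0.25 = (0.2475 + 0.2475)/2 + 0.0025
lhs = pvar(pref)
rhs = (pvar(d1) + pvar(d2)) / 2 + ((0.45 - 0.5) ** 2 + (0.55 - 0.5) ** 2) / 2
assert abs(lhs - rhs) < 1e-12

# With per-group n/(n-1) corrections it appears to fail:
# svar(pref) ~ 0.2564 is smaller than svar(d1) ~ 0.2605.
```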

I just added another comment that relates analysis of variance to this post to show that there is no real paradox here.

Finally, the formula for the total variance above is related to my intuition that having some information (the data for each state) should make the mean of the variances in each group smaller than the total variance, because variance is related to lack of information. But analysis of variance suggests (see another comment of mine) that the state factor is not representative, because of the high variance in each group (each state) and the small difference between the group means and the total mean.


A mixed model is not relevant here. A simple linear regression with one variable will achieve exactly the same results. Coding it as -1 and 1 makes no difference compared to coding it as 0 and 1; you just stuff the rest into the intercept.

You would also want to be predicting 0.45 and 0.55 not 1 and 0 because we solve for squared error.


The reason for the apparent paradox: a) in this case the model is a mixed model; b) the variables are nominal, so you have to select one of the pseudo-R^2 measures. For more information: (1) Pseudo R-squared: https://en.wikipedia.org/wiki/Pseudo-R-squared (2) R squared for mixed models – the easy way: https://ecologyforacrowdedplanet.wordpress.com/2013/08/27/r-...

c) The R^2 used with a linear model requires a constant term; in this case the constant term (the bias) explains a lot about the preferences (almost 50/50), so there is less information left for the slope term.

Hope this helps.


@justk What you're talking about might make sense if there were more independent variables to consider, but in this case there's only one, state. So in fact you could say that there are two conditional linear models in the example; one for the first state (state=0), and one for the second (state=1). The model does the best job with the information available (state).


Sorry, I edited my post several times and finally chose a short form with links to other sources. If you fix state = 1 then there are no more random variables, so R^2 doesn't have any meaning. Just for fun: what should the model predict for state = 0.5, which corresponds to a person who is 50% in the red state and 50% in the blue state? I think a mixed model is appropriate here when the state variable is discrete, so that each value of the state variable represents a different part of the population. The other model should be used when people move a lot and frequently change the state they vote in, but in that case you would have to consider the fluctuations in the total population of each state at the time of voting.


@justk The R^2 value of 0.01 calculated on that webpage uses both states, not just one. The within-state variance of the votes is 0.55 × 0.45 = 0.2475 in both states (residual standard deviation √0.2475 ≈ 0.497 ≈ 0.5). I don't think it makes sense to use a mixed model in this case since the variance is the same for each state. A mixed model is used when the observations have some structured heteroskedasticity, i.e. different variances for different values of the independent variables.


> The R^2 used with a linear model requires a constant term, in this case the constant term or bias explains a lot about preferences (almost 50/50) so there is less information available for the slope term.

This explains the paradox, basically. When you take the null model "preference = 50%" (i.e. intercept-only model), there simply isn't much residual variance left for the linear model to explain.

That's why you get an R^2 = 1 if you use the "R^2 = rho(state, preference)^2" formula (ignoring the role of the intercept in explaining most of the variance, and exploiting the translation invariance of the Pearson correlation), versus an R^2 = 0.01 when you use the (more correct) "R^2 = explained variance / total variance" formula.

TL;DR: It makes sense to get a very low R^2 when it is the intercept and not the predictor that is explaining most of the variance.


> there simply isn't much residual variance left for the linear model to explain.

I'd say that there is still quite a lot of residual variance to explain. You need a baseline - the worst choice would be to predict 0 (or 1) and the mean squared error would be 0.5. Using 0.5 as baseline halves the mean squared error to 0.25.
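A minimal check of those baselines (Python, on the thread's 50/50 voting data):

```python
# MSE of constant predictors on 50/50 binary outcomes.
pref = [0] * 20 + [1] * 20

def mse(pred):
    return sum((y - pred) ** 2 for y in pref) / len(pref)

worst = mse(0.0)       # always predict 0 (or 1): wrong half the time -> 0.5
baseline = mse(0.5)    # always predict the mean 0.5 -> 0.25
```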


This, of course, will depend on how you code your variables, but if you fit null, intercept-only and predictor-only models, you get this as residual variance:

  > data <- data.frame(state = c(0, 1), pref = c(0.45, 0.55))
  > sum(residuals(lm(pref ~ 0, data = data))^2) # null model
  [1] 0.505
  > sum(residuals(lm(pref ~ 1, data = data))^2) # intercept-only model
  [1] 0.005
  > sum(residuals(lm(pref ~ state + 0, data = data))^2) # predictor-only model
  [1] 0.2025

So, it seems clear that you only get a "perfect" prediction with the full (intercept + predictor) model mostly because of the intercept (which explains (0.505-0.005)/0.505 = 0.99 = 99% of the variance).

Thus, it makes sense that the predictor is only explaining the rest (i.e. 1%) of the variance... hence, the R^2 = 0.01


This is a little strange: you are using a data.frame with only two points, so any linear model with two free parameters will be 100% accurate. It is the line that connects the two points.


An affine model, yes (as I mentioned in my comment). A linear model, no ;)

But, anyway, seems like I interpreted things incorrectly.


Your calculation is not directly related to the model (and associated R²) discussed in the article which are about the prediction of individual votes using the state as predictor - not state averages using the state as predictor.

Maybe I'm completely missing your point but the calculations in the blog post are, adapting your code (I think you meant mean where you wrote sum):

  > data <- data.frame(state = rep(c(0, 1), each=20), pref = c(rep(0, 11), rep(1, 9), rep(0, 9), rep(1, 11)))

  > mean(residuals(lm(pref ~ 0, data = data))^2) # null model [NOT IN THE BLOG POST]
  [1] 0.5

  > mean(residuals(lm(pref ~ 1, data = data))^2) # BASELINE intercept-only model
  [1] 0.25

  > mean(residuals(lm(pref ~ state + 0, data = data))^2) # predictor-only model [NOT IN THE BLOG POST]
  [1] 0.34875

  > mean(residuals(lm(pref ~ state, data = data))^2) # MODEL
  [1] 0.2475

  > summary(lm(pref ~ state, data = data))$r.squared # MODEL
  [1] 0.01

The blog post is about what you call the "intercept-only" model (MSE 0.25) and the full model (MSE 0.2475); the R² is (0.25 - 0.2475)/0.25 = 0.01. His calculation is slightly different: instead of 0.25 - 0.2475 he calculates 0.05^2 = 0.0025 directly, which is the variance of the predictions (in this case the total variance 0.25 can be decomposed as the variance of the errors, 0.2475, plus the variance of the predictions, 0.0025).


(After re-reading the blog post with more care...) you are right, and thanks for the correction.

Either way, the point stands... the improvement in using a full linear model (that predicts 0.45 or 0.55, depending on state) is marginal compared to the baseline model that always predicts 0.50, as you demonstrate with your code.

To me, this doesn't seem paradoxical... the predictor is indeed providing little information over the "let's flip a coin to predict someone's voting preference" null/baseline predictor, since people's preferences (in aggregate) are almost equivalent to "flipping a coin".

note: I meant "sum", but it's the same, since the ratio between sums of squares is equivalent to the ratio between mean squares


> Either way, the point stands... the improvement in using a full linear model (that predicts 0.45 or 0.55, depending on state) is marginal compared to the baseline model that always predicts 0.50

Yes, I think we don't disagree. I was just puzzled by the "little variance left to explain" remark.

> note: I meant "sum", but it's the same, since the ratio between sums of squares is equivalent to the ratio between mean squares

You're right, sum of squares made sense if it was just for the ratio.


Thanks for taking the time to clarify my confusion.

It's not that there is "little variance left to explain", but actually that (no matter what) there will always be too much variance left to be explained, when the response is Bernoulli-distributed and the parameter is not too far from 0.5 (i.e., the data generating process is like flipping a slightly loaded coin).

If you use the expected value to predict the Bernoulli variable, you will always be somewhat wrong (0.45 and 0.55 are both far from 0 and from 1, which are the only possible responses).

If you use a binary response to predict, you will quite often be very wrong, even if you are right on average, and even if your prediction is to generate Bernoulli-distributed samples from the exact same distribution (i.e., you know exactly how the coin is loaded/biased and you can exactly replicate its data generation process).
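A back-of-envelope version of the two strategies, assuming p = 0.55 as in the example:

```python
p = 0.55  # Bernoulli parameter: probability of voting 1

# (a) Predict the mean p: the outcome is 0 with probability 1-p (error p)
# and 1 with probability p (error 1-p), so E[squared error] = p(1-p).
mse_mean = (1 - p) * p ** 2 + p * (1 - p) ** 2     # = p * (1 - p) = 0.2475

# (b) Predict by sampling from the same Bernoulli(p): prediction and outcome
# disagree with probability 2p(1-p), and each disagreement costs 1.
mse_sample = 2 * p * (1 - p)                       # = 0.495
```

So sampling from the exact data-generating distribution is, on average, twice as bad as predicting the mean, even though it is "right on average".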

So... yeah... no "paradox" ;)

