That's especially encouraging to me because those are all about generalization.
5 and 5.1 both felt overfit and would break down and be stubborn when you got them outside their lane, as opposed to Opus 4.5, which is lovely at self-correcting.
It's one of those things you really feel in the model: not whether it can tackle a harder problem, but whether I can go back and forth with it, learning and correcting together.
This whole release makes me insanely optimistic. If they can push this much improvement WITHOUT the huge new data centers and without a newly scaled base model, that's incredibly encouraging for what comes next.
Remember, the next big data centers are 20-30x the chip count, with 6-8x the efficiency on the new chips.
I expect they can saturate the benchmarks WITHOUT any novel research or algorithmic gains. But at this point it's clear they're capable of pushing research qualitatively as well.
> 5 and 5.1 both felt overfit and would break down and be stubborn when you got them outside their lane, as opposed to Opus 4.5, which is lovely at self-correcting.
This is simply the "openness vs directive-following" spectrum, which as a side effect produces the sycophancy spectrum, and none of them has found an answer to it yet.
Recent GPT models follow directives more closely than Claude models, and are less sycophantic. Even Claude 4.5 models are still somewhat prone to "You're absolutely right!", while GPT 5+ (API) models never do this. The byproduct is that the Claude models are willing to self-correct, and the GPT models are more stubborn.
Opus 4.5 answers most of my non-question comments with "you're right." as the first thing in the output. At least I'm not absolutely right; I'll take that as an improvement.
Hah, maybe 5th gen Claude will change to "you may be right".
The positive thing is that it seems to be more performative than anything. Claude models will say "you're [absolutely] right" and then immediately do something that contradicts it (because you weren't right).
Gemini 3 Pro seems to have struck a decent balance between stubbornness and you're-right-ness, though I still need to test it more.
5.2 seems worse on overfitting for esoteric logic puzzles in my testing: tests using precise language, where attention has to be paid to pick the correct definition among many for a given word. It now charges ahead with the wrong definition far more often, with noticeably lower accuracy.
Slight tangent, but I think it's quite interesting: you can try out the ARC-AGI 2 tasks by hand at this website [0] (along with other similar problem sets). Really puts into perspective the type of thinking AI is learning!