
Frankly, I think the 'latest' generation of models from a lot of providers, the ones that switch between 'fast' and 'thinking' modes, are really just the 'latest' because they nudge users toward cheaper inference by default. In ChatGPT I still trust o3 the most; it gives me fewer flat-out wrong or nonsensical responses.

I suspect that once these models hit 'good enough' for ~90% of users and use cases, the providers started optimizing for cost instead of quality, while still benchmarking and advertising on quality.




