The real world is governed by physics, so isn't "next-state prediction" a sufficient eval function to force a model to internalize a world model? And extending the timespan over which you predict demands an increasing amount of "intelligence," because it requires modeling the real-world behavior of constituent subsystems that are often black boxes (e.g., if a crow is on the road as a car approaches, you can't treat the scene as a pure physics simulation; you need to know that crows can fly to understand what is going to happen).
I don't see how this is any less structured than the CLM objective of LLMs; there's a bunch of rich information there.
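For concreteness, here's a minimal sketch of what that objective could look like, written to mirror teacher-forced next-token prediction. Everything here is an assumption for illustration: `NextStatePredictor`, the GRU backbone, the continuous state vectors, and the MSE loss are stand-ins; a real system might predict pixels, latents, or discretized tokens instead.

```python
import torch
import torch.nn as nn

# Hypothetical world model: summarizes the trajectory so far and predicts the next state.
# Shapes (batch, T, state_dim) are assumptions, not any particular system's API.
class NextStatePredictor(nn.Module):
    def __init__(self, state_dim=64, hidden_dim=256):
        super().__init__()
        self.rnn = nn.GRU(state_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, state_dim)

    def forward(self, states):           # states: (batch, T, state_dim)
        hidden, _ = self.rnn(states)     # causal summary of the trajectory up to each step
        return self.head(hidden)         # predicted next state at every position

def next_state_loss(model, trajectory):
    """Teacher-forced next-state prediction, directly analogous to CLM:
    predict state[t+1] from states[:t+1] at every position."""
    inputs, targets = trajectory[:, :-1], trajectory[:, 1:]
    preds = model(inputs)
    return nn.functional.mse_loss(preds, targets)

model = NextStatePredictor()
trajectory = torch.randn(8, 32, 64)      # fake batch of 32-step trajectories
loss = next_state_loss(model, trajectory)
loss.backward()
```

Lengthening the prediction horizon in this setup just means supervising `state[t+k]` for larger `k` (or rolling the model's own predictions forward), which is where the "you have to model the crow, not just the kinematics" difficulty comes in.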