
Given that the video is fully interactive and lets you move around (in a “world”, if you will), I don’t think it’s a stretch to call it a world model. It must have at least some notion of physics, cause and effect, etc. in order to achieve what it does.


No, it actually needs none of that.


How would it do what it does without those things?


The same way all these models work: by simple interpolation.


But how does it interpolate?


Pixel by pixel, time-slice by time-slice, in a 2D+T convolution. You provide enough example videos with a changing point of view, and the model reproduces what it is given.
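For readers unfamiliar with the term, a 2D+T convolution slides one small kernel over two spatial axes plus the time axis of a video. The sketch below is a minimal single-channel NumPy version, written only to show the operation; the function name `conv2d_plus_t`, the averaging kernel, and the random input are illustrative assumptions and say nothing about the actual architecture of the model under discussion.

```python
import numpy as np

def conv2d_plus_t(video, kernel):
    """Valid-mode cross-correlation over (time, height, width).

    video:  array of shape (T, H, W), one channel
    kernel: array of shape (kt, kh, kw)
    """
    T, H, W = video.shape
    kt, kh, kw = kernel.shape
    out_t, out_h, out_w = T - kt + 1, H - kh + 1, W - kw + 1
    out = np.zeros((out_t, out_h, out_w))
    # Accumulate one shifted, weighted copy of the video per kernel tap.
    for dt in range(kt):
        for dy in range(kh):
            for dx in range(kw):
                out += kernel[dt, dy, dx] * video[
                    dt:dt + out_t, dy:dy + out_h, dx:dx + out_w
                ]
    return out

# Hypothetical input: 8 frames of 16x16 random "pixels".
video = np.random.default_rng(0).random((8, 16, 16))
# Hypothetical kernel: a 3x3x3 box filter that averages a local space-time patch.
kernel = np.full((3, 3, 3), 1 / 27)
features = conv2d_plus_t(video, kernel)
print(features.shape)
```

Each output value depends only on a small space-time neighbourhood of the input. A real video model stacks many such layers with learned kernels, so this sketch does not settle what those stacked layers end up representing.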


Yes, it reproduces what it is given by modelling the rules of physics, geometry, etc.

For example, image generators like Stable Diffusion carry strong representations of depth and geometry, such that performant depth estimation models can be built out of them with minimal retraining. This continues to be true for video generation models.

Early work on the subject: https://arxiv.org/pdf/2409.09144
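One simple way to test a claim like this is a linear probe: freeze the generator, take its internal activations, and fit only a linear map from those activations to depth. The sketch below shows the mechanics on synthetic data. The features are random stand-ins and depth is constructed to be linearly decodable, so the high score is true by construction and is not evidence about any real model; `fit_linear_probe` and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for frozen per-pixel activations from a generator.
n_pixels, n_features = 2000, 32
features = rng.normal(size=(n_pixels, n_features))

# Assumption for illustration only: depth is a noisy linear function of the features.
true_direction = rng.normal(size=n_features)
depth = features @ true_direction + 0.1 * rng.normal(size=n_pixels)

def fit_linear_probe(X, y, ridge=1e-3):
    """Ridge regression in closed form: solve (X^T X + ridge * I) w = X^T y."""
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

# Fit on one split, evaluate on held-out pixels.
train, test = slice(0, 1500), slice(1500, None)
w = fit_linear_probe(features[train], depth[train])
pred = features[test] @ w

# Coefficient of determination on the held-out split.
residual = np.sum((depth[test] - pred) ** 2)
total = np.sum((depth[test] - depth[test].mean()) ** 2)
r2 = 1 - residual / total
print(f"held-out R^2: {r2:.3f}")
```

With a real model, the features would come from the frozen network and depth from ground-truth labels. A high held-out score would then suggest the representation already encodes depth, and a low one would suggest it does not.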


What? No, it does no such thing. Study the architecture. Pixels in. Pixels out.


