There are a few things to consider here:

- Many aspects of the video are not convincing, which indicates these videogen models do not grok the world the same way a typical human does.
- A 6-year-old child is almost certainly incapable of recreating video footage with pixel-level fidelity, yet understands the world extremely well, far beyond what current robotics is capable of.

Together, these two facts suggest that predicting noise (as in DDPM diffusion models), or predicting pixel-level (or even VAE latent "pixel") information, is probably not the optimal path to world understanding. It is probably not even a good path to good world models.