Extremely oversold article.
> the core insight: predict in representation space, not pixels
We've been doing this since 2014? Not only that, others have been doing it at a similar scale, e.g. Nvidia's world foundation models (although those are generative).
> zero-shot generalization (aka the money shot)
This is easily beaten by flow-matching imitation learning models like what Pi has.
> accidentally solved robotics
They're doing 65% success on very simple tasks.
The research is good. This article, however, misses a lot of other work in the literature. I would recommend not reading it as an authoritative source.