We accidentally solved robotics by watching 1M hours of YouTube

Extremely oversold article.

> the core insight: predict in representation space, not pixels

We've been doing this since 2014? Not only that, others have been doing it at a similar scale, e.g. Nvidia's world foundation models (although those are generative).

> zero-shot generalization (aka the money shot)

This is easily beaten by flow-matching imitation learning models like the ones Pi (Physical Intelligence) has.

> accidentally solved robotics

They're getting a 65% success rate on very simple tasks.

The research is good. This article, however, misses a lot of other work in the literature. I'd recommend not treating it as an authoritative source.