
209 points by alexcos | 6 comments
1. liendolucas (No.44421182)
I didn't understand a single word of this post or what it was supposed to solve, and I had to stop reading.

Was this actually written by a human being? If so, the author(s) have severe communication problems. It doesn't seem grounded in reality, or at least not in my personal experience with robotics. But here's my real-world take:

Robotics is going to be partially solved when ROS/ROS2 becomes effectively exterminated and completely replaced by a sane robotics framework.

I seriously urge the authors to use ROS/ROS2. Show us: implement your solution with ROS, push it to a repository, and let others verify what you solved. Suffer a bit with the framework, then write a real, hands-on post about real robotics, instead of wandering through fancy, incomprehensible stuff that probably no one will ever do.
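
For reference, here's a minimal sketch of what a hands-on ROS 2 integration could look like: a bare-bones rclpy node that subscribes to a camera topic and publishes velocity commands from some learned policy. The topic names and the policy hook are placeholders I'm making up, not anything from the post or the V-JEPA 2 repo.

  # Minimal rclpy node sketch; topic names and the policy hook are placeholders.
  import rclpy
  from rclpy.node import Node
  from sensor_msgs.msg import Image
  from geometry_msgs.msg import Twist

  class WorldModelPolicyNode(Node):
      def __init__(self):
          super().__init__('world_model_policy')
          # Camera topic is a placeholder; real robots will differ.
          self.sub = self.create_subscription(
              Image, '/camera/image_raw', self.on_image, 10)
          self.cmd_pub = self.create_publisher(Twist, '/cmd_vel', 10)

      def on_image(self, msg: Image) -> None:
          # A real integration would run the learned model on `msg` here
          # and turn its predicted action into a velocity command.
          cmd = Twist()
          cmd.linear.x = 0.0  # stand still until a model is actually wired in
          self.cmd_pub.publish(cmd)

  def main():
      rclpy.init()
      node = WorldModelPolicyNode()
      rclpy.spin(node)
      node.destroy_node()
      rclpy.shutdown()

  if __name__ == '__main__':
      main()

Nothing clever, but it's the kind of artifact you can actually clone, run against a rosbag, and argue about.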

Then we can maybe start talking about robotics.

replies(4): >>44422507, >>44423136, >>44425734, >>44427473
2. rage4774 (No.44422507)
I totally agree with you. On the other hand, the theory behind it (combining image recognition with predicting outcomes from specific physical impacts) does sound intriguing, and like a somewhat newer idea.

But besides that, you're totally right. It's too "loose", since to realize that idea the process would have to be way different (and properly explained).

3. w4 (No.44423136)
It is readily understandable if you are fluent in the jargon surrounding state-of-the-art LLMs and deep learning, and completely inscrutable if you aren't. The article is also very high level and disconnected from specifics. You can skip to FAIR's paper and code (linked at the article's end) for the details: https://github.com/facebookresearch/vjepa2

If I had to guess, it seems likely that there will be a serious cultural disconnect as 20-something deep learning researchers increasingly move into robotics, not unlike the cultural disconnect that happened in natural language processing in the 2010s and early 2020s. Probably lots of interesting developments, and also lots of youngsters excitedly reinventing things that were solved decades ago.

replies(1): >>44427335
4. YeGoblynQueenne (No.44425734)
It's not a scholarly article but a blog post, yet you're still right to be frustrated at the very bad writing. I do get the jargon, despite myself, so I can translate: the authors of the blog post claim that machine learning for autonomous robotics is "solved" thanks to an instance of V-JEPA 2 trained on all the videos on YouTube. It isn't, of course, and the authors themselves point out the severe limitations of the otherwise promising approach (championed by Yann LeCun) when they say, in a notably more subdued manner:

>> the model is basically a diva about camera positioning. move the camera 10 degrees and suddenly it thinks left is right and up is down.

>> in practice, this means you have to manually fiddle with camera positions until you find the sweet spot. very scientific. much engineering.

>> long-horizon drift

>> try to plan more than a few steps ahead and the model starts hallucinating.

That is to say, not quite ready for the real world, V-JEPA 2 is.

But for those who don't get the jargon, there's a scholarly article linked at the end of the post that is rather more sober and down-to-earth:

A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.

https://arxiv.org/abs/2506.09985
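
To make the "planning with image goals" bit concrete, the general recipe is: encode the current frame and the goal image into a latent space, roll candidate action sequences through the action-conditioned predictor, and execute the sequence whose predicted final latent lands closest to the goal. Here's a rough sketch, with made-up encoder/predictor interfaces and a crude random-shooting planner standing in for whatever the paper actually uses:

  # Sketch of latent-space planning toward an image goal. The encoder/predictor
  # interfaces are stand-ins, not the actual V-JEPA 2-AC API; the encoder is
  # assumed to return a flat latent vector per image.
  import torch

  def plan_to_image_goal(encoder, predictor, current_image, goal_image,
                         horizon=5, num_samples=256, action_dim=7):
      z = encoder(current_image)      # latent of the current observation
      z_goal = encoder(goal_image)    # latent of the goal image

      # Sample random candidate action sequences (random-shooting MPC).
      actions = torch.randn(num_samples, horizon, action_dim)

      # Roll every candidate sequence forward through the world model.
      z_pred = z.expand(num_samples, -1)
      for t in range(horizon):
          z_pred = predictor(z_pred, actions[:, t])

      # Pick the sequence whose predicted end state is closest to the goal,
      # return its first action, and re-plan after executing it.
      cost = torch.norm(z_pred - z_goal, dim=-1)
      return actions[cost.argmin(), 0]

The blog post's complaints about camera sensitivity and long-horizon drift map directly onto this loop: the cost is computed in latent space, so anything that shifts the latents (a moved camera) or compounds prediction error (a long horizon) breaks the plan.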

In other words: some interesting results, some new SOTA, some incremental work. But it's a lot of work from a big team of a couple dozen researchers, so there's good stuff in there almost inevitably.

5. godelski (No.44427335)

  > if you are fluent in the jargon surrounding state of the art LLMs and deep learning
It is definitely not following that jargon. Maybe it follows tech-influencer blog-post jargon, but I can definitively say it doesn't follow the jargon used in research, even though they are summarizing a research paper. Consequently they misinterpret things and use weird phrases like "actionable physics," which is redundant: a physics model is necessarily actionable, since it is required to be a counterfactual model. While I can understand rephrasing things for a more general audience, that's a completely different thing from "being fluent in SOTA work." It's literally the opposite...

Also, it definitely doesn't help that they remove all capitalization except in proper nouns.

6. godelski (No.44427473)

  > Doesn't seem to be grounded at least with reality and my personal experience with robotics.
It also doesn't match my personal experience with either physics or ML, and I have degrees in both.

You cannot develop accurate world models through observation alone, full stop.

You cannot verify accurate world models through benchmarks alone, full stop.

These have been pain points in physics for centuries, and were the major pain point even before the quantum revolution. If it were possible, we'd have solved physics long ago. You can find plenty of people going back thousands of years boldly claiming "there is nothing new to be learned in physics," yet it was never true and still isn't, even if we exclude quantum mechanics and relativity.

Side note: the paper itself is "fine," but I wish we didn't put so much hype in academic writing. Papers should be aimed at other academics, not be advertisements (use the paper to write advertisements for outlets like IFLS or Quanta Magazine, but don't degrade the already difficult researcher-to-researcher communication). So I'm saying the experiments are fine and the work represents progress, but it is oversold and the conclusions do not necessarily follow.

Btw, the paper makes these mistakes too. It makes a very bold assumption that a counterfactual model (aka a "world model") is learned. This cannot be demonstrated through benchmarking; it must be shown through interpretability.
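
A toy illustration of the benchmarking point (entirely made up, just to show the failure mode): two "models" that agree on every benchmark case, while only one of them contains anything you could call a world model.

  # Two toy "world models" for free-fall time that are indistinguishable on the
  # benchmark, yet diverge the moment you ask a question the benchmark never asks.

  def true_physics(height_m):
      # Free fall from rest: t = sqrt(2h/g)
      return (2.0 * height_m / 9.81) ** 0.5

  def lookup_model(height_m):
      # A memorized table that happens to cover the benchmark heights,
      # with a nearest-neighbour guess for anything else.
      table = {1.0: 0.452, 2.0: 0.639, 5.0: 1.010}
      nearest = min(table, key=lambda h: abs(h - height_m))
      return table[nearest]

  benchmark = [1.0, 2.0, 5.0]  # the only heights the benchmark ever tests
  for h in benchmark:
      assert abs(true_physics(h) - lookup_model(h)) < 0.01  # both "pass"

  print(true_physics(100.0))  # ~4.5 s
  print(lookup_model(100.0))  # ~1.0 s -- same benchmark score, no world model

Benchmarks only sample the input space; showing that the mechanism is right is a different problem, and that's the interpretability part.
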

Unfortunately, the tail is long and heavy... you don't need black swan events to disrupt these models and boy does this annoying fact make it easy to "hack" these types of models. And frankly, I don't think we want robots operating in the wild (public spaces, as opposed to controlled spaces like a manufacturing floor) if I can make it think an iPhone is an Apple with just a stickynote. Sure, you can solve that precise example but it's not hard to come up with others. It's a cat and mouse game, but remember, Jerry always wins.