It's not a scholarly article but a blog post, though you're still right to be frustrated by the very bad writing. I do get the jargon, despite myself, so I can translate: the authors of the blog post claim that machine learning for autonomous robotics is "solved" thanks to an instance of V-JEPA 2 trained on all videos on YouTube. It isn't, of course, and the authors themselves point out the severe limitations of the otherwise promising approach (championed by Yann LeCun) when they say, in a notably more subdued manner:
>> the model is basically a diva about camera positioning. move the camera 10 degrees and suddenly it thinks left is right and up is down.
>> in practice, this means you have to manually fiddle with camera positions until you find the sweet spot. very scientific. much engineering.
>> long-horizon drift
>> try to plan more than a few steps ahead and the model starts hallucinating.
That is to say, not quite ready for the real world, V-JEPA 2 is.
But for those who don't get the jargon, there's a scholarly article linked at the end of the post that is rather more sober and down-to-earth:
A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.
https://arxiv.org/abs/2506.09985
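
For anyone wondering what "planning with image goals" amounts to in practice, here's a toy sketch of the general idea. To be clear, this is my own illustration, not the authors' code: the module names, the tiny sizes and the sampling-based planner are all assumptions on my part (the real thing uses a large ViT encoder and an action-conditioned transformer predictor). The gist: encode the current frame and the goal frame, roll candidate action sequences forward through the predictor in latent space, and execute the sequence whose predicted latent lands closest to the goal's latent.

    # Toy sketch of latent planning toward an image goal (illustrative, not the paper's code).
    import torch
    import torch.nn as nn

    LATENT, ACTION, HORIZON = 64, 7, 5  # toy sizes; a 7-DoF arm action is assumed

    encoder = nn.Sequential(nn.Flatten(0), nn.Linear(3 * 32 * 32, LATENT))      # stand-in for the video encoder
    predictor = nn.Sequential(nn.Linear(LATENT + ACTION, 128), nn.ReLU(),
                              nn.Linear(128, LATENT))                           # stand-in for the action-conditioned predictor

    def rollout(z, actions):
        """Roll the latent state forward through a sequence of actions."""
        for a in actions:
            z = predictor(torch.cat([z, a], dim=-1))
        return z

    def plan(current_img, goal_img, samples=256, iters=3, elite=32):
        """Sampling-based planning: pick actions whose predicted latent is nearest the goal latent."""
        with torch.no_grad():
            z0, z_goal = encoder(current_img), encoder(goal_img)
            mean = torch.zeros(HORIZON, ACTION)
            std = torch.ones(HORIZON, ACTION)
            for _ in range(iters):
                acts = mean + std * torch.randn(samples, HORIZON, ACTION)   # candidate action sequences
                costs = torch.stack([torch.norm(rollout(z0, seq) - z_goal)
                                     for seq in acts])                      # cost: distance to goal latent
                best = acts[costs.topk(elite, largest=False).indices]       # keep the cheapest sequences
                mean, std = best.mean(0), best.std(0) + 1e-6                # refit the sampling distribution
            return mean[0]  # execute only the first action, then replan

    first_action = plan(torch.rand(3, 32, 32), torch.rand(3, 32, 32))

Which also makes it obvious why the model is "a diva about camera positioning": everything hangs off the encoder's latent, so if the latent shifts when the camera moves, the plan shifts with it.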
In other words: some interesting results, some new SOTA, some incremental work. But it's a lot of work by a big team of a couple dozen researchers, so there's good stuff in there almost inevitably.