"why didn't we think of this sooner?", asks the article. Not sure who the "we" is supposed to be, but the robotics community has definitely thought of this before. https://robo-affordances.github.io/ from 2023 is one pretty relevant example that comes to mind, but I have recollections of similar ideas going back to at least 2016 or so (many of which are cited in the V-JEPA2 paper). If you think data-driven approaches are a good idea for manipulation, then the idea of trying to use Youtube as a source of data (an extremely popular data source in computer vision for the past decade) isn't exactly a huge leap. Of course, the "how" is the hard part, for all sorts of reasons. And the "how" is what makes this paper (and prior research in the area) interesting.