> Pure vision will never be enough because it does not contain information
Say it louder for those in the back!
But actually there's more to this that makes the problem even harder! Lack of sensors is just the beginning. There are well-known results in physics showing that:
You cannot create causal models through observation alone.
This is a real pain point for these vision world models, and most people I talk to (including a lot at the recent CVPR) just brush this off with "we just care if it works." Guess what?! Everyone who is pointing this out also cares that it works! We need to stop these thought-terminating clichés. We're fucking scientists.
Okay, so why isn't observation enough? It's because observation alone can't differentiate between alternative but valid hypotheses that fit the same data. You often have to intervene! We're all familiar with this part. You control variables and modify one, or a limited set, at a time. Experimental physics is no easy task, even for things that sound rather mundane. This is in fact why children and animals play (okay, I'm conjecturing here).
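To make the "can't differentiate hypotheses" point concrete, here's a toy sketch in Python. It's mine, not from any paper: the two binary variables, the 10% noise level, and the function names are all made-up assumptions. One model says X causes Y, the other says Y causes X. Watching them gives the exact same joint distribution; only forcing X by hand tells them apart.

```python
from itertools import product

FLIP = 0.1  # assumed noise level, purely illustrative


def joint_x_causes_y(do_x=None):
    """Exact P(x, y) for the hypothesis X -> Y.

    do_x=None means we only watch. do_x=0 or 1 means we force X ourselves.
    """
    dist = {}
    for x, y in product((0, 1), repeat=2):
        p_x = 0.5 if do_x is None else float(x == do_x)
        p_y_given_x = 1 - FLIP if y == x else FLIP
        dist[(x, y)] = p_x * p_y_given_x
    return dist


def joint_y_causes_x(do_x=None):
    """Exact P(x, y) for the hypothesis Y -> X.

    Forcing X overrides its dependence on Y, and Y doesn't care.
    """
    dist = {}
    for x, y in product((0, 1), repeat=2):
        p_y = 0.5
        if do_x is None:
            p_x_given_y = 1 - FLIP if x == y else FLIP
        else:
            p_x_given_y = float(x == do_x)
        dist[(x, y)] = p_y * p_x_given_y
    return dist


def p_y1(dist):
    """Marginal probability that Y = 1."""
    return sum(p for (x, y), p in dist.items() if y == 1)


# Watching only: the two hypotheses give the same table.
observed_a = joint_x_causes_y()
observed_b = joint_y_causes_x()

# Intervening: force X = 1 and look at Y.
forced_a = p_y1(joint_x_causes_y(do_x=1))
forced_b = p_y1(joint_y_causes_x(do_x=1))
print(observed_a == observed_b, forced_a, forced_b)
```

Under observation the tables match entry for entry. Under the forced X = 1, the first hypothesis puts P(Y=1) at 0.9 and the second at 0.5. No amount of extra passive data separates them; one poke does.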
We need to mention chaos here, because it's the easiest way to understand this. There are many famous problems that fall into this category, like the double pendulum, the three-body problem, or just fucking gas molecules moving around. Let's take the last one. Suppose you are observing some gas molecules moving inside a box. You measure their positions at t0 and at T. Can you predict their trajectories between those time points? Surprisingly, the answer is no. You can only do this statistically. There are probable paths, but no single determined one (this same logic is what leads to multiverse theory btw). But now suppose I was watching the molecules too, but I was continuously recording between t0 and T. Can I predict the trajectories? Well, I don't need to, I just write them down.
Now I hear you, you're saying "Godelski, you observed!" But the problem with this class of problems is that if you don't observe the initial state, you can't predict moving forwards, and if you don't have very precise observation intervals, you are hit with the same problem. If you turn around while I start a double pendulum, you can have as much time as you want when you turn back around and you still won't be able to model its trajectory.
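Here's a quick sketch of why a slightly-off initial state kills your predictions. I'm swapping in the logistic map for the double pendulum because it fits in a few lines; it's a stand-in chaotic system, not the same physics, and the starting value and the 1e-10 "measurement error" are arbitrary assumptions.

```python
def logistic_trajectory(x0, steps, r=4.0):
    """Iterate x -> r * x * (1 - x), which is chaotic at r = 4."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# The "real" system, and our model of it with a tiny error in the initial state.
true_path = logistic_trajectory(0.2, 100)
measured_path = logistic_trajectory(0.2 + 1e-10, 100)

# Per-step gap between what happens and what we predicted.
gaps = [abs(a - b) for a, b in zip(true_path, measured_path)]

print("gap after 5 steps:", gaps[5])
print("largest gap in steps 50-100:", max(gaps[50:]))
```

The gap starts out invisible and ends up as large as the signal itself. Same equations, same everything, except for an error in the tenth decimal place of the starting point.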
But it gets worse still. There are confounding variables. There is coupling. It's difficult to differentiate hypotheses by causal ordering. And so, so much more. If you ever wonder why physicists do so much math, it's because doing that is a fuck ton easier than doing the whole set of tests and then reverse engineering the equations from those observations. But in physics we care about counterfactual statements. In F=ma we can propose new masses and new accelerations and rederive the results. That's what it's all about. Your brain does an amazing job at this too! You need counterfactual modeling to operate in real-world environments. You have to be able to ask and answer "what happens if that kid runs into the street?"
I highly suggest people read The Relativity of Wrong [0]. It's a short essay by Isaac Asimov that can serve as a decent intro, though it's far from complete. I'm suggesting it because I don't want people to confuse "need a counterfactual model" with "need the right answer." If you don't get into metaphysics, these results will be baffling.[1] It is also needed to clear up any confusion you might have about the aforementioned distinction.
Tldr:
if you could do it from observation alone, physics would have been solved a thousand years ago
There's a lot of complexity and depth that is easy to miss amid the excitement, but it still matters.
I'm just scratching the surface here too, and we're only talking about mechanics. No quantum needed, just information loss.
[0] https://hermiene.net/essays-trans/relativity_of_wrong.html
[1] Maybe this is why there are so few physicists working on the world modeling side of ML. At least, using that phrase...