
209 points alexcos | 34 comments
1. dchftcs ◴[] No.44419191[source]
Pure vision will never be enough because it does not contain information about the physical feedback like pressure and touch, or the strength required to perform a task.

For example, you need that feedback so you don't crush a human when giving a massage (while still pressing hard), or so you apply the right amount of force (and finesse) to skin a fish fillet without cutting through the skin itself.

Practically, in the near term, it's hard to sample failure examples from YouTube videos, such as food accidentally spilling out of a pot. Studying simple tasks only through the happy path makes it hard to get the robot to figure out how to keep going until it succeeds, a problem that appears even in relatively simple jobs like shuffling garbage.

With that said, I suppose a robot can be made to practice in real life after learning something from vision.

replies(4): >>44419561 #>>44419692 #>>44420011 #>>44426961 #
2. namibj ◴[] No.44419561[source]
If the robot already knows how to follow the happy path, the training difficulty drops severely, at least if it can continue after a recovery.
replies(1): >>44419906 #
3. rocqua ◴[] No.44419692[source]
On humans, you can generally see the force they apply by looking at strain.
replies(1): >>44419916 #
4. dchftcs ◴[] No.44419906[source]
The tasks you do to recover from a failure are often different from the happy path. For example, the happy path of dumping garbage is carrying a garbage bag to a collection bin. The non-happy paths are that the bin is overflowing and you have to put the bag on the ground, or the bag leaks and you need to move everything to a new bag, or the bag breaks entirely and you have to pick up the trash again.

But yeah, I think a better way to put it is that sampling the happy path does make the failure cases easier, but sampling just happy paths is far from sufficient for completing even some of the simplest human tasks once failures occur.

5. dchftcs ◴[] No.44419916[source]
The error margins will be huge, and for small enough forces (like the skinning example, or handling fine mechanical parts) there's basically zero signal.
6. carlosdp ◴[] No.44420011[source]
> Pure vision will never be enough because it does not contain information about the physical feedback like pressure and touch, or the strength required to perform a task.

I'm not sure that's necessarily true for a lot of tasks.

A good way to measure this in your head is this:

"If you were given remote control of two robot arms, and just one camera to look through, how many different tasks do you think you could complete successfully?"

When you start thinking about it, you realize there are a lot of things you could do with just the arms and one camera, because you as a human have really good intuition about the world.

It therefore follows that robots should be able to learn with just RGB images too! Counterexamples would be things like grabbing an egg without crushing it, perhaps. Though I suspect that could also be done with just vision.

replies(9): >>44420219 #>>44420289 #>>44420630 #>>44420695 #>>44420919 #>>44421236 #>>44423275 #>>44425473 #>>44427030 #
7. jpc0 ◴[] No.44420219[source]
I think you vastly underestimate how difficult the task you are proposing would be without depth or pressure indication, even for a super intelligence like humans.

Simple concept: pick up a glass and pour its contents into a vertical hole the approximate size of your mouth. Think of all the failure modes that can be triggered in this trivial example you perform multiple times a day. Doing the same from a single camera feed with no other indicators would take you hours to master, and you are already a super-intelligent being.

replies(3): >>44420596 #>>44420608 #>>44420928 #
8. moefh ◴[] No.44420289[source]
> It therefore follows that robots should be able to learn with just RGB images too!

I don't see how that follows. Humans have trained by experimenting with actually manipulating things, not just by vision. It's not clear at all that someone who had gained intuition about the world exclusively by looking at it would have any success with mechanical arms.

replies(1): >>44423246 #
9. stavros ◴[] No.44420596{3}[source]
If I have to pour water into my mouth, you can bet it's going all over my shirt. That's not how we drink.
replies(1): >>44420718 #
10. jrimbault ◴[] No.44420608{3}[source]
A routine gesture I've done every day for almost all my life: getting a glass out of the shelves and into my left hand. It seems like a no-brainer: I open the cabinet with my left hand, take the glass with my right hand, toss the glass from my right hand to the left while closing the cabinet with my shoulder, put the glass under the faucet with my left hand, and open the faucet with my right.

I have done this three-second gesture, and variations of it, my whole life basically, and never noticed I was tossing the glass from one hand to the other without any visual feedback.

replies(1): >>44424435 #
11. jaisio ◴[] No.44420630[source]
> When you start thinking about it, you realize there are a lot of things you could do with just the arms and one camera, because you as a human have really good intuition about the world.

And where does this intuition come from? It was built by feeling other sensations in addition to vision. You learned how gravity pulls things down when you were a kid; how hot and cold feel, how hard and soft feel, how things smell. Your mental model of the world is substantially informed by non-visual cues.

> It therefore follows that robots should be able to learn with just RGB images too!

That does not follow at all! It's not how you learned either.

Neither have you learned to think by consuming the entirety of all text produced on the internet. LLMs therefore don't think, they are just pretty good at faking the appearance of thinking.

12. suddenlybananas ◴[] No.44420695[source]
Humans have innate knowledge that helps them interact with the world, and they can learn from physical interaction for the rest. RGB images aren't enough.
replies(1): >>44420722 #
13. jpc0 ◴[] No.44420718{4}[source]
Except this is the absolutely most common thing humans do, and my argument is not that it will spill water all over, but rather that it will shatter numerous glasses, knock them over, etc., all before it has even picked up the glass.

The same process will repeat many times while it tries to move the glass to its "face", and then if any variable changes (plastic vs. glass, size, shape, location) all bets are off, purely because there just plainly isn't enough information.

14. whatever1 ◴[] No.44420722{3}[source]
Video games have shown that we can control characters pretty darn well in virtual worlds whose physics we have never experienced. We just look at a 2D monitor and, using a joystick/keyboard, manage to figure it out.
replies(2): >>44421108 #>>44421256 #
15. abenga ◴[] No.44420919[source]
Humans did not accumulate that intuition just using images. In the example you gave, you subconsciously augment the image information with a lifetime of interacting with the world using all the other senses.
replies(1): >>44422133 #
16. var_cw ◴[] No.44420928{3}[source]
The point is how much non-vision sensing, versus pure vision, helps humans be humans. Don't you think LLMs already proved that generalizability doesn't come from multi-modality but from scaling a single modality itself? And JEPA is surely designed to do a better job at that than an LLM. So no doubt raw scaling plus an RL boost will kick in highly predictable and specific robotic movements.
replies(2): >>44422229 #>>44427112 #
17. suddenlybananas ◴[] No.44421108{4}[source]
Yeah, but we already have a conception of what physics should be prior to that, which helps us enormously. It's not like game designers come up with stuff that intentionally breaks our naïve physics.
replies(1): >>44427148 #
18. deadfoxygrandpa ◴[] No.44421236[source]
counterpoint: think about all the tasks you could do with your hands and arms while your eyes are closed. i think it's really a lot of stuff considering blind people can do the vast majority of things sighted people can do, and i suspect anything you could do with your eyes closed would be extremely difficult to do with a camera feed as the literal only sensory input
19. deadfoxygrandpa ◴[] No.44421256{4}[source]
a game has very limited physics. like the buttons you press are pre-tuned to perform certain actions and you aren't dealing with continuous, nearly infinite possibilities with large ranges of motion, pressure, speed etc. like think about how difficult the game QWOP is because you mostly just have visual feedback
replies(1): >>44428032 #
20. amelius ◴[] No.44422133{3}[source]
Yes, without extra information, manipulating everyday objects is probably as intuitive to robots as manipulating quantum scale molecules is for humans.
21. datameta ◴[] No.44422229{4}[source]
> generalizability doesn't come from multi-modality but by scaling a single modality itself

Could you expand on what you mean by this?

22. amelius ◴[] No.44423246{3}[source]
You'd use a two-step approach.

1. First create a model that can evaluate how well a task is going; the YT approach can be used here.

2. Then build a real-world robot, and train it by letting it do tasks, and use the first model to supervise it; here the robot can learn to rely on extra senses such as touch/pressure.
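A hedged sketch of that two-step loop (all names hypothetical: `score_episode` stands in for the YouTube-trained evaluator, and simple random search stands in for a real robot-learning algorithm; the point is only that the vision-trained model supplies the reward while the robot is free to use extra senses internally):

```python
import random

random.seed(0)

# Step 1's output: a frozen evaluator trained on video. Here a stand-in
# that scores an episode from its frames; a real one would be a learned model.
def score_episode(frames):
    return sum(frames) / len(frames)

# Step 2: the robot rolls out episodes in the real world. Its policy may
# use touch internally, but the supervision signal is vision-only.
def run_episode(gain):
    frames, state = [], 0.0
    for _ in range(20):
        touch = random.random()                    # extra sense, robot-only
        state += gain * (1.0 - state) + 0.01 * touch
        frames.append(max(0.0, min(1.0, state)))   # what the camera sees
    return frames

def train(episodes=200):
    best_gain = 0.0
    best_score = score_episode(run_episode(best_gain))
    for _ in range(episodes):
        gain = best_gain + random.gauss(0.0, 0.05)  # perturb the policy
        score = score_episode(run_episode(gain))
        if score > best_score:                      # evaluator supervises
            best_gain, best_score = gain, score
    return best_gain, best_score

gain, score = train()
```

The evaluator never needs touch data to train, which is exactly what makes the YouTube corpus usable for step 1.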

replies(1): >>44427051 #
23. corimaith ◴[] No.44423275[source]
>"If you were given remote control of two robot arms, and just one camera to look through, how many different tasks do you think you could complete successfully?"

There are an infinite number of scenes that can be matched to a single 2D picture. And what is a scene, really? The last time I checked, raw RGB was not a good input representation in computer vision; instead, CNNs build a compositional scene from increasing levels of gradients. None of that is particularly translatable to how an LM works with text.

24. gregmac ◴[] No.44424435{4}[source]
And you're used to the weight of the glass, which you instantly recognize when you pick it up. If it was a different weight than you were expecting, you'd probably slow down and be more deliberate.

If you were to just do the exact same robotic "throw" action with a glass of unexpected weight you'd maybe not throw hard enough and miss, or throw too hard and possibly break it.

26. godelski ◴[] No.44426961[source]

  > Pure vision will never be enough because it does not contain information
Say it louder for those in the back!

But actually there's more to this that makes the problem even harder! Lack of sensors is just the beginning. There's a well-known result in physics:

  You cannot create causal models through observation alone.
This is a real pain point for these vision world models, and most people I talk to (including many at the recent CVPR) just brush it off with "we just care if it works." Guess what?! Everyone pointing this out also cares that it works! We need to stop these thought-terminating clichés. We're fucking scientists.

Okay, so why isn't observation enough? It's because you can't differentiate alternative but valid hypotheses. You often have to intervene! We're all familiar with this part. You control variables and modify one or a limited set at a time. Experimental physics is no easy task, even for things that sound rather mundane. This is in fact why children and animals play (okay, I'm conjecturing here).
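As a toy illustration of why intervention is needed (hypothetical distributions, chosen only to make the point): the two hypotheses X→Y and Y→X below induce the same observational joint, so no amount of passive data separates them, while a single intervention do(X=true) does:

```python
import random

random.seed(1)

# Hypothesis A: X causes Y.  X is a fair coin; Y copies X 90% of the time.
def sample_A():
    x = random.random() < 0.5
    y = x if random.random() < 0.9 else (not x)
    return x, y

# Hypothesis B: Y causes X, tuned to produce the same joint distribution.
def sample_B():
    y = random.random() < 0.5
    x = y if random.random() < 0.9 else (not y)
    return x, y

def estimate(sampler, n=100_000):
    return sum(1 for _ in range(n) if sampler() == (True, True)) / n

# Observation alone cannot tell A from B: both give P(X=1, Y=1) = 0.45.
p_obs_A, p_obs_B = estimate(sample_A), estimate(sample_B)

# Intervening with do(X=True) cuts the arrow INTO X. Under A, Y still
# tracks X (P(Y=1) = 0.9); under B, Y keeps its own mechanism (P(Y=1) = 0.5).
def do_A():
    return random.random() < 0.9

def do_B():
    return random.random() < 0.5

p_do_A = sum(do_A() for _ in range(100_000)) / 100_000
p_do_B = sum(do_B() for _ in range(100_000)) / 100_000
```

The two samplers are observationally indistinguishable by construction; only the interventional distributions pull them apart.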

We need to mention chaos here, because it's the easiest way to understand this. There are many famous problems that fall into this category, like the double pendulum, the three-body problem, or just fucking gas molecules moving around. Let's take the last one. Suppose you are observing some gas molecules moving inside a box. You measure their positions at t0 and at T. Can you predict their trajectories between those time points? Surprisingly, the answer is no. You can only do this statistically. There are probable paths, but not deterministic ones (this same logic is what leads to multiverse theory, btw). But now suppose I was watching the molecules too, and I was continuously recording between t0 and T. Can I predict the trajectories? Well, I don't need to, I just write them down.
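A minimal sketch of that sensitivity, using the logistic map as a stand-in for the pendulum or gas dynamics (any chaotic system makes the point):

```python
# Two trajectories of the chaotic logistic map x' = r*x*(1 - x), started
# one part in a billion apart, decorrelate within a few dozen steps.
def logistic_trajectory(x0, r=4.0, steps=50):
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

a = logistic_trajectory(0.2)
b = logistic_trajectory(0.2 + 1e-9)   # tiny "measurement error" at t0

early_gap = abs(a[5] - b[5])                              # still tiny
late_gap = max(abs(a[i] - b[i]) for i in range(30, 51))   # order one
```

The error in the initial observation grows roughly exponentially, so after enough steps the recorded past is the only reliable account of the trajectory.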

Now I hear you, you're saying "Godelski, you observed!" But the problem with this set of problems is that if you don't observe the initial state you can't predict forwards, and if you don't have very precise observation intervals you hit the same problem. If you turn around while I start a double pendulum, then no matter how much time you have after you turn back around, you won't be able to model its trajectory.

But it gets worse still. There are confounding variables. There is coupling. There are hypotheses that are difficult to differentiate by causal ordering. And so, so much more. If you ever wonder why physicists do so much math, it's because doing the math is a fuck ton easier than running the whole battery of tests and then reverse engineering the equations from those observations. But in physics we care about counterfactual statements. In F=ma we can propose new masses and new accelerations and rederive the results. That's what it's all about. Your brain does an amazing job at this too! You need counterfactual modeling to operate in real-world environments. You have to be able to ask and answer "what happens if that kid runs into the street?"
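That F=ma point can be made concrete in a few lines (a deliberately trivial sketch: the point is that the mechanism, not the observation log, is what answers "what if"):

```python
# A causal model is a mechanism you can re-run under interventions,
# not just a table of past observations. Here the mechanism is F = m*a.
def acceleration(force_newtons, mass_kg):
    return force_newtons / mass_kg  # a = F / m

# Observed episode: a 2 kg object pushed with 10 N.
observed = acceleration(10.0, 2.0)        # 5.0 m/s^2

# Counterfactual query: "what if the same push hit a 4 kg object?"
# No new observation needed; intervene on mass and rederive.
counterfactual = acceleration(10.0, 4.0)  # 2.5 m/s^2
```

A pure lookup table of observed (force, mass, acceleration) triples could never answer the second query without having seen it.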

I highly suggest people read The Relativity of Wrong [0]. It's a short essay by Isaac Asimov that can serve as a decent intro, though far from complete. I'm suggesting it because I don't want people to confuse "need a counterfactual model" with "need the right answer." If you don't get into the metaphysics, these results will be baffling.[1] It is also needed to clear up any confusion you might have about the aforementioned distinction.

Tldr:

  if you could do it from observation alone, physics would have been solved a thousand years ago
There's a lot of complexity and depth that is easy to miss with the excitement, but it still matters.

I'm just touching the surface here too, and we're just talking about mechanics. No quantum needed, just information loss.

[0] https://hermiene.net/essays-trans/relativity_of_wrong.html

[1] maybe this is why there are so few physicists working on the world modeling side of ML. At least, using that phrase...

27. godelski ◴[] No.44427030[source]

  > because you as a human have really good intuition about the world.
This is the line that causes your logic to fail.

You introduced knowledge not obtained through observation. In fact, the knowledge you introduced is the whole chimichanga! It is an easy mistake to make, so don't feel embarrassed.

The claim is that one can learn a world model[0] through vision. The parent countered by saying "vision is not enough." Then you countered by saying "vision is enough if you already have a world model."

[0] I'll be more precise here. You can learn *A* world model, but it isn't the one we really care about and "a world" doesn't require being a self consistent world. We could say the same thing about "a physics", but let's be real, when we say "physics" we know which one is being discussed...

28. godelski ◴[] No.44427051{4}[source]
You're agreeing with the parent, btw. You've introduced a lot more than just vision: you introduced interventional experimentation. That's a lot more than just observation.
replies(1): >>44427185 #
29. godelski ◴[] No.44427112{4}[source]

  > LLMs already that generalizability
This is not a proven statement. In fact, it's pretty clear that they don't. They have some generalization, but not enough for what you're inferring. The best way to see this is to carefully talk to an LLM about something you have deep domain expertise in. Be careful not to give it answers (information leakage can sneak in subtly) and look specifically for the small, subtle details (that's why it needs to be a topic you have expertise in). "The smell" will be right, but the information won't be.

Also, LLMs these days aren't trained on just language

30. godelski ◴[] No.44427148{5}[source]
I mean, they do, but we often have (to some degree) generalized world models. So when they do things like change gravity, flip things upside down, or make even more egregious changes, we can adapt, because we have counterfactual models. But yeah, they could change things so much that you'd really have to relearn, and that could be very, very difficult if not impossible. (I wonder if anyone has created a playable game with physics that's impossible for humans to learn, at least without pen and paper. I think you could do this by putting the game in higher dimensions.)
31. amelius ◴[] No.44427185{5}[source]
What I describe is an unsupervised system.

What you say ("interventional") sounds like it's human-supervised.

But maybe I'm interpreting it in the wrong way, so please correct me if so.

replies(1): >>44428379 #
32. whatever1 ◴[] No.44428032{5}[source]
I beg to disagree. I was introduced to the brand-new (to me) physics of flying airplanes by MS Flight Simulator. None of the rules I knew from real life applied (gravity matters only sometimes, height can be traded for speed, etc.), yet I learned how to fly.

And when I took real classes in a real Cessna, the experience was transferable (i.e., the flying model I had in my brain was very similar to the one I experienced with my full body in the cockpit).

33. godelski ◴[] No.44428379{6}[source]
By "intervention" I mean interacting with the environment. Propose a hypothesis, test, modify, test again. You can frame RL this way, though RL usually generates hypotheses that are far too naïve.

This looks like a good brief overview (I only skimmed it, but wanted to give you more than "lol, google it"): http://smithamilli.com/blog/causal-ladder/

replies(1): >>44432936 #
34. amelius ◴[] No.44432936{7}[source]
Yes, you need to let the robot play (interact with the environment) to learn the vision-versus-touch correlations, but you can do so in an unsupervised way (as long as you choose the environment wisely).