
209 points alexcos | 28 comments
dchftcs ◴[] No.44419191[source]
Pure vision will never be enough because it does not contain information about the physical feedback like pressure and touch, or the strength required to perform a task.

For example, so that you don't crush a human when doing massage (but still need to press hard), or apply the right amount of force (and finesse?) to skin a fish fillet without cutting the skin itself.

Practically, in the near term, it's hard to sample failure examples from YouTube videos, such as food accidentally spilling out of a pot. Studying simple tasks only through the happy path makes it hard to get the robot to figure out how to keep trying something until it succeeds, which matters even in relatively simple jobs like shuffling garbage.

With that said, I suppose a robot can be made to practice in real life after learning something from vision.

replies(4): >>44419561 #>>44419692 #>>44420011 #>>44426961 #
1. carlosdp ◴[] No.44420011[source]
> Pure vision will never be enough because it does not contain information about the physical feedback like pressure and touch, or the strength required to perform a task.

I'm not sure that's necessarily true for a lot of tasks.

A good way to measure this in your head is this:

"If you were given remote control of two robot arms, and just one camera to look through, how many different tasks do you think you could complete successfully?"

When you start thinking about it, you realize there are a lot of things you could do with just the arms and one camera, because you as a human have really good intuition about the world.

It therefore follows that robots should be able to learn with just RGB images too! Counterexamples would be things like grabbing an egg without crushing it, perhaps. Though I suspect that could also be done with just vision.

replies(9): >>44420219 #>>44420289 #>>44420630 #>>44420695 #>>44420919 #>>44421236 #>>44423275 #>>44425473 #>>44427030 #
2. jpc0 ◴[] No.44420219[source]
I think you vastly underestimate how difficult the task you are proposing would be without depth or pressure indication, even for a super intelligence like humans.

Simple concept: pick up a glass and pour its contents into a vertical hole approximately the size of your mouth. Think of all the failure modes that can be triggered in this trivial example you perform multiple times a day. Doing the same from a single camera feed, with no other indicators, would take you hours to master, and you are already a superintelligent being.

replies(3): >>44420596 #>>44420608 #>>44420928 #
3. moefh ◴[] No.44420289[source]
> It therefore follows that robots should be able to learn with just RGB images too!

I don't see how that follows. Humans have trained by experimenting with actually manipulating things, not just by vision. It's not clear at all that someone who had gained intuition about the world exclusively by looking at it would have any success with mechanical arms.

replies(1): >>44423246 #
4. stavros ◴[] No.44420596[source]
If I have to pour water into my mouth, you can bet it's going all over my shirt. That's not how we drink.
replies(1): >>44420718 #
5. jrimbault ◴[] No.44420608[source]
A routine gesture I've done every day for almost all my life: getting a glass out of the shelves and into my left hand. It seems like a no-brainer: I open the cabinet with my left hand, take the glass with my right hand, throw the glass from my right hand to my left while closing the cabinet with my shoulder, put the glass under the faucet with my left hand, and open the faucet with my right.

I have done this three-second gesture, and variations of it, my whole life basically, and never noticed I was throwing the glass from one hand to the other without any visual feedback.

replies(1): >>44424435 #
6. jaisio ◴[] No.44420630[source]
> When you start thinking about it, you realize there are a lot of things you could do with just the arms and one camera, because you as a human have really good intuition about the world.

And where does this intuition come from? It was built by feeling other sensations in addition to vision. You learned how gravity pulls things down when you were a kid; how hot/cold feels, how hard/soft feels, how things smell. Your mental model of the world is substantially informed by non-visual cues.

> It therefore follows that robots should be able to learn with just RGB images too!

That does not follow at all! It's not how you learned either.

Neither have you learned to think by consuming the entirety of all text produced on the internet. LLMs therefore don't think, they are just pretty good at faking the appearance of thinking.

7. suddenlybananas ◴[] No.44420695[source]
Humans have innate knowledge that helps them interact with the world, and they can learn from physical interaction for the rest. RGB images aren't enough.
replies(1): >>44420722 #
8. jpc0 ◴[] No.44420718{3}[source]
Except this is the absolutely most common thing humans do, and my argument is not that it will spill water all over, but rather that it will shatter numerous glasses, knock them over, etc., all before it has even picked up the glass.

The same process will be repeated many times trying to move the glass to its “face”, and then if any variable changes (plastic vs. glass, size, shape, location) all bets are off, purely because there just plainly isn't enough information.

9. whatever1 ◴[] No.44420722[source]
Video games have shown that we can control pretty darn well characters in virtual worlds where we have not experienced their physics. We just look at a 2D monitor and using a joystick/keyboard we manage to figure it out.
replies(2): >>44421108 #>>44421256 #
10. abenga ◴[] No.44420919[source]
Humans did not accumulate that intuition just using images. In the example you gave, you subconsciously augment the image information with a lifetime of interacting with the world using all the other senses.
replies(1): >>44422133 #
11. var_cw ◴[] No.44420928[source]
The point is how much non-vision sensors, versus pure vision, help humans be humans. Don't you think LLMs already proved that generalizability doesn't come from multi-modality but from scaling a single modality itself? And JEPA is surely designed to do a better job at that than an LLM. So no doubt raw scaling plus an RL boost will yield highly predictable and specific robotic movements.
replies(2): >>44422229 #>>44427112 #
12. suddenlybananas ◴[] No.44421108{3}[source]
Yeah, but we already have a conception of what physics should be prior to that, which helps us enormously. It's not like game designers are coming up with stuff that intentionally breaks our naïve physics.
replies(1): >>44427148 #
13. deadfoxygrandpa ◴[] No.44421236[source]
Counterpoint: think about all the tasks you could do with your hands and arms while your eyes are closed. I think it's really a lot of stuff, considering blind people can do the vast majority of things sighted people can do, and I suspect anything you could do with your eyes closed would be extremely difficult to do with a camera feed as the literal only sensory input.
14. deadfoxygrandpa ◴[] No.44421256{3}[source]
A game has very limited physics: the buttons you press are pre-tuned to perform certain actions, and you aren't dealing with continuous, nearly infinite possibilities across large ranges of motion, pressure, speed, etc. Think about how difficult the game QWOP is because you mostly just have visual feedback.
replies(1): >>44428032 #
15. amelius ◴[] No.44422133[source]
Yes, without extra information, manipulating everyday objects is probably as intuitive to robots as manipulating quantum scale molecules is for humans.
16. datameta ◴[] No.44422229{3}[source]
> generalizability doesn't come from multi-modality but by scaling a single modality itself

Could you expand on what you mean by this?

17. amelius ◴[] No.44423246[source]
You'd use a two-step approach.

1. First create a model that can evaluate how well a task is going; the YT approach can be used here.

2. Then build a real-world robot, and train it by letting it do tasks, and use the first model to supervise it; here the robot can learn to rely on extra senses such as touch/pressure.
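A toy sketch of the two steps, where everything is a hypothetical stand-in: the "video model" is a hard-coded scorer, and the real world is a one-line dynamics function in which over-pressing (past what the touch sensor reads) ruins the outcome:

```python
import numpy as np

# Step 1 (stand-in): a frozen scorer playing the role of a model trained
# on YouTube video. It only sees the resulting state, never the forces.
def video_score(state):
    return state[0]  # rewards task progress

# Step 2: the robot tries actions in the "real world". Touch shapes the
# outcome: pressing past the touch limit actively degrades the result.
def rollout(action, touch_limit):
    overpress = max(0.0, action - touch_limit)
    return np.array([action - 2.0 * overpress])

touch_limit = 0.7  # reading from the robot's own pressure sensor
best_action, best = None, -np.inf
for action in np.linspace(0.0, 2.0, 21):
    state = rollout(action, touch_limit)
    if video_score(state) > best:
        best, best_action = video_score(state), action

# The supervisor is vision-only, yet the chosen action respects touch,
# because touch shaped the real-world outcomes being scored.
```

The point of the toy: the vision-trained model never needs a pressure channel; it only needs to score outcomes, while the trial-and-error loop lets touch enter through the environment itself.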

replies(1): >>44427051 #
18. corimaith ◴[] No.44423275[source]
>"If you were given remote control of two robot arms, and just one camera to look through, how many different tasks do you think you could complete successfully?"

There are an infinite number of scenes that can be matched to one 2D picture. And what is a scene, really? The last time I checked, raw RGB was not a good input representation in computer vision; instead it relied on increasing levels of gradients, via CNNs, to build a compositional scene. None of that is particularly translatable to how an LM works with text.

19. gregmac ◴[] No.44424435{3}[source]
And you're used to the weight of the glass, which you instantly recognize when you pick it up. If it was a different weight than you were expecting, you'd probably slow down and be more deliberate.

If you were to just do the exact same robotic "throw" action with a glass of unexpected weight you'd maybe not throw hard enough and miss, or throw too hard and possibly break it.

20. ◴[] No.44425473[source]
21. godelski ◴[] No.44427030[source]

  > because you as a human have really good intuition about the world.
This is the line that causes your logic to fail.

You introduced knowledge not obtained through observation. In fact, the knowledge you introduced is the whole chimichanga! It is an easy mistake to make, so don't feel embarrassed.

The claim is that one can learn a world model[0] through vision. The parent countered by saying "vision is not enough." Then you countered by saying "vision is enough if you already have a world model."

[0] I'll be more precise here. You can learn *A* world model, but it isn't the one we really care about and "a world" doesn't require being a self consistent world. We could say the same thing about "a physics", but let's be real, when we say "physics" we know which one is being discussed...

22. godelski ◴[] No.44427051{3}[source]
You're agreeing with the parent, btw. You've introduced a lot more than just vision: you introduced interventional experimentation. That's a lot more than just observation.
replies(1): >>44427185 #
23. godelski ◴[] No.44427112{3}[source]

  > LLMs already that generalizability
This is not a proven statement. In fact, it's pretty clear that they don't. They have some generalization but that's not enough for what you're inferring. The best way to show this is to carefully talk to an LLM about anything you have a lot of domain expertise in. Be careful to not give it answers (information leakage can sneak in subtly) and specifically look for those small subtle details (that's why it needs to be a topic you have expertise in). "The smell" will be right but the information won't.

Also, LLMs these days aren't trained on just language

24. godelski ◴[] No.44427148{4}[source]
I mean, they do, but we often have generalized (to some degree) world models. So when they do things like change gravity, flip things upside down, or make even more egregious changes, we can adapt, because we have counterfactual models. But yeah, they could change things so much that you'd really have to relearn, and that could be very, very difficult if not impossible. (I wonder if anyone has created a playable game with physics that's impossible for humans to learn, at least without "pen and paper". I think you could do this by putting the game in higher dimensions.)
25. amelius ◴[] No.44427185{4}[source]
What I describe is an unsupervised system.

What you say ("interventional") sounds like it's human-supervised.

But maybe I'm interpreting it in the wrong way, so please correct me if so.

replies(1): >>44428379 #
26. whatever1 ◴[] No.44428032{4}[source]
I beg to disagree. I was introduced to the brand-new (to me) physics of flying airplanes by MS Flight Simulator. None of the rules I knew from real life applied (gravity matters only sometimes, height can be traded for speed, etc.). Yet I learned how to fly.

And when I took real classes in a real Cessna, this experience was transferable (aka the flying model I had in my brain was very similar to the one I experienced with my full body in the cockpit).

27. godelski ◴[] No.44428379{5}[source]
By "intervention" I mean interacting with the environment: propose a hypothesis, test, modify, test again. You can frame RL this way, though RL usually generates hypotheses that are far too naïve.

This looks like a good brief overview (I only skimmed it, but wanted to give you more than "lol, google it"): http://smithamilli.com/blog/causal-ladder/
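The observation-vs-intervention gap is easy to demo in a few lines. This is a toy confounded system with made-up numbers, not anything from the linked post:

```python
import numpy as np

rng = np.random.default_rng(1)

# Ground truth (hidden from the learner): Z causes both X and Y.
# Observationally X and Y correlate strongly, but intervening on X
# does nothing to Y.
def observe(n):
    z = rng.normal(size=n)
    x = z + 0.1 * rng.normal(size=n)
    y = z + 0.1 * rng.normal(size=n)
    return x, y

def intervene_on_x(n, x_value):
    z = rng.normal(size=n)
    x = np.full(n, x_value)            # do(X = x_value)
    y = z + 0.1 * rng.normal(size=n)   # Y ignores the intervention
    return x, y

x, y = observe(10_000)
obs_corr = np.corrcoef(x, y)[0, 1]     # high: "X looks like it drives Y"

_, y_do = intervene_on_x(10_000, 2.0)
effect = y_do.mean()                   # near zero: no causal effect
```

A learner that only watches (rung one of the ladder) sees the strong correlation; only a learner that can set X itself discovers that the correlation isn't causal.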

replies(1): >>44432936 #
28. amelius ◴[] No.44432936{6}[source]
Yes, you need to let the robot play (interact with the environment) to learn the vision-versus-touch correlations, but you can do so in an unsupervised way (as long as you choose the environment wisely).
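Something like this toy "play" loop is what I mean. The touch reading is its own training target, so no human labels are needed; the deformation feature and the linear relation are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Robot play: random pokes. Each trial logs a visual feature (apparent
# surface deformation, hypothetical) and the pressure the robot's own
# touch sensor reported. The touch reading serves as the label.
n = 500
pressure = rng.uniform(0.0, 1.0, size=n)
visual_deformation = 2.0 * pressure + 0.05 * rng.normal(size=n)

# Unsupervised in the human sense: least-squares fit predicting the
# touch channel from the vision channel.
A = np.stack([visual_deformation, np.ones(n)], axis=1)
coef, *_ = np.linalg.lstsq(A, pressure, rcond=None)

def predict_pressure(deformation):
    """Estimate contact pressure from vision alone."""
    return coef[0] * deformation + coef[1]
```

After enough play, vision alone carries a usable estimate of touch, which is the correlation the robot needs, provided the play environment exercises the relevant contacts.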