We accidentally solved robotics by watching 1M hours of YouTube

(ksagar.bearblog.dev)

dchftcs ◴[30 Jun 25 03:53 UTC] No.44419191[source]▶

Pure vision will never be enough because it does not contain information about the physical feedback like pressure and touch, or the strength required to perform a task.

For example, a robot giving a massage has to press hard without crushing the person, and skinning a fish fillet takes the right amount of force (and finesse?) to avoid cutting the skin itself.

Practically, in the near term, it's hard to sample failure examples from videos on YouTube, such as when food accidentally spills out of the pot. Studying simple tasks only through the happy path makes it hard to get the robot to figure out how to keep trying until it succeeds, a problem that can appear even in relatively simple jobs like shuffling garbage.

With that said, I suppose a robot can be made to practice in real life after learning something from vision.

replies(4): >>44419561 #>>44419692 #>>44420011 #>>44426961 #

carlosdp ◴[30 Jun 25 06:16 UTC] No.44420011[source]▶

>>44419191 #

> Pure vision will never be enough because it does not contain information about the physical feedback like pressure and touch, or the strength required to perform a task.

I'm not sure that's necessarily true for a lot of tasks.

A good way to measure this in your head is this:

"If you were given remote control of two robot arms, and just one camera to look through, how many different tasks do you think you could complete successfully?"

When you start thinking about it, you realize there are a lot of things you could do with just the arms and one camera, because you as a human have really good intuition about the world.

It therefore follows that robots should be able to learn with just RGB images too! Counterexamples would be things like grabbing an egg without crushing it, perhaps. Though I suspect that could also be done with just vision.

replies(9): >>44420219 #>>44420289 #>>44420630 #>>44420695 #>>44420919 #>>44421236 #>>44423275 #>>44425473 #>>44427030 #

1. jpc0 ◴[30 Jun 25 06:46 UTC] No.44420219[source]▶

>>44420011 #

I think you vastly underestimate how difficult the task you are proposing would be without depth or pressure indication, even for a super intelligence like humans.

Simple concept: pick up a glass and pour its contents into a vertical hole approximately the size of your mouth. Think of all the failure modes that can be triggered in this trivial example, which you do multiple times a day. Doing the same from a single camera feed with no other indicators would take you hours to master, and you are already a super intelligent being.

replies(3): >>44420596 #>>44420608 #>>44420928 #

2. stavros ◴[30 Jun 25 07:53 UTC] No.44420596[source]▶

>>44420219 (TP) #

If I have to pour water into my mouth, you can bet it's going all over my shirt. That's not how we drink.

replies(1): >>44420718 #

3. jrimbault ◴[30 Jun 25 07:54 UTC] No.44420608[source]▶

>>44420219 (TP) #

A routine gesture I've done every day for almost all my life: getting a glass out of the shelves and into my left hand. It seems like a no-brainer: I open the cabinet with my left hand, take the glass with my right hand, and throw the glass from my right hand to my left hand while closing the cabinet with my shoulder. Then I put the glass under the faucet with my left hand and open the faucet with my right hand.

I have done this 3-second gesture, and variations of it, basically my whole life, and never noticed I was throwing the glass from one hand to the other without any visual feedback.

replies(1): >>44424435 #

4. jpc0 ◴[30 Jun 25 08:12 UTC] No.44420718[source]▶

>>44420596 #

Except this is absolutely the most common thing humans do, and my argument is not that it will spill water all over, but rather that it will shatter numerous glasses, knock them over, etc., all before it has picked up the glass.

The same process will be repeated many times trying to move the glass to its “face”, and then when any variable changes (plastic vs glass, size, shape, location), all bets are off, purely because there just plainly isn't enough information.

5. var_cw ◴[30 Jun 25 08:45 UTC] No.44420928[source]▶

>>44420219 (TP) #

The point is how much non-vision sensors, versus pure vision, help humans to be humans. Don't you think LLMs have already proven that generalizability doesn't come from multi-modality but from scaling a single modality itself? And JEPA is for sure designed to do a better job at that than an LLM. So there's no doubt that raw scaling + an RL boost will kick in highly predictable & specific robotic movements.

replies(2): >>44422229 #>>44427112 #

6. datameta ◴[30 Jun 25 12:02 UTC] No.44422229[source]▶

>>44420928 #

> generalizability doesn't come from multi-modality but by scaling a single modality itself

Could you expand on what you mean by this?

7. gregmac ◴[30 Jun 25 15:22 UTC] No.44424435[source]▶

>>44420608 #

And you're used to the weight of the glass, which you instantly recognize when you pick it up. If it was a different weight than you were expecting, you'd probably slow down and be more deliberate.

If you were to just do the exact same robotic "throw" action with a glass of unexpected weight you'd maybe not throw hard enough and miss, or throw too hard and possibly break it.

8. godelski ◴[30 Jun 25 19:48 UTC] No.44427112[source]▶

>>44420928 #

  > LLMs already that generalizability

This is not a proven statement. In fact, it's pretty clear that they don't. They have some generalization but that's not enough for what you're inferring. The best way to show this is to carefully talk to an LLM about anything you have a lot of domain expertise in. Be careful to not give it answers (information leakage can sneak in subtly) and specifically look for those small subtle details (that's why it needs to be a topic you have expertise in). "The smell" will be right but the information won't.

Also, LLMs these days aren't trained on just language.
