
209 points | alexcos | 1 comment
dchftcs No.44419191
Pure vision will never be enough because it does not contain information about physical feedback like pressure and touch, or the force required to perform a task.

For example, you need that feedback so you don't crush a person while giving a massage (even though you still need to press hard), or so you apply the right amount of force (and finesse) to skin a fish fillet without cutting through the skin itself.

Practically, in the near term, it's hard to sample failure examples from YouTube videos, such as food accidentally spilling out of a pot. Studying simple tasks only through the happy path makes it hard for the robot to figure out how to keep trying until it succeeds, a problem that shows up even in relatively simple jobs like shuffling garbage.

With that said, I suppose a robot can be made to practice in real life after learning something from vision.

replies(4): >>44419561 #>>44419692 #>>44420011 #>>44426961 #
carlosdp No.44420011
> Pure vision will never be enough because it does not contain information about physical feedback like pressure and touch, or the force required to perform a task.

I'm not sure that's necessarily true for a lot of tasks.

A good way to gauge this in your head is to ask yourself:

"If you were given remote control of two robot arms, and just one camera to look through, how many different tasks do you think you could complete successfully?"

When you start thinking about it, you realize there are a lot of things you could do with just the arms and one camera, because you as a human have really good intuition about the world.

It therefore follows that robots should be able to learn with just RGB images too! Counterexamples would be things like grabbing an egg without crushing it, perhaps, though I suspect that could also be done with just vision.

replies(9): >>44420219 #>>44420289 #>>44420630 #>>44420695 #>>44420919 #>>44421236 #>>44423275 #>>44425473 #>>44427030 #
moefh No.44420289
> It therefore follows that robots should be able to learn with just RGB images too!

I don't see how that follows. Humans have trained by experimenting with actually manipulating things, not just by vision. It's not clear at all that someone who had gained intuition about the world exclusively by looking at it would have any success with mechanical arms.

replies(1): >>44423246 #
amelius No.44423246
You'd use a two-step approach.

1. First create a model that can evaluate how well a task is going; the YouTube approach can be used here.

2. Then build a real-world robot and train it by letting it attempt tasks, using the first model to supervise it; here the robot can learn to rely on extra senses such as touch/pressure (a rough sketch of this loop follows below).
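
To make the structure concrete, here is a minimal toy sketch of how the two steps could fit together. Everything in it (the critic, the one-parameter "pressure" task, the target value 0.7) is made up purely for illustration, not a real robot-learning setup:

```python
import random


class VideoSuccessCritic:
    """Step 1: scores how well the task is going from what the camera sees.
    Stand-in for a model pretrained offline on demonstration videos."""

    def score(self, visual_obs: float) -> float:
        # Pretend the "looks done" visual state is 0.7 (a made-up target).
        return max(0.0, 1.0 - abs(visual_obs - 0.7))


class TouchAwarePolicy:
    """Step 2: the real robot's policy, which also gets a touch reading.
    Here it learns a single pressure parameter by hill-climbing on the
    critic's score."""

    def __init__(self):
        self.pressure = 0.0
        self.best_score = -1.0

    def act(self, touch_obs: float) -> float:
        # Explore around the current best pressure; let the touch reading
        # shrink the exploration as contact gets firmer (illustrative only).
        return self.pressure + random.uniform(-0.1, 0.1) * (1.0 - touch_obs)

    def update(self, action: float, reward: float) -> None:
        if reward > self.best_score:
            self.best_score, self.pressure = reward, action


class PressEnv:
    """Tiny stand-in world: applied pressure determines both what the camera
    sees and what the gripper feels."""

    def step(self, pressure: float):
        visual_obs = pressure
        touch_obs = max(0.0, min(1.0, pressure))
        return visual_obs, touch_obs


critic, policy, env = VideoSuccessCritic(), TouchAwarePolicy(), PressEnv()
touch = 0.0
for _ in range(500):
    action = policy.act(touch)
    visual, touch = env.step(action)
    reward = critic.score(visual)   # supervision comes from vision only
    policy.update(action, reward)   # the robot still uses its touch sense

print(f"learned pressure: {policy.pressure:.2f}")  # drifts toward 0.7
```

The point of the split is that the reward always comes from the vision-only critic, while the policy is free to exploit whatever extra sensors (touch, pressure) the real robot has.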

replies(1): >>44427051 #
godelski No.44427051
You're agreeing with the parent btw. You've introduced a lot more than just vision. You introduced interventional experimentation. That's a lot more than just observation.
replies(1): >>44427185 #
amelius No.44427185
What I describe is an unsupervised system.

What you say ("interventional") sounds like it's human-supervised.

But maybe I'm interpreting it in the wrong way, so please correct me if so.

replies(1): >>44428379 #
godelski No.44428379
By "intervention" I mean interacting with the environment. Purpose a hypothesis, test, modify, test. You can frame RL this way though RL usually generates hypotheses that are far too naïve.

This looks like a good brief overview (I only skimmed it, but wanted to give you more than "lol, google it"): http://smithamilli.com/blog/causal-ladder/
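
A tiny self-contained toy of that propose/test/modify loop; the five-switch world is invented just to make it concrete and isn't from the linked post:

```python
import random

# The agent doesn't know which of five "switches" controls the light. It
# proposes a hypothesis, tests it by intervening, and modifies its belief.

TRUE_SWITCH = 3  # hidden from the agent

def flip(switch: int) -> bool:
    """The environment: only one switch actually turns the light on."""
    return switch == TRUE_SWITCH

candidates = set(range(5))           # hypotheses still consistent with evidence
while len(candidates) > 1:
    hypothesis = random.choice(sorted(candidates))   # propose
    if flip(hypothesis):                             # test (intervene)
        candidates = {hypothesis}                    # confirmed
    else:
        candidates.discard(hypothesis)               # modify: rule it out

print("learned switch:", candidates.pop())           # -> 3
```

Passive observation alone couldn't distinguish the switches this cleanly; acting on the world is what prunes the hypotheses.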

replies(1): >>44432936 #
amelius No.44432936
Yes, you need to let the robot play (interact with the environment) to learn the vision-versus-touch correlations, but you can do so in an unsupervised way (as long as you choose the environment wisely).
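
As a toy sketch of what that unsupervised play could yield, assume each random poke gives the robot a visual feature vector plus a pressure reading; the touch sensor itself then provides the training signal, with no human labels. All data below is synthetic and purely illustrative:

```python
import numpy as np

# "Play" data: the robot pokes objects at random. Each poke yields a visual
# feature vector and a pressure reading; in this toy world the pressure is an
# unknown linear function of the visual features plus noise.
rng = np.random.default_rng(0)
true_map = rng.normal(size=16)
images = rng.normal(size=(500, 16))                          # visual features
pressures = images @ true_map + 0.1 * rng.normal(size=500)   # touch readings

# Self-supervised step: predict touch from vision via least squares. The touch
# sensor provides the labels, so no human supervision is involved.
learned_map, *_ = np.linalg.lstsq(images, pressures, rcond=None)

test_image = rng.normal(size=16)
print("predicted pressure:", float(test_image @ learned_map))
print("actual pressure:   ", float(test_image @ true_map))
```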