
392 points lairv | 5 comments
1. NewUser76312 ◴[] No.45530251[source]
It's interesting that people compare this to GPT-2. While it sounds like a nice analogy, or even a good story for investors, the fundamentals are very different.

All of the training data needed for GPT (the internet's text, scanned books, etc.) already existed before the GPT project began. Arguably, the compute required for GPT-3 also existed before GPT-2 did.

The GPT project really just came down to investing in all of the pieces to take the ideas from a 2017 research paper to the next level. Nobody knew whether X thousand GPUs, plus all of the internet's text, plus transformer networks, would work out. But somebody took the risk of putting the existing pieces together and proved that it could.

There's no analogy here to humanoid robotics. Not only is the data required for neural-network-operated humanoids close to non-existent at the scale needed, but the nature of the data itself is enormously more complicated than taking a list of tokens from a vocabulary and outputting one more token from the same vocabulary.
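
To make the contrast concrete, a toy sketch of the two prediction problems (random stand-ins, not any real model's interface):

    import random

    VOCAB_SIZE = 50_000

    def next_token(tokens: list[int]) -> int:
        # LLM pretraining: token ids in, one more id from the same vocabulary out
        return random.randrange(VOCAB_SIZE)  # stand-in for sampling from model logits

    def next_action(rgb_frame, joint_angles: list[float], joint_velocities: list[float]) -> list[float]:
        # Humanoid control: continuous, multi-modal observations in, ~30 continuous
        # actuator targets out at a high control rate, with no shared vocabulary to predict over
        return [random.uniform(-1.0, 1.0) for _ in range(30)]  # stand-in for a policy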

That being said, I still applaud the ambition of the Figure team. While I think it's clear they are presenting incredibly cherry-picked examples, they aren't trying to mislead consumers with a product for sale (because... they can't). Instead, they are productizing important research for investors who would otherwise waste money on less important, less ambitious projects. So overall I find projects of this nature to be a net positive for technical innovation.

replies(4): >>45530501 #>>45530723 #>>45534837 #>>45539930 #
2. neom ◴[] No.45530501[source]
Not directly related to what you're saying (I take your point), but you might find these ideas interesting! :)

https://openreview.net/forum?id=3RSLW9YSgk

https://www.nature.com/articles/s41586-025-08744-2

https://arxiv.org/abs/2501.10100

https://www.datacamp.com/blog/genesis-physics-engine

3. sosodev ◴[] No.45530723[source]
Isn't that assuming training methods remain the same?

It seems like learning from the environment will be a requirement for robots to scale. My understanding is that research has been yielding new architectures that might support that kind of real-time, general learning, but we haven't seen a similarly large investment in them yet.

4. legucy ◴[] No.45534837[source]
Could we do RL in simulated environments and use a vision LLM to provide the verification? I.e., test a policy, then take a 2D image of the end state, and the VLM yields a 0 or 1.
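
A minimal sketch of how I'd picture that loop (everything here, env, policy, vlm_judge, is a hypothetical stand-in, not a real API): roll out the policy in sim, render the end state, let the VLM return a sparse 0/1 reward.

    # Toy sketch of RL-in-sim with a VLM verifier; all names are hypothetical placeholders
    def rollout_with_vlm_reward(policy, env, vlm_judge, question: str) -> float:
        obs, done = env.reset(), False
        while not done:
            obs, done = env.step(policy.act(obs))
        end_state_image = env.render()  # 2D image of the final scene
        # e.g. question = "Is the laundry fully inside the dryer? Answer yes or no."
        return 1.0 if vlm_judge(end_state_image, question) == "yes" else 0.0

The sparse 0/1 signal could then plug into any standard policy-search loop; the hard parts would be the VLM's judging accuracy and the sim-to-real gap.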

Another idea: a video-extension model as a world model. We fine-tune Sora on first-person robot videos (and train another model to predict actuation states from FPV). Then we extend the live video with Sora using a prompt like "a robot in first person view finishes moving laundry from washer to dryer", and finally predict actuation states from the extended video?
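
Roughly this pipeline, as pseudocode (every name below is a hypothetical placeholder; as far as I know Sora exposes no fine-tuning or video-extension API):

    # Hypothetical pipeline, not a real API:
    #   video_model: a video model fine-tuned on first-person robot footage
    #   action_decoder: a separately trained model mapping FPV frames to actuation states
    def plan_with_video_world_model(video_model, action_decoder, fpv_clip, task: str):
        # "Dream" the rest of the task as video, conditioned on a text prompt, e.g.
        # task = "a robot in first person view finishes moving laundry from washer to dryer"
        imagined_frames = video_model.extend(fpv_clip, prompt=task)
        # Decode each imagined frame into joint/actuator targets to execute or track
        return [action_decoder.predict(frame) for frame in imagined_frames]

The open question is whether the imagined frames are physically consistent enough for the decoded actuation states to be executable.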

5. tiborsaas ◴[] No.45539930[source]
> There's no analogy here to humanoid robotics.

We don't really need an analogy here, because we just have to look at ourselves: we are the analogy. New training data comes from experiencing the world and learning from failed tasks.