It's a "vision-language-action" (VLA) model, "built on the foundations of Gemini 2.0".
Since Gemini 2.0 natively supports language, audio and video, I suspect it has been adapted to handle "action" data natively as well, perhaps only as a fine-tuned output modality rather than as a full input/output modality at the pre-training stage (given its Gemini 2.0 foundation).
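To make the "native action output" idea concrete, here is a minimal sketch of one common way action data gets bolted onto an existing multimodal LLM: each continuous action dimension is discretized into a small number of bins, and each bin is treated as an extra vocabulary token the model is fine-tuned to emit (RT-2 and OpenVLA work roughly this way; whether Gemini Robotics does anything similar is pure speculation on my part). The bin count and action range below are illustrative assumptions, not anything from Google's announcement.

```python
# Hypothetical sketch: discretizing continuous robot actions into extra
# "action tokens" that a multimodal LLM could be fine-tuned to emit.
# This mirrors the RT-2 / OpenVLA recipe; it says nothing about Gemini
# Robotics' actual internals.
import numpy as np

N_BINS = 256                         # bins per action dimension (assumption)
ACTION_LOW, ACTION_HIGH = -1.0, 1.0  # normalized action range (assumption)

def actions_to_tokens(action: np.ndarray) -> list[int]:
    """Map each continuous action dimension to a bin index ("token id")."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    unit = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)  # -> [0, 1]
    return list((unit * (N_BINS - 1)).round().astype(int))

def tokens_to_actions(tokens: list[int]) -> np.ndarray:
    """Invert the mapping so a decoded token sequence becomes a robot command."""
    unit = np.asarray(tokens, dtype=float) / (N_BINS - 1)
    return unit * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW

# Example: a 7-DoF end-effector delta (x, y, z, roll, pitch, yaw, gripper)
action = np.array([0.1, -0.3, 0.05, 0.0, 0.0, 0.2, 1.0])
tokens = actions_to_tokens(action)       # the "words" the LLM would output
recovered = tokens_to_actions(tokens)    # close to `action`, up to quantization error
print(tokens, recovered)
```

With this kind of scheme, "adding actions" is mostly a fine-tuning job on the output side: the model's vocabulary is extended (or a reserved token range is repurposed) and it learns to emit action-token sequences conditioned on images and instructions, which is why I suspect it can be done on top of an existing Gemini 2.0 checkpoint.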
Natively multimodal LLMs are basically brains.
Absolutely not.
https://arxiv.org/abs/2506.01844
Explanation by PhosphoAI: https://www.youtube.com/watch?v=00A6j02v450
The only suggestion I have is “study more”.
If it looks like a duck and quacks like a duck...
Just because it is alien to you does not mean it is not a brain; please go look up the definition of the word.
And my comment is useful: a VLA implies that it processes its input and output natively, which is something a brain does, hence my comment.