What is the model architecture? I'm assuming it's far from LLMs, but I'm curious to know more. Can anyone provide links that describe VLA architectures?
replies(1):
It's a "visual language action" VLA model "built on the foundations of Gemini 2.0".
Since Gemini 2.0 has native language, audio, and video support, I suspect it has been adapted to handle native "action" data too, perhaps only via output fine-tuning rather than as an input/output modality at the pretraining stage (given its Gemini 2.0 foundation).
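A toy sketch of that idea in PyTorch, to make it concrete: a frozen multimodal backbone produces hidden states, and a small fine-tuned head decodes continuous robot actions from them. This is not Gemini's actual architecture; every name, dimension, and the frozen-backbone choice here is my assumption.

    import torch
    import torch.nn as nn

    class ActionHead(nn.Module):
        """Maps a backbone hidden state to a short horizon of continuous actions."""
        def __init__(self, hidden_dim: int, action_dim: int, horizon: int):
            super().__init__()
            self.horizon = horizon
            self.action_dim = action_dim
            self.proj = nn.Linear(hidden_dim, horizon * action_dim)

        def forward(self, h: torch.Tensor) -> torch.Tensor:
            # h: (batch, hidden_dim), e.g. the hidden state at the final position
            out = self.proj(h)
            return out.view(-1, self.horizon, self.action_dim)

    class ToyVLA(nn.Module):
        """Toy VLA: frozen multimodal backbone + trainable action head."""
        def __init__(self, backbone: nn.Module, hidden_dim=512, action_dim=7, horizon=8):
            super().__init__()
            self.backbone = backbone           # stands in for a Gemini-class multimodal LLM
            for p in self.backbone.parameters():
                p.requires_grad = False        # "output fine-tuning": train only the head
            self.action_head = ActionHead(hidden_dim, action_dim, horizon)

        def forward(self, tokens: torch.Tensor) -> torch.Tensor:
            h = self.backbone(tokens)          # (batch, seq, hidden_dim)
            return self.action_head(h[:, -1])  # decode actions from the last position

    # Usage with a stand-in backbone (a real VLA would feed interleaved image+text tokens):
    backbone = nn.Sequential(
        nn.Embedding(1000, 512),
        nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
            num_layers=2))
    model = ToyVLA(backbone)
    tokens = torch.randint(0, 1000, (1, 32))
    actions = model(tokens)                    # (1, 8, 7): 8 future steps of 7-DoF commands
    print(actions.shape)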
Natively multimodal LLMs are basically brains.
https://arxiv.org/abs/2506.01844
Explanation by PhosphoAI: https://www.youtube.com/watch?v=00A6j02v450