
214 points | meetpateltech | 3 comments
polskibus ◴[] No.44370919[source]
What is the model architecture? I'm assuming it's far from a standard LLM, but I'm curious to know more. Can anyone share links that describe VLA architectures?
replies(1): >>44371031 #
KoolKat23 ◴[] No.44371031[source]
Actually very close to one I'd say.

It's a "vision-language-action" (VLA) model "built on the foundations of Gemini 2.0".

As Gemini 2.0 has native language, audio, and video support, I suspect it has been adapted to emit native "action" data too, perhaps only through output fine-tuning rather than as a full input/output modality at the pretraining stage (given its Gemini 2.0 foundation).

Natively multimodal LLMs are basically brains.
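
For anyone wanting a concrete picture of that "adapted for action output" idea, here's a minimal sketch: a frozen pretrained multimodal backbone with a small fine-tuned action head bolted onto the last hidden state. All class names, dimensions, and the 7-DoF action format are made up for illustration; the Gemini Robotics internals aren't public.

```python
# Hedged sketch of extending a multimodal LLM backbone with an "action" output
# modality via output-only fine-tuning. VLMBackbone, ActionHead, and all sizes
# are hypothetical placeholders, not the actual Gemini Robotics architecture.
import torch
import torch.nn as nn

class VLMBackbone(nn.Module):
    """Stand-in for a pretrained vision-language model (kept frozen here)."""
    def __init__(self, hidden_dim: int = 512):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, hidden_dim) -- already-embedded image/text tokens
        return self.encoder(tokens)

class ActionHead(nn.Module):
    """Small head fine-tuned to map the backbone's final hidden state to a robot action."""
    def __init__(self, hidden_dim: int = 512, action_dim: int = 7):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
                                  nn.Linear(hidden_dim, action_dim))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Predict one action vector from the last token, e.g. 6-DoF delta + gripper.
        return self.proj(hidden[:, -1, :])

backbone, head = VLMBackbone(), ActionHead()
for p in backbone.parameters():
    p.requires_grad = False            # "output fine-tuning" only, per the guess above

obs_tokens = torch.randn(1, 64, 512)   # placeholder for embedded camera + instruction tokens
action = head(backbone(obs_tokens))    # -> tensor of shape (1, 7)
print(action.shape)
```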

replies(2): >>44371303 #>>44372072 #
1. martythemaniak ◴[] No.44371303[source]
OpenVLA is basically a slightly modified, fine-tuned Llama 2. I found the launch/intro talk by the lead author quite accessible: https://www.youtube.com/watch?v=-0s0v3q7mBk
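
The neat trick in OpenVLA is that actions become just another kind of token: each continuous action dimension is clipped to its 1st-99th percentile over the training data and discretized into 256 bins, which are mapped onto rarely used token IDs in the Llama tokenizer. Here's a rough sketch of that mapping; the helper names and the token-ID offset are placeholders of mine, not the actual OpenVLA codebase.

```python
# Rough sketch of OpenVLA-style action tokenization: clip each action
# dimension to its 1st/99th percentile, bucket it into 256 bins, and map the
# bins onto a block of token IDs. Names and vocab_offset are illustrative.
import numpy as np

N_BINS = 256

def make_bins(actions: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Per-dimension lower/upper bounds from the 1st/99th percentiles of training actions."""
    lo = np.percentile(actions, 1, axis=0)
    hi = np.percentile(actions, 99, axis=0)
    return lo, hi

def actions_to_tokens(a, lo, hi, vocab_offset: int = 31744) -> np.ndarray:
    """Continuous action vector -> discrete token IDs (vocab_offset is a placeholder)."""
    a = np.clip(a, lo, hi)
    bins = np.round((a - lo) / (hi - lo) * (N_BINS - 1)).astype(int)
    return vocab_offset + bins

def tokens_to_actions(tok, lo, hi, vocab_offset: int = 31744) -> np.ndarray:
    """Invert the mapping: token IDs -> approximate continuous actions."""
    bins = tok - vocab_offset
    return lo + bins / (N_BINS - 1) * (hi - lo)

# Toy example with a 7-D action space (xyz delta, rpy delta, gripper).
train_actions = np.random.uniform(-1, 1, size=(10_000, 7))
lo, hi = make_bins(train_actions)
a = np.array([0.1, -0.2, 0.05, 0.0, 0.3, -0.1, 1.0])
toks = actions_to_tokens(a, lo, hi)
print(toks, tokens_to_actions(toks, lo, hi))
```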
replies(2): >>44373332 #>>44374719 #
2. m00x ◴[] No.44373332[source]
A more modern one, SmolVLA, is similar: it uses a VLM backbone but skips the upper layers and adds a small action expert for outputs. It's from HF and runs on LeRobot.

https://arxiv.org/abs/2506.01844

Explanation by PhosphoAI: https://www.youtube.com/watch?v=00A6j02v450
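
If it helps, here's a toy sketch of that layout: a VLM whose forward pass stops at an intermediate layer, with a small action expert cross-attending to those features to produce a chunk of future actions. Layer counts, dimensions, and class names are illustrative only, and the real SmolVLA trains its expert with flow matching rather than the plain decoder used here.

```python
# Minimal sketch of a SmolVLA-style layout: truncate the VLM at an
# intermediate layer, then let a small "action expert" cross-attend to those
# features and emit a chunk of actions. Not the actual smolVLA/LeRobot code.
import torch
import torch.nn as nn

class TruncatedVLM(nn.Module):
    """Stand-in VLM whose forward pass stops partway up the stack."""
    def __init__(self, dim: int = 384, n_layers: int = 8, keep_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=6, batch_first=True)
            for _ in range(n_layers))
        self.keep_layers = keep_layers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers[: self.keep_layers]:   # skip the upper layers
            x = layer(x)
        return x

class ActionExpert(nn.Module):
    """Learned action queries cross-attend to VLM features and output an action chunk."""
    def __init__(self, dim: int = 384, chunk: int = 8, action_dim: int = 7):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, chunk, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=6, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(dim, action_dim)

    def forward(self, vlm_feats: torch.Tensor) -> torch.Tensor:
        q = self.queries.expand(vlm_feats.size(0), -1, -1)
        return self.out(self.decoder(q, vlm_feats))     # (batch, chunk, action_dim)

vlm, expert = TruncatedVLM(), ActionExpert()
obs = torch.randn(2, 128, 384)    # placeholder for embedded image + instruction tokens
actions = expert(vlm(obs))        # -> (2, 8, 7): a chunk of future actions
print(actions.shape)
```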

3. KoolKat23 ◴[] No.44374719[source]
In the paper linked at the bottom of Google's page, this VLA is described as built on the foundations of Gemini 2.0 (hence my quotation marks). They'd be using Gemini 2.0 rather than Llama.

https://arxiv.org/pdf/2503.20020