What is the model architecture? I'm assuming it's far away from LLMs, but I'm curious about knowing more. Can anyone provide links that describe architectures for VLA?
replies(1):
It's a "vision-language-action" (VLA) model "built on the foundations of Gemini 2.0".
As Gemini 2.0 has native language, audio and video support, I suspect it has been adapted to handle "action" data natively too, perhaps only on the output side during fine-tuning rather than as input/output during pre-training (given its Gemini 2.0 foundation).
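One common way VLA models add native "action" output to an LLM (used by RT-2-style models; whether Gemini Robotics does exactly this is an assumption) is to discretize each continuous action dimension into bins and emit the bin indices as ordinary vocabulary tokens. A minimal sketch of that tokenization, with illustrative bin counts and ranges:

```python
# Sketch of the "actions as tokens" idea: continuous robot actions are
# quantized into bins, so the language model head can emit them as plain
# tokens. N_BINS and the normalized range are illustrative assumptions.

N_BINS = 256            # bins per action dimension (assumed)
LOW, HIGH = -1.0, 1.0   # assumed normalized action range

def action_to_tokens(action):
    """Discretize each dimension of a continuous action into a token id."""
    tokens = []
    for x in action:
        x = min(max(x, LOW), HIGH)  # clamp to the normalized range
        bin_id = int((x - LOW) / (HIGH - LOW) * (N_BINS - 1) + 0.5)
        tokens.append(bin_id)
    return tokens

def tokens_to_action(tokens):
    """Inverse mapping: token ids back to (quantized) continuous values."""
    return [LOW + t / (N_BINS - 1) * (HIGH - LOW) for t in tokens]

if __name__ == "__main__":
    a = [0.5, -0.25, 0.0]
    toks = action_to_tokens(a)
    print(toks, tokens_to_action(toks))
```

The appeal of this scheme is that the action head needs no new architecture at all: fine-tuning just teaches the existing decoder to produce these token ids, and a small de-tokenizer turns them back into motor commands.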
Natively multimodal LLMs are basically brains.
Absolutely not.
Only suggestion I have is “study more”.
If it looks like a duck and quacks like a duck...
Just because it is alien to you does not mean it is not a brain; please go look up the definition of the word.
And my comment is useful: a VLA implies it is processing its input and output natively, something a brain does, hence my comment.