It's a "vision-language-action" (VLA) model, "built on the foundations of Gemini 2.0".
Since Gemini 2.0 natively supports language, audio and video, I suspect it has been adapted to handle "action" data natively as well, perhaps only as a fine-tuned output modality rather than as a full input/output modality at the pre-training stage (given its Gemini 2.0 foundation).
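To make the "native action output" idea concrete, here is a minimal sketch of one common way action data gets bolted onto an existing multimodal LLM: each continuous action dimension is discretized into a small number of bins, and each bin is treated as an extra vocabulary token the model is fine-tuned to emit (RT-2 and OpenVLA work roughly this way; whether Gemini Robotics does anything similar is pure speculation on my part). The bin count and action range below are illustrative assumptions, not anything from Google's announcement.

```python
# Hypothetical sketch: discretizing continuous robot actions into extra
# "action tokens" that a multimodal LLM could be fine-tuned to emit.
# This mirrors the RT-2 / OpenVLA recipe; it says nothing about Gemini
# Robotics' actual internals.
import numpy as np

N_BINS = 256                         # bins per action dimension (assumption)
ACTION_LOW, ACTION_HIGH = -1.0, 1.0  # normalized action range (assumption)

def actions_to_tokens(action: np.ndarray) -> list[int]:
    """Map each continuous action dimension to a bin index ("token id")."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    unit = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)  # -> [0, 1]
    return list((unit * (N_BINS - 1)).round().astype(int))

def tokens_to_actions(tokens: list[int]) -> np.ndarray:
    """Invert the mapping so a decoded token sequence becomes a robot command."""
    unit = np.asarray(tokens, dtype=float) / (N_BINS - 1)
    return unit * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW

# Example: a 7-DoF end-effector delta (x, y, z, roll, pitch, yaw, gripper)
action = np.array([0.1, -0.3, 0.05, 0.0, 0.0, 0.2, 1.0])
tokens = actions_to_tokens(action)       # the "words" the LLM would output
recovered = tokens_to_actions(tokens)    # close to `action`, up to quantization error
print(tokens, recovered)
```

With this kind of scheme, "adding actions" is mostly a fine-tuning job on the output side: the model's vocabulary is extended (or a reserved token range is repurposed) and it learns to emit action-token sequences conditioned on images and instructions, which is why I suspect it can be done on top of an existing Gemini 2.0 checkpoint.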
Natively multimodal LLMs are basically brains.
Absolutely not.
https://arxiv.org/abs/2506.01844
Explanation by PhosphoAI: https://www.youtube.com/watch?v=00A6j02v450
The only suggestion I have is “study more”.
If it looks like a duck and quacks like a duck...
Just because it is alien to you does not mean it is not a brain; please go look up the definition of the word.
And my comment is useful: a VLA implies that it processes its input and output natively, which is something a brain does, hence my comment.