
301 points SerCe | 4 comments
1. lelag ◴[] No.43113916[source]
Really interesting model, I'm looking forward to playing with it.

But what I want is a multimodal agent model capable of generating embeddings for a humanoid control model like Meta motivo[0] rather than directly outputting coordinates.

Meta motivo is still a toy model, trained on the SMPL skeleton, which lacks fingers; that limits its usefulness beyond having some fun with it. They could have used the more advanced body model, SMPL-X, which includes fingers, but there isn't enough open motion data with precise finger motion to train a robust manipulation model anyway.
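
To make it concrete, the way you drive Meta motivo today is by handing its policy a latent behaviour embedding z. Roughly like this in Python (going from memory of the repo's README, so exact names, checkpoint ids and sizes may be off — treat it as a sketch, not gospel):

    import torch
    from metamotivo.fb_cpr.huggingface import FBcprModel

    # Pretrained FB-CPR policy; checkpoint name per the HuggingFace model cards.
    model = FBcprModel.from_pretrained("facebook/metamotivo-M-1")

    # z selects a behaviour. Today you sample it or infer it from a reward/goal;
    # what I'd want is a multimodal agent model that emits it directly.
    z = model.sample_z(1)

    OBS_DIM = 358  # proprioceptive obs size of the HumEnv humanoid
                   # (from memory -- check the model card)
    obs = torch.zeros(1, OBS_DIM)  # in practice this comes from the humanoid env;
                                   # a zero tensor is enough for a smoke test
    action = model.act(obs, z, mean=True)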

Most existing motion datasets come from academic motion capture setups, which are complex, not focused on manipulation tasks, and also pretty old. I believe this gap will be filled by improvements in 3D human pose estimation (HPE) from 2D video. With access to thousands of hours of video, we can build large-scale motion datasets covering a wide range of real-world interactions.
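
The extraction pipeline itself doesn't need to be fancy; conceptually it's just per-frame SMPL-X regression plus temporal smoothing. A hypothetical sketch (estimate_smplx_sequence is a stand-in for whatever video-based 3D HPE model you pick — it's not a real API):

    import cv2
    import numpy as np

    def load_frames(path):
        """Read a video file into a list of RGB frames."""
        cap = cv2.VideoCapture(path)
        frames = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        cap.release()
        return frames

    def estimate_smplx_sequence(frames):
        """Placeholder for a 3D HPE model regressing SMPL-X parameters per frame:
        body pose (63), left/right hand pose (45 each), betas (10)."""
        raise NotImplementedError("plug in your pose estimator here")

    def build_clip(path):
        frames = load_frames(path)
        params = estimate_smplx_sequence(frames)  # list of per-frame dicts
        # Stack into arrays; ideally temporally smooth before saving.
        return {k: np.stack([p[k] for p in params]) for k in params[0]}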

This will enable training the two components needed for dexterous humanoid robots: an agentic model that decides what actions to take and emits embeddings, and a control model that reads those embeddings and accurately drives hand and finger joint movement.
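
The interface between the two components is basically just a fixed-size vector. A hypothetical sketch of the plumbing (all names and dimensions here are made up; the point is that the agent emits an embedding rather than joint coordinates):

    import torch
    import torch.nn as nn

    class AgenticModel(nn.Module):
        """Multimodal agent: vision + instruction features in, behaviour embedding out."""
        def __init__(self, vision_dim=768, text_dim=768, z_dim=256):
            super().__init__()
            self.head = nn.Sequential(
                nn.Linear(vision_dim + text_dim, 512), nn.ReLU(),
                nn.Linear(512, z_dim),
            )

        def forward(self, vision_feat, text_feat):
            return self.head(torch.cat([vision_feat, text_feat], dim=-1))

    class ControlModel(nn.Module):
        """Low-level policy: proprioceptive state + embedding -> joint targets,
        including the 2 x 15 finger joints of an SMPL-X-like hand."""
        def __init__(self, state_dim=300, z_dim=256, n_joints=55):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim + z_dim, 512), nn.ReLU(),
                nn.Linear(512, n_joints * 3),
            )

        def forward(self, state, z):
            return self.net(torch.cat([state, z], dim=-1))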

Given the rapid progress of SoTA 3D HPE from 2D video, and the vast amount of video available online (YouTube), I expect we will see humanoid robots with good manipulation capabilities in the not-so-distant future.

[0]: https://github.com/facebookresearch/metamotivo

replies(2): >>43114115 #>>43114172 #
2. michaelbuckbee ◴[] No.43114115[source]
Trying to wrap my head around this - are you saying that those models are trained around the concept of fingers (some kind of physical manipulators with set dimensions)?
replies(1): >>43114208 #
3. lelag ◴[] No.43114172[source]
Some more thoughts about training a manipulation model: I would add that synthetic data might be key to making it happen.

One issue is that most video is not shot in first person, so it might make for a poor dataset for the agentic part, assuming the robot has human-like vision.

Still, if you have a large dataset of motion capture data with reasonably accurate finger movement, you could use a video diffusion model with a ControlNet to get a realistic-looking video of a specific motion in first person. Another way would be to use a model like dust3r to generate a geometric 3D scene from the initial video, allowing you to change the camera angle to match a first-person view.

This could be used as the dataset for the agentic model.
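
As a rough sketch of that pipeline (every function below is a hypothetical wrapper around the respective model — nothing here is a real API):

    def mocap_to_conditioning(smplx_sequence, camera="egocentric"):
        """Render the mocap clip into per-frame control signals
        (depth maps, 2D keypoints, hand masks) from a head-mounted camera."""
        ...

    def controlnet_video_diffusion(conditioning, prompt):
        """Run a video diffusion model with a ControlNet over the conditioning
        frames to get a photorealistic first-person clip of the motion."""
        ...

    def reproject_with_dust3r(third_person_video):
        """Alternative route: lift the original video into a 3D scene with a
        dust3r-style model, then re-render it from a first-person viewpoint."""
        ...

    def make_training_pair(smplx_sequence, prompt):
        video = controlnet_video_diffusion(
            mocap_to_conditioning(smplx_sequence), prompt)
        return {"video": video, "motion": smplx_sequence}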

Now, maybe human-like vision is not even necessary: unlike a human, there is nothing preventing your robot from seeing through external cameras placed around the house. Hell, there's even a good chance your robot's brain will live in a datacenter hundreds of miles away.

4. lelag ◴[] No.43114208[source]
The SMPL-X body model, a standard in this academic field, does model fingers: https://smpl-x.is.tue.mpg.de/

The issue is that there are far fewer datasets available for it than for the simpler SMPL model.
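
If you haven't played with it, the smplx Python package makes the finger parameterisation pretty tangible: each hand gets its own 15-joint pose vector. Roughly like this (kwargs from memory, check the repo):

    import torch
    import smplx

    # "models/" points at the downloaded SMPL-X model files
    model = smplx.create("models/", model_type="smplx",
                         gender="neutral", use_pca=False)

    output = model(
        betas=torch.zeros(1, 10),               # body shape
        body_pose=torch.zeros(1, 21 * 3),       # body joints (axis-angle)
        left_hand_pose=torch.zeros(1, 15 * 3),  # 15 finger joints per hand
        right_hand_pose=torch.zeros(1, 15 * 3),
        return_verts=True,
    )
    print(output.joints.shape, output.vertices.shape)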

Regarding fingers, you already have "dumb" controllers like https://github.com/google-deepmind/mujoco_mpc which can control finger movement to achieve specific tasks.

Look at this video to see it in action: https://www.youtube.com/watch?v=2xVN-qY78P4&t=387s

Pretty cool stuff.