>the model is basically a diva about camera positioning. move the camera 10 degrees and suddenly it thinks left is right and up is down.
Reminds me of how, several years ago, Tesla had to finally start explicitly extracting a 3D model from the net. I expect the same thing to happen here: the system gets pipelined, with one model extracting/building the 3D representation and another acting as the "robot" operating in that 3D space. Each one can then be trained separately, far more efficiently and with much better transfer and generalization than a large monolithic model working from 2D video. And in the pipelined approach it becomes trivial to generate synthetic 3D input data that covers the interesting parts of scenario space for the "robot" model.
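A minimal sketch of what I mean (all names and fields here are hypothetical placeholders, not any real API): the only contract between the two models is an explicit 3D scene state, and the policy can be trained purely on synthetic scenes sampled directly in that space, never touching pixels.

```python
from dataclasses import dataclass
from typing import Protocol
import numpy as np

@dataclass
class Scene3D:
    """Explicit 3D scene state: the narrow interface between the two models."""
    points: np.ndarray   # (N, 3) surface points in a world frame
    poses: np.ndarray    # (K, 6) object poses (xyz + roll/pitch/yaw)

class PerceptionModel(Protocol):
    """Front-end: raw sensor input -> Scene3D."""
    def extract(self, sensor_input: np.ndarray) -> Scene3D: ...

class RobotPolicy:
    """The 'robot' model: only ever sees Scene3D, never raw sensor data."""
    def act(self, scene: Scene3D) -> np.ndarray:
        # stand-in: a real policy network would go here
        return np.zeros(7)  # e.g. a 7-DoF arm command

def sample_synthetic_scene(rng: np.random.Generator) -> Scene3D:
    """Synthetic policy training data: trivial to generate directly in 3D,
    biased toward whatever corner of scenario space you care about."""
    n = int(rng.integers(50, 200))
    return Scene3D(points=rng.uniform(-1.0, 1.0, size=(n, 3)),
                   poses=rng.uniform(-1.0, 1.0, size=(3, 6)))
```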
And, for example, you can't feed the large monolithic model a lidar point cloud instead of video without significant retraining. Whereas in the pipelined approach, you just swap out the input model that generates the 3D representation (continuing the sketch above).
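Building on the previous sketch, switching sensors only means writing a new front-end against the same interface; the policy is untouched:

```python
class LidarPerception:
    """Lidar front-end: point cloud -> Scene3D. Drop-in replacement for a
    camera front-end; the extraction logic here is just a placeholder."""
    def extract(self, point_cloud: np.ndarray) -> Scene3D:
        # stand-in: a real model would segment and register objects here
        return Scene3D(points=point_cloud, poses=np.zeros((0, 6)))

def run(perception: PerceptionModel, policy: RobotPolicy,
        sensor_input: np.ndarray) -> np.ndarray:
    # Same policy regardless of whether perception is camera- or lidar-based.
    return policy.act(perception.extract(sensor_input))
```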