Getting splats out of photos generally requires many calibrated photos, each with highly accurate position and orientation data attached. Ideally cm-accurate 6-DOF poses, plus, as you mention, the camera calibration.
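A minimal sketch of what "calibrated" means per photo, just to make it concrete (field names are illustrative, not any particular tool's format): a 6-DOF pose plus intrinsics.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CalibratedImage:
    R: np.ndarray     # 3x3 world-to-camera rotation (orientation, 3 DOF)
    t: np.ndarray     # 3-vector translation (position, 3 DOF)
    fx: float         # focal lengths in pixels (intrinsic calibration)
    fy: float
    cx: float         # principal point
    cy: float
    dist: np.ndarray  # lens distortion coefficients

    def K(self) -> np.ndarray:
        """3x3 intrinsic matrix used to project splats into this view."""
        return np.array([[self.fx, 0.0, self.cx],
                         [0.0, self.fy, self.cy],
                         [0.0, 0.0, 1.0]])
```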
There are pipelines that infer this data, and there is some work showing that models can be trained to do well with fuzzy/inaccurate pose information. But they would still want 30+ photos to do that desk demo video justice and look good.
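The usual way those pipelines recover the missing data is structure-from-motion. A rough sketch using COLMAP's CLI (paths illustrative): run feature extraction, matching, and mapping over the photo set, then read back per-image poses and intrinsics for splat training.

```python
import subprocess

IMAGES = "desk_photos/"   # the ~30+ input photos
DB = "work/colmap.db"
SPARSE = "work/sparse/"

# Detect keypoints, match them across images, then solve for camera
# poses + intrinsics + a sparse point cloud.
subprocess.run(["colmap", "feature_extractor",
                "--database_path", DB, "--image_path", IMAGES], check=True)
subprocess.run(["colmap", "exhaustive_matcher",
                "--database_path", DB], check=True)
subprocess.run(["colmap", "mapper",
                "--database_path", DB, "--image_path", IMAGES,
                "--output_path", SPARSE], check=True)
# The resulting sparse model (cameras.bin / images.bin) holds the 6-DOF
# poses and calibration that splat training then consumes.
```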
Splatt3R (sigh on typing that name, geez) is a very different architecture: it’s a transformer model trained to ‘know’ about the world. It takes only two photos, combines them with that trained-in world knowledge, and hallucinates (infers) a set of Gaussians it believes is a plausible continuation of those two photos.
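Not Splatt3R's actual code, just the shape of the idea: a single feed-forward pass maps two RGB images straight to per-pixel Gaussian parameters, with no per-scene optimisation. Tensor names and shapes here are assumptions for illustration.

```python
import torch

def predict_gaussians(model: torch.nn.Module,
                      img_a: torch.Tensor,      # (3, H, W)
                      img_b: torch.Tensor) -> dict:
    out = model(torch.stack([img_a, img_b]))    # one forward pass, two views in
    H, W = img_a.shape[1:]
    n = 2 * H * W                               # one Gaussian per pixel, per view
    return {
        "means":     out["means"].reshape(n, 3),      # 3D centres
        "scales":    out["scales"].reshape(n, 3),     # per-axis extent
        "rotations": out["rotations"].reshape(n, 4),  # quaternions
        "opacities": out["opacities"].reshape(n, 1),
        "colours":   out["colours"].reshape(n, 3),
    }
```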