Imagine being able to create an accurate enough 3D rendering of any interior with just a bunch of snapshots anyone can take with their phone.
The splats are individual soft, elongated 3D blobs (anisotropic Gaussians -- stretchable ellipsoids rather than spheres), thousands to millions of them, floating in a 3D coordinate space. 3D pixels, essentially. Each one carries view-dependent colour properties, so it can look different from different angles (e.g. environmental lighting, reflections, etc.)
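To make that concrete, here's a minimal sketch of the parameters a typical 3DGS-style splat carries (the field names and shapes are my own simplification, not any particular codebase's):

    import numpy as np
    from dataclasses import dataclass

    # Illustrative per-splat parameters in the style of the original 3DGS
    # formulation; names and shapes are a simplification of mine.
    @dataclass
    class GaussianSplat:
        mean: np.ndarray      # (3,) centre position in world space
        scale: np.ndarray     # (3,) per-axis extents -- the "elongated" shape
        rotation: np.ndarray  # (4,) unit quaternion orienting the ellipsoid
        opacity: float        # alpha used when compositing overlapping splats
        sh_rgb: np.ndarray    # (K, 3) spherical-harmonic colour coefficients,
                              # which is what makes colour view-dependent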
The magic, obviously, is figuring out how the pixels in a set of pictures correspond to one another once translated into a 3D space filled with splats. Traditionally it took a load of pictures for the algorithm to work that out, so doing it with two pictures is pretty amazing.
Note: I've made some simplifying assumptions in the above explanation.
Historically, camera poses have been estimated via 2D image-matching techniques like SIFT [1], through software packages like COLMAP.
These algorithms work well when you have many images that methodically cover a scene. However, they often struggle to produce accurate estimates in the few-image regime, or "in the wild" where photos are taken casually with less rigorous scene coverage.
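For flavour, the classical 2D matching step looks roughly like this -- a sketch using OpenCV's SIFT rather than COLMAP's actual pipeline, with made-up filenames:

    import cv2

    # Classical 2D feature matching of the kind COLMAP-style pipelines
    # build on, sketched with OpenCV's SIFT.
    img1 = cv2.imread("view_a.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical files
    img2 = cv2.imread("view_b.jpg", cv2.IMREAD_GRAYSCALE)

    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Brute-force matching plus Lowe's ratio test to drop ambiguous matches.
    matcher = cv2.BFMatcher()
    good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
            if m.distance < 0.75 * n.distance]

    # With enough such matches across many overlapping photos, bundle
    # adjustment can recover camera poses; with a few casual photos it
    # often can't.
    print(f"{len(good)} putative correspondences")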
To address this, a major trend in the field is to move away from classical 2D algorithms, instead leveraging methods that incorporate 3D “priors” or knowledge of the world.
To that end, this paper builds heavily on MASt3R [2], a vision transformer model trained to reconstruct a 3D scene from 2D image pairs. The authors add another projection head that outputs the initial parameters for each Gaussian primitive. They then further optimize the Gaussians through some clever use of the original image pair plus randomly selected, rendered novel views -- essentially the original 3DGS optimization, but with synthesized target images instead of real ones (hence "zero-shot" in the title).
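My mental model of that refinement stage, sketched in PyTorch (every name here is illustrative, and "renderer" stands in for a differentiable splat rasterizer -- the paper's actual losses and details differ):

    import torch

    def refine(init_gaussians, renderer, real_pairs, novel_poses, steps=200):
        # Freeze the network's initial prediction and render a few novel
        # views from it; these synthesized images act as pseudo ground
        # truth alongside the two real input photos.
        with torch.no_grad():
            targets = [(p, renderer(init_gaussians, p)) for p in novel_poses]
        targets += real_pairs  # [(pose_a, img_a), (pose_b, img_b)]

        # From here it's essentially the original 3DGS optimization loop.
        g = init_gaussians.detach().clone().requires_grad_(True)
        opt = torch.optim.Adam([g], lr=1e-3)
        for _ in range(steps):
            pose, target = targets[torch.randint(len(targets), (1,)).item()]
            loss = (renderer(g, pose) - target).abs().mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        return g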
I do think this general approach will dominate the field in the coming years, but it brings its own unique challenges.
In particular, the quadratic time complexity of transformer attention is the main computational bottleneck preventing this technique from being scaled up to more than two images at a time, or to resolutions beyond 512 x 512.
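A quick back-of-envelope makes the scaling obvious (assuming ViT-style 16 x 16 patches, which is an assumption on my part):

    # Attention score-matrix entries per head per layer, assuming each
    # image is tokenized into 16x16 patches.
    def attention_pairs(n_images, res, patch=16):
        tokens = n_images * (res // patch) ** 2
        return tokens ** 2

    print(attention_pairs(2, 512))   # ~4.2M pairs
    print(attention_pairs(2, 1024))  # double the resolution -> 16x the work
    print(attention_pairs(8, 512))   # 4x the images -> also 16x the work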
Also, naive image matching itself has quadratic time complexity, which is really painful with large dense latent vectors and can't be accelerated with k-d trees due to the curse of dimensionality. That's why the authors use a hierarchical coarse-to-fine algorithm that approximates the exact computation and achieves linear time complexity with respect to image resolution.
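Roughly, the coarse-to-fine trick looks like this -- a toy numpy sketch of the general idea, not the authors' exact algorithm, and it assumes H and W are divisible by the stride:

    import numpy as np

    # Toy coarse-to-fine matching: exhaustive matching only on a subsampled
    # grid, then refinement inside a fixed-size local window.
    def coarse_to_fine(feat_a, feat_b, stride=8, win=2):
        H, W, C = feat_a.shape
        ca = feat_a[::stride, ::stride].reshape(-1, C)
        cb = feat_b[::stride, ::stride].reshape(-1, C)
        nn = (ca @ cb.T).argmax(axis=1)  # quadratic, but only over coarse cells

        matches, gw = [], W // stride
        for i, j in enumerate(nn):
            ya, xa = (i // gw) * stride, (i % gw) * stride
            yb, xb = (j // gw) * stride, (j % gw) * stride
            # The fine search touches only a constant-size window per match,
            # so total work grows linearly with image resolution.
            y0, y1 = max(yb - win, 0), min(yb + win + 1, H)
            x0, x1 = max(xb - win, 0), min(xb + win + 1, W)
            patch = feat_b[y0:y1, x0:x1].reshape(-1, C)
            k = int((patch @ feat_a[ya, xa]).argmax())
            matches.append(((ya, xa),
                            (y0 + k // (x1 - x0), x0 + k % (x1 - x0))))
        return matches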
[1] https://en.m.wikipedia.org/wiki/Scale-invariant_feature_tran...
Getting splats out of photos generally requires many calibrated photos, each with highly accurate position and orientation data attached -- ideally cm-accurate 6-degree-of-freedom poses, plus, as you mention, lens calibration.
There are some pipelines that infer this data, and there is some work showing that models can be trained to cope with fuzzy/inaccurate position information. But they'd still want something like 30+ photos to do that desk demo video justice and look good.
Splatt3R (sigh on typing that name, geez) is a very different architecture, in that it's a transformer model trained to 'know' about the world. It takes two photos (only two), combines them with its trained-in world knowledge, and hallucinates (infers) a set of Gaussians that it believes is a plausible continuation of those two photos.
Now think to yourself: "Could I approximate, say, Kirby with 10 splats? And could I get a GPU to home in on the best splats to approximate Kirby, maybe using gradient descent?" Then ask yourself: "Could I get a 4k+ photographic-grade 3D scene, including transmissive and reflective behaviour, using this method?"
If your answer to the second is "obviously!", then you have a good head for this ML stuff. Somewhat surprisingly (shockingly?), the answer is "yes". And, also somewhat surprisingly, you can use ML pipelines and autograd-type tech to home in on what the 18 or so implied variables per splat should be. When you have millions of them, they look like AMAZING reconstructions of scenes -- and they're incredibly quick to render.
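If you want to feel this in your fingers, here's a toy 2D version of exactly that exercise: fitting a handful of isotropic Gaussian "splats" to a target image with plain PyTorch autograd. The real thing adds anisotropy, rotation, opacity and view-dependent colour per splat, but the honing-in mechanism is the same.

    import torch

    H = W = 64
    yy, xx = torch.meshgrid(torch.linspace(0, 1, H),
                            torch.linspace(0, 1, W), indexing="ij")
    # Stand-in "Kirby": a filled disc as the target image.
    target = (((xx - 0.5) ** 2 + (yy - 0.5) ** 2) < 0.1).float()

    n = 10
    pos = torch.rand(n, 2, requires_grad=True)            # splat centres
    log_sig = torch.full((n,), -2.0, requires_grad=True)  # log std-devs
    amp = torch.rand(n, requires_grad=True)               # intensities

    opt = torch.optim.Adam([pos, log_sig, amp], lr=5e-2)
    for step in range(500):
        # Render: sum of Gaussians evaluated over the pixel grid.
        d2 = (xx - pos[:, 0, None, None]) ** 2 \
           + (yy - pos[:, 1, None, None]) ** 2
        sig2 = log_sig.exp()[:, None, None] ** 2
        render = (amp[:, None, None] * torch.exp(-d2 / sig2)).sum(0)
        # Gradient descent hones every splat parameter at once.
        loss = ((render - target) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()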
Splats are pretty cool.