
145 points jasondavies | 1 comment
1. refibrillator No.41371457
Novel view synthesis via 3DGS requires knowledge of the camera pose for every input image, i.e., the camera position and orientation in 3D space.
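
To make "camera pose" concrete, here is a minimal sketch of a pose as a rotation plus a translation, followed by a pinhole projection. The function names, the world-to-camera convention, and the intrinsics (focal length 500, principal point at 256, 256) are illustrative assumptions, not taken from the paper or from any particular library.

```python
import math

def rotation_y(angle_rad):
    """3x3 rotation about the y axis (the orientation part of a pose)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]

def world_to_camera(R, t, p_world):
    """Apply world-to-camera extrinsics: p_cam = R @ p_world + t."""
    return [sum(R[i][j] * p_world[j] for j in range(3)) + t[i] for i in range(3)]

def project(p_cam, focal, cx, cy):
    """Pinhole projection of a camera-space point to pixel coordinates."""
    x, y, z = p_cam
    return (focal * x / z + cx, focal * y / z + cy)

def camera_center(R, t):
    """Camera position in world coordinates: C = -R^T t."""
    return [-sum(R[j][i] * t[j] for j in range(3)) for i in range(3)]

# Hypothetical pose: no rotation, translation t = (0, 0, 2) in the
# world-to-camera convention used above.
R = rotation_y(0.0)
t = [0.0, 0.0, 2.0]

center = camera_center(R, t)                  # camera position in the world
p_cam = world_to_camera(R, t, [0.5, 0.0, 0.0])
pixel = project(p_cam, focal=500.0, cx=256.0, cy=256.0)
```

If R or t is wrong for an image, every 3D point lands on the wrong pixel, which is why pose errors hurt 3DGS optimization so directly.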

Historically, camera poses have been estimated with 2D image matching techniques like SIFT [1], typically through software packages like COLMAP.

These algorithms work well when you have many images that methodically cover a scene. However, they often struggle to produce accurate estimates in the few-image regime, or “in the wild” where photos are taken casually with less rigorous scene coverage.

To address this, a major trend in the field is to move away from classical 2D algorithms, instead leveraging methods that incorporate 3D “priors” or knowledge of the world.

To that end, this paper builds heavily on MASt3R [2], a vision transformer model trained to reconstruct a 3D scene from 2D image pairs. The authors added another projection head that outputs the initial parameters for each Gaussian primitive. They further optimize the Gaussians through some clever use of the original image pair and randomly selected, rendered novel views, which is basically the original 3DGS algorithm but with synthesized target images instead (hence “zero-shot” in the title).

I do think this general approach will dominate the field in the coming years, but it brings its own unique challenges.

In particular, the quadratic time complexity of transformers is the main computational bottleneck preventing this technique from being scaled up to more than two images at a time, and to resolutions beyond 512 x 512.
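
A rough back-of-the-envelope sketch of that quadratic growth follows. The patch size of 16 and the idea that all tokens from all images attend to each other are simplifying assumptions of mine, not details confirmed by the paper; `attention_cost` is a hypothetical helper.

```python
def attention_cost(height, width, patch=16, num_images=1):
    """Return (token_count, pairwise_attention_scores) for one attention layer.

    Assumes square non-overlapping patches and every token attending to every
    other token across all images (a simplification, not the paper's layout).
    """
    tokens_per_image = (height // patch) * (width // patch)
    tokens = tokens_per_image * num_images
    return tokens, tokens * tokens

base = attention_cost(512, 512)                      # one 512 x 512 image
double_res = attention_cost(1024, 1024)              # double the side length
four_views = attention_cost(512, 512, num_images=4)  # four images jointly
```

Under these assumptions, doubling the side length or going from one image to four each multiplies the number of attention scores per layer by 16.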

Also, naive image matching itself has quadratic time complexity, which is really painful with large dense latent vectors and can’t be accelerated with kd-trees due to the curse of dimensionality. That’s why the authors use a hierarchical coarse-to-fine algorithm that approximates the exact computation and achieves linear time complexity with respect to image resolution.
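
For reference, the naive baseline looks roughly like the sketch below: brute-force reciprocal (mutual nearest neighbor) matching over two sets of descriptors. This is only the quadratic baseline under assumed dot-product similarity, not the authors' coarse-to-fine method, and the tiny 2D descriptors are made up for illustration.

```python
def dot(a, b):
    """Dot-product similarity between two descriptors."""
    return sum(x * y for x, y in zip(a, b))

def nearest(query, candidates):
    """Index of the candidate with the highest similarity to the query."""
    return max(range(len(candidates)), key=lambda j: dot(query, candidates[j]))

def mutual_matches(desc_a, desc_b):
    """Brute-force reciprocal matching.

    Every descriptor in one set is compared against every descriptor in the
    other, so the cost is on the order of len(desc_a) * len(desc_b).
    """
    a_to_b = [nearest(d, desc_b) for d in desc_a]
    b_to_a = [nearest(d, desc_a) for d in desc_b]
    # Keep only pairs that pick each other as nearest neighbor.
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]

# Made-up 2D descriptors; real dense features have one high-dimensional
# vector per pixel or patch.
desc_a = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
desc_b = [[0.0, 0.9], [0.9, 0.1]]
matches = mutual_matches(desc_a, desc_b)
```

With one descriptor per pixel, both sets grow with the pixel count, so the number of similarity evaluations grows with its square, which is what a coarse-to-fine scheme is meant to avoid.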

[1] https://en.m.wikipedia.org/wiki/Scale-invariant_feature_tran...

[2] https://github.com/naver/mast3r