
145 points | jasondavies | 1 comment
teqsun No.41371177
Just to check my understanding: the novel part of this is that it generates the splat from two pictures taken by any camera, without custom hand-calibration for that particular camera, and everything else involved is existing technology?
replies(1): >>41375705 #
1. vessenes No.41375705
Good question! “Existing technology” is doing a lot of work here, enough that I would say no.

Getting splats out of photos generally requires many calibrated photos, each with highly accurate placement and orientation data attached: ideally cm-accurate 6-degree-of-freedom (6-DoF) poses, plus, as you mention, per-camera calibration.
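To make "calibrated photo with a 6-DoF pose" concrete, here is a minimal pinhole-camera sketch. The names and layout are illustrative (not any particular pipeline's format): intrinsics are a focal length and principal point from calibration, and the pose is a rotation plus translation.

```python
import math

# Hypothetical minimal model of a "calibrated photo": intrinsics
# (fx, fy, cx, cy) from camera calibration, plus a 6-DoF pose
# (3 rotation + 3 translation parameters, here given as R and t).

def rot_z(theta):
    """3x3 rotation about the z-axis (one of the 3 rotation DoF)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def project(point, R, t, fx, fy, cx, cy):
    """Project a world point to pixel coordinates: x = K [R | t] X."""
    # world -> camera frame
    xc = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    # perspective divide, then apply intrinsics
    return (fx * xc[0] / xc[2] + cx, fy * xc[1] / xc[2] + cy)

# Sanity check: a point 5 m straight ahead of an identity-pose camera
# lands exactly on the principal point (cx, cy).
R_id = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u, v = project([0.0, 0.0, 5.0], R_id, [0.0, 0.0, 0.0],
               fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
```

Classical splat pipelines need these K, R, t values to be accurate for every input photo, which is exactly the burden a feed-forward model removes.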

There are some pipelines that infer this data, and there is some work showing that models can be trained to do well with fuzzy/inaccurate position information. But they would still want 30+ photos to do that desk demo video justice and look good.

Splatt3R (sigh on typing that name, geez) is a very different architecture: it's a transformer model trained to 'know' about the world. It takes only two photos, combines them with its trained-in world knowledge, and hallucinates (infers) a set of Gaussians that it believes is a sort of plausible continuation of those two photos.
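For anyone unfamiliar with what "a set of Gaussians" means here, each splat such a model predicts is a small parameter bundle. This is a hedged sketch of a common 3D Gaussian splatting parameterization; the field names and layout are illustrative, not Splatt3R's actual output format.

```python
from dataclasses import dataclass

# Illustrative (hypothetical) per-splat parameters; real models typically
# replace the RGB color with spherical-harmonics coefficients for
# view-dependent appearance.

@dataclass
class Gaussian:
    mean: tuple      # 3D center (x, y, z)
    scale: tuple     # per-axis extent (sx, sy, sz)
    rotation: tuple  # orientation as a unit quaternion (w, x, y, z)
    opacity: float   # alpha in [0, 1]
    color: tuple     # RGB in this simplified layout

def n_params(g: Gaussian) -> int:
    """Free parameters per splat under this minimal layout."""
    return len(g.mean) + len(g.scale) + len(g.rotation) + 1 + len(g.color)

g = Gaussian(mean=(0.0, 0.0, 1.0),
             scale=(0.01, 0.01, 0.01),
             rotation=(1.0, 0.0, 0.0, 0.0),
             opacity=0.9,
             color=(0.5, 0.5, 0.5))
# 3 + 3 + 4 + 1 + 3 = 14 parameters per Gaussian
```

A feed-forward model emits thousands of these in one pass from the two input photos, rather than optimizing them per scene against dozens of posed images.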