Imagine being able to create an accurate enough 3D rendering of any interior with just a bunch of snapshots anyone can take with their phone.
The splats are individual soft, elongated 3D blobs (anisotropic Gaussians -- stretchable ellipsoids rather than spheres), thousands to millions of them, floating in a 3D coordinate space. 3D pixels, essentially. Each one carries view-dependent colour properties, so it can look different from different angles (e.g. environmental lighting, reflections, etc.)
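To make that concrete, here's a minimal sketch of the parameters a typical 3DGS-style splat carries (the field names and shapes are my own simplification, not any particular codebase's):

    import numpy as np
    from dataclasses import dataclass

    # Illustrative per-splat parameters in the style of the original 3DGS
    # formulation; names and shapes are a simplification of mine.
    @dataclass
    class GaussianSplat:
        mean: np.ndarray      # (3,) centre position in world space
        scale: np.ndarray     # (3,) per-axis extents -- the "elongated" shape
        rotation: np.ndarray  # (4,) unit quaternion orienting the ellipsoid
        opacity: float        # alpha used when compositing overlapping splats
        sh_rgb: np.ndarray    # (K, 3) spherical-harmonic colour coefficients,
                              # which is what makes colour view-dependent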
The magic, obviously, is figuring out how the pixels in a set of pictures correspond to one another once translated into a 3D space filled with splats. Traditionally it took a load of pictures for the algorithm to work that out, so doing it with two pictures is pretty amazing.
Note: I've made some simplifying assumptions in the above explanation.
Historically, camera poses have been estimated via 2D image-matching techniques like SIFT [1], through software packages like COLMAP.
These algorithms work well when you have many images that methodically cover a scene. However, they often struggle to produce accurate estimates in the few-image regime, or "in the wild" where photos are taken casually with less rigorous scene coverage.
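For flavour, the classical 2D matching step looks roughly like this -- a sketch using OpenCV's SIFT rather than COLMAP's actual pipeline, with made-up filenames:

    import cv2

    # Classical 2D feature matching of the kind COLMAP-style pipelines
    # build on, sketched with OpenCV's SIFT.
    img1 = cv2.imread("view_a.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical files
    img2 = cv2.imread("view_b.jpg", cv2.IMREAD_GRAYSCALE)

    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Brute-force matching plus Lowe's ratio test to drop ambiguous matches.
    matcher = cv2.BFMatcher()
    good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
            if m.distance < 0.75 * n.distance]

    # With enough such matches across many overlapping photos, bundle
    # adjustment can recover camera poses; with a few casual photos it
    # often can't.
    print(f"{len(good)} putative correspondences")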
To address this, a major trend in the field is to move away from classical 2D algorithms, instead leveraging methods that incorporate 3D “priors” or knowledge of the world.
To that end, this paper builds heavily on MASt3R [2], a vision transformer model trained to reconstruct a 3D scene from 2D image pairs. The authors add another projection head that outputs the initial parameters for each Gaussian primitive. They then further optimize the Gaussians through some clever use of the original image pair plus randomly selected, rendered novel views -- essentially the original 3DGS optimization, but with synthesized target images instead of real ones (hence "zero-shot" in the title).
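My mental model of that refinement stage, sketched in PyTorch (every name here is illustrative, and "renderer" stands in for a differentiable splat rasterizer -- the paper's actual losses and details differ):

    import torch

    def refine(init_gaussians, renderer, real_pairs, novel_poses, steps=200):
        # Freeze the network's initial prediction and render a few novel
        # views from it; these synthesized images act as pseudo ground
        # truth alongside the two real input photos.
        with torch.no_grad():
            targets = [(p, renderer(init_gaussians, p)) for p in novel_poses]
        targets += real_pairs  # [(pose_a, img_a), (pose_b, img_b)]

        # From here it's essentially the original 3DGS optimization loop.
        g = init_gaussians.detach().clone().requires_grad_(True)
        opt = torch.optim.Adam([g], lr=1e-3)
        for _ in range(steps):
            pose, target = targets[torch.randint(len(targets), (1,)).item()]
            loss = (renderer(g, pose) - target).abs().mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        return g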
I do think this general approach will dominate the field in the coming years, but it brings its own unique challenges.
In particular, the quadratic time complexity of transformer attention is the main computational bottleneck preventing this technique from being scaled up to more than two images at a time, or to resolutions beyond 512 x 512.
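A quick back-of-envelope makes the scaling obvious (assuming ViT-style 16 x 16 patches, which is an assumption on my part):

    # Attention score-matrix entries per head per layer, assuming each
    # image is tokenized into 16x16 patches.
    def attention_pairs(n_images, res, patch=16):
        tokens = n_images * (res // patch) ** 2
        return tokens ** 2

    print(attention_pairs(2, 512))   # ~4.2M pairs
    print(attention_pairs(2, 1024))  # double the resolution -> 16x the work
    print(attention_pairs(8, 512))   # 4x the images -> also 16x the work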
Also, naive image matching itself has quadratic time complexity, which is really painful with large dense latent vectors and can't be accelerated with k-d trees due to the curse of dimensionality. That's why the authors use a hierarchical coarse-to-fine algorithm that approximates the exact computation and achieves linear time complexity with respect to image resolution.
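Roughly, the coarse-to-fine trick looks like this -- a toy numpy sketch of the general idea, not the authors' exact algorithm, and it assumes H and W are divisible by the stride:

    import numpy as np

    # Toy coarse-to-fine matching: exhaustive matching only on a subsampled
    # grid, then refinement inside a fixed-size local window.
    def coarse_to_fine(feat_a, feat_b, stride=8, win=2):
        H, W, C = feat_a.shape
        ca = feat_a[::stride, ::stride].reshape(-1, C)
        cb = feat_b[::stride, ::stride].reshape(-1, C)
        nn = (ca @ cb.T).argmax(axis=1)  # quadratic, but only over coarse cells

        matches, gw = [], W // stride
        for i, j in enumerate(nn):
            ya, xa = (i // gw) * stride, (i % gw) * stride
            yb, xb = (j // gw) * stride, (j % gw) * stride
            # The fine search touches only a constant-size window per match,
            # so total work grows linearly with image resolution.
            y0, y1 = max(yb - win, 0), min(yb + win + 1, H)
            x0, x1 = max(xb - win, 0), min(xb + win + 1, W)
            patch = feat_b[y0:y1, x0:x1].reshape(-1, C)
            k = int((patch @ feat_a[ya, xa]).argmax())
            matches.append(((ya, xa),
                            (y0 + k // (x1 - x0), x0 + k % (x1 - x0))))
        return matches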
[1] https://en.m.wikipedia.org/wiki/Scale-invariant_feature_tran...
Getting splats out of photos generally requires many calibrated photos, each with highly accurate position and orientation data attached -- ideally cm-accurate 6-degree-of-freedom poses, plus, as you mention, lens calibration.
There are some pipelines that infer this data, and there is some work showing that models can be trained to cope with fuzzy/inaccurate position information. But they'd still want something like 30+ photos to do that desk demo video justice and look good.
Splatt3R (sigh on typing that name, geez) is a very different architecture, in that it's a transformer model trained to 'know' about the world. It takes two photos (only two), combines them with its trained-in world knowledge, and hallucinates (infers) a set of Gaussians that it believes is a plausible continuation of those two photos.
Now think to yourself: "Could I approximate, say, Kirby with 10 splats? And could I get a GPU to home in on the best splats to approximate Kirby, maybe using gradient descent?" Then ask yourself: "Could I get a 4k+ photographic-grade 3D scene, including transmissive and reflective behaviour, using this method?"
If your answer to the second is "obviously!", then you have a good head for this ML stuff. Somewhat surprisingly (shockingly?), the answer is "yes". And, also somewhat surprisingly, you can use ML pipelines and autograd-type tech to home in on what the 18 or so implied variables per splat should be. When you have millions of them, they look like AMAZING reconstructions of scenes -- and they're incredibly quick to render.
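If you want to feel this in your fingers, here's a toy 2D version of exactly that exercise: fitting a handful of isotropic Gaussian "splats" to a target image with plain PyTorch autograd. The real thing adds anisotropy, rotation, opacity and view-dependent colour per splat, but the honing-in mechanism is the same.

    import torch

    H = W = 64
    yy, xx = torch.meshgrid(torch.linspace(0, 1, H),
                            torch.linspace(0, 1, W), indexing="ij")
    # Stand-in "Kirby": a filled disc as the target image.
    target = (((xx - 0.5) ** 2 + (yy - 0.5) ** 2) < 0.1).float()

    n = 10
    pos = torch.rand(n, 2, requires_grad=True)            # splat centres
    log_sig = torch.full((n,), -2.0, requires_grad=True)  # log std-devs
    amp = torch.rand(n, requires_grad=True)               # intensities

    opt = torch.optim.Adam([pos, log_sig, amp], lr=5e-2)
    for step in range(500):
        # Render: sum of Gaussians evaluated over the pixel grid.
        d2 = (xx - pos[:, 0, None, None]) ** 2 \
           + (yy - pos[:, 1, None, None]) ** 2
        sig2 = log_sig.exp()[:, None, None] ** 2
        render = (amp[:, None, None] * torch.exp(-d2 / sig2)).sum(0)
        # Gradient descent hones every splat parameter at once.
        loss = ((render - target) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()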
Splats are pretty cool.