145 points jasondavies | 28 comments
1. jonhohle ◴[] No.41367536[source]
The mirror in the example with the washing machine is amazing. Obviously the model doesn’t understand that it’s a mirror so renders it as if it were a window with volume behind the wall. But it does it so realistically that it produces the same effect as a mirror when viewed from different angles. This feels like something out of a sci-fi detective movie.
replies(3): >>41368494 #>>41372857 #>>41374230 #
2. S0y ◴[] No.41368437[source]
This is really awesome. A question for someone who knows more about this: How much harder would it be to make this work using any number of photos? I'm assuming this is the end goal for a model like this.

Imagine being able to create an accurate enough 3D rendering of any interior with just a bunch of snapshots anyone can take with their phone.

replies(2): >>41368608 #>>41368627 #
3. scoopdewoop ◴[] No.41368494[source]
Ha, reminds me of Duke Nukem mirrors, which are essentially the same thing: looking through a window at mirrored geometry.
replies(2): >>41370950 #>>41372402 #
4. dagmx ◴[] No.41368608[source]
That’s already how Gaussian splats work.

The novelty of Splatt3R (though I contest that they’re the first to do so) is that it needs fewer images than usual.

replies(2): >>41368661 #>>41368790 #
5. Arkanum ◴[] No.41368627[source]
Probably not much harder, but you wouldn't get the same massive jump in quality that you get going from 1 image to 2. NeRF/Gaussian Splatting in general is what you're describing, but from the looks of it, this just does it in a single forward pass rather than optimising the gaussian/network weights.
6. Arkanum ◴[] No.41368661{3}[source]
I think the novelty is that they don't have to optimise the splats at all, they're directly predicted in a single forward pass.
replies(1): >>41372531 #
7. GaggiX ◴[] No.41368790{3}[source]
The novelty here is that it does work on uncalibrated images.
replies(2): >>41369005 #>>41372536 #
8. rkagerer ◴[] No.41368805[source]
What is a splat?
replies(4): >>41369333 #>>41369665 #>>41369971 #>>41375736 #
9. milleramp ◴[] No.41369005{4}[source]
Not really, it is using Mast3r to determine camera poses.
10. llm_nerd ◴[] No.41369333[source]
You have a car in real life and want to visualize it in software. You take some pictures of the car from various angles -- each picture a 2D array of pixel data -- and process it through software, transforming it into effectively 3D pixels: Splats.

The splats are individual elongated 3D spheres (ellipsoids) -- thousands to millions of them -- floating in a 3D coordinate space. 3D pixels, essentially. Each has radiance colour properties, so it might have a different appearance from different angles (e.g. environmental lighting or reflection, etc.)

The magic is obviously figuring out how each pixel in a set of pictures correlates when translated to a 3D space filled with splats. Traditionally it took a load of pictures for it to rationalize, so doing it with two pictures is pretty amazing.

replies(2): >>41369872 #>>41371212 #
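To make the "splatting" direction of that concrete, here is a minimal numpy sketch (an illustrative toy, not any paper's implementation) of projecting a single 3D Gaussian splat onto an image plane, using the standard EWA-style Jacobian linearization of the perspective projection; all names and numbers are made up for the example:

```python
import numpy as np

# One "splat": a 3D Gaussian with a mean (position), a covariance
# (size/orientation/elongation), a color, and an opacity.
mean = np.array([0.0, 0.0, 5.0])      # splat center in camera space
cov3d = np.diag([0.5, 0.1, 0.1])      # elongated along the x axis
color = np.array([1.0, 0.2, 0.2])     # RGB
opacity = 0.8

# Perspective-project the mean onto the image plane (focal length f = 1).
f = 1.0
center2d = f * mean[:2] / mean[2]

# Approximate the projected 2D covariance via the Jacobian of the
# projection at the mean (the usual EWA splatting linearization).
J = np.array([[f / mean[2], 0.0, -f * mean[0] / mean[2] ** 2],
              [0.0, f / mean[2], -f * mean[1] / mean[2] ** 2]])
cov2d = J @ cov3d @ J.T

# Evaluate the resulting 2D Gaussian "splat" over a pixel grid and
# alpha-blend it over a white background.
xs = np.linspace(-0.5, 0.5, 64)
X, Y = np.meshgrid(xs, xs)
d = np.stack([X - center2d[0], Y - center2d[1]], axis=-1)
inv = np.linalg.inv(cov2d)
density = np.exp(-0.5 * np.einsum('...i,ij,...j', d, inv, d))
alpha = opacity * density                 # per-pixel alpha of this splat
image = alpha[..., None] * color + (1 - alpha[..., None]) * 1.0
```

A real renderer does this for millions of splats and composites them in depth order; the hard part the comment describes is inferring the 3D parameters from 2D photos in the first place.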
11. boltzmann64 ◴[] No.41369665[source]
When you throw a balloon of colored water at a wall, the impression it makes on the wall is called a splat. Say you have a function which takes a point in 3D and outputs a density value which goes to zero as you move away to infinity from the function's location (mean), like a bell curve (literally). Throw (project) that function onto a plane (your camera film), and you get a splat.

Note: I've made some simplifying assumptions in the above explanation.

12. CamperBob2 ◴[] No.41369872{3}[source]
So, voxels, then...?
replies(2): >>41370107 #>>41370956 #
13. dimatura ◴[] No.41369971[source]
I'm not a computer graphics expert, but traditionally (since long before the latest 3D gaussian splatting) I've seen splatting used in computer graphics to describe a way of rendering 3D elements onto a 2D canvas with some "transparency", similar to 2D alpha compositing. I think the word derives from "splatter" - like what happens when you throw a tomato against a wall, except here you're throwing some 3D entity onto the camera plane. In the current context of 3D gaussian splatting, the entities that are splatted are 3D gaussians, and the parameters of those 3D gaussians are inferred with optimization at run time and/or predicted from a trained model.
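A minimal sketch of that compositing step, assuming the splats have already been projected to per-pixel color and alpha layers (the function name and shapes are illustrative, not any library's API):

```python
import numpy as np

def composite_back_to_front(layers):
    """Alpha-composite projected splats, painter's-algorithm style.

    `layers` is a list of (color, alpha) pairs sorted far-to-near,
    where color is (H, W, 3) and alpha is (H, W). This just applies
    the standard "over" operator repeatedly.
    """
    H, W, _ = layers[0][0].shape
    out = np.zeros((H, W, 3))
    for color, alpha in layers:             # far first, near last
        a = alpha[..., None]
        out = color * a + out * (1 - a)     # near layer "over" the rest
    return out
```

Production renderers tend to use the equivalent front-to-back formulation so they can stop early once a pixel is effectively opaque, but the math is the same.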
14. llm_nerd ◴[] No.41370107{4}[source]
Similar, but with the significant difference that splats are elongated spheres with variable orientation and elongation. Voxels are fixed sized, fixed orientation cubes. Splatting can be much more efficient for many scenarios than voxels.
15. kridsdale3 ◴[] No.41370950{3}[source]
Damn, I'm lookin' good.
16. kridsdale3 ◴[] No.41370956{4}[source]
Fuzzy, round-ish voxels.
17. teqsun ◴[] No.41371177[source]
Just to check my understanding: the novel part of this is that it generates the 3D scene from two pictures taken with any camera, without custom hand-calibration for that particular camera, and everything else involved is existing technology?
replies(1): >>41375705 #
18. bredren ◴[] No.41371212{3}[source]
What do you call the processing after a splat, that identifies what's in the model and generates what should exist on the other side?
19. refibrillator ◴[] No.41371457[source]
Novel view synthesis via 3DGS requires knowledge of the camera pose for every input image, ie the cam position and orientation in 3D space.

Historically camera poses have been estimated via 2D image matching techniques like SIFT [1], through software packages like COLMAP.

These algorithms work well when you have many images that methodically cover a scene. However they often struggle to produce accurate estimates in the few image regime, or “in the wild” where photos are taken casually with less rigorous scene coverage.

To address this, a major trend in the field is to move away from classical 2D algorithms, instead leveraging methods that incorporate 3D “priors” or knowledge of the world.

To that end, this paper builds heavily on MASt3R [2], which is a vision transformer model that has been trained to reconstruct a 3D scene from 2D image pairs. The authors added another projection head to output the initial parameters for each gaussian primitive. They further optimize the gaussians through some clever use of the original image pair and randomly selected and rendered novel views, which is basically the original 3DGS algorithm but using synthesized target images instead (hence “zero-shot” in the title).

I do think this general approach will dominate the field in the coming years, but it brings its own unique challenges.

In particular, the quadratic time complexity of transformers is the main computational bottleneck preventing this technique from being scaled up to more than two images at a time, and to resolutions beyond 512 x 512.

Also, naive image matching itself has quadratic time complexity, which is really painful with large dense latent vectors and can’t be accelerated with k-d trees due to the curse of dimensionality. That’s why the authors use a hierarchical coarse-to-fine algorithm that approximates the exact computation and achieves linear time complexity wrt image resolution.

[1] https://en.m.wikipedia.org/wiki/Scale-invariant_feature_tran...

[2] https://github.com/naver/mast3r
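To make the matching-cost point concrete, a toy numpy sketch (illustrative only, not the paper's actual algorithm) comparing the all-pairs matching cost against a coarse-to-fine budget; all sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1024, 64                      # n dense descriptors of dim d per image
feats_a = rng.standard_normal((n, d))
perm = rng.permutation(n)
feats_b = feats_a[perm]              # image 2: same features, reshuffled

# Naive matching: the full (n x n) similarity matrix -> quadratic in n.
sims = feats_a @ feats_b.T
matches = sims.argmax(axis=1)        # for each descriptor in A, best in B
naive_cost = n * n                   # pairwise similarity evaluations

# Coarse-to-fine idea (schematic cost only): match k coarse anchors first,
# then search just w candidates near each descriptor's anchor match.
k, w = 32, 32
coarse_to_fine_cost = k * k + n * w  # roughly linear in n for fixed k, w
```

With dense per-pixel descriptors, n grows with image area, so the n² term is what blows up at higher resolutions; fixing k and w is what buys the (approximate) linear behavior.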

20. recursive ◴[] No.41372402{3}[source]
I think most video game mirrors use basically the same technique.
replies(1): >>41375766 #
21. dagmx ◴[] No.41372531{4}[source]
That’s not really novel either imho, though Google search is failing me on the specific papers I saw at SIGGRAPH.

Imho it’s an interesting combination of technologies, but not novel in and of itself.

22. dagmx ◴[] No.41372536{4}[source]
A lot of splat systems do work on uncalibrated images, so that’s not novel either. They all just do a camera solve, which arguably isn’t terrible for a stereo pair with low divergence.
23. HappMacDonald ◴[] No.41372857[source]
Would love to see it try to handle a scene where the real volume behind the mirror were also available then. :9
24. ghostly_s ◴[] No.41374230[source]
I wish they would just link to prerendered output from the examples. I can never successfully get output from demos on huggingface; I assume due to load throttling when projects get attention?
25. vessenes ◴[] No.41375705[source]
Good question! “Existing technology” is doing a lot of work here, enough that I would say no.

To get splats out of photos generally requires many calibrated photos, all of which include highly accurate placement and orientation data. Ideally cm-accurate, 6-degrees-of-freedom poses, plus, as you mention, calibration.

There are some pipelines that infer this data, and there is some work showing that models can be trained to do well with fuzzy/inaccurate position information. But they would still want 30+ photos to do that desk demo video justice and look good.

Splatt3R (sigh on typing that name, geez) is a very different architecture, in that it’s a transformer model which is trained to ‘know’ about the world, and takes two (only) photos, combines that with its trained-in world knowledge, and hallucinates (infers) a set of Gaussians that it believes are a sort of plausible continuation from those two photos.

26. vessenes ◴[] No.41375736[source]
Think of a single splat as a Gaussian ellipsoid placed in (usually) 3 dimensions, with size, shape, alpha falloff, color (varying both at any point in the ellipsoid, and when viewed from any angle at that point.)

Now think to yourself: “Could I approximate, say, Kirby with 10 splats? And could I get a GPU to home in on the best splats to approximate Kirby, maybe using gradient descent?” Then ask yourself: “Could I get a 4k+ resolution photographic-grade 3D scene that included transmissive and reflective behavior using this method?”

If your answer to the second is “obviously!”, then you have a good head for this ML stuff. Somewhat surprisingly (shockingly?), the answer is ‘yes’, and also somewhat surprisingly, you can use ML pipelines and autograd-type tech to home in on what the 18 or so implied variables for a splat should be, and when you have millions of them, they look like AMAZING reconstructions of scenes. And they are also incredibly quick to render.

Splats are pretty cool.
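A toy version of that gradient-descent fitting step: descend a photometric (L2) loss to recover a single 1D Gaussian's mean, with the gradient written by hand. Illustrative only; real splat pipelines use autograd over all ~18 parameters per splat, for millions of splats, against rendered images:

```python
import numpy as np

xs = np.linspace(-3, 3, 200)
target = np.exp(-0.5 * (xs - 1.0) ** 2)   # "ground truth": a bump at x = 1

mu = -1.0                                 # deliberately bad initial guess
lr = 0.5
for _ in range(200):
    pred = np.exp(-0.5 * (xs - mu) ** 2)  # render with current parameter
    resid = pred - target                 # photometric error
    # d/d_mu of sum(resid**2): chain rule through the Gaussian.
    grad = np.sum(2 * resid * pred * (xs - mu))
    mu -= lr * grad / len(xs)             # gradient-descent update
```

After the loop, `mu` has slid from -1 to the true mean at 1. Swap "one mean" for "18 variables times a few million splats" and "bump" for "rendered views vs. photos", and that is the shape of the optimization.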

27. HexDecOctBin ◴[] No.41375766{4}[source]
Nowadays, reflections can be done through raytracing, with perhaps a blur/smudge to hide low ray counts for performance.