The splats are individual elongated 3D blobs (ellipsoids, not spheres) -- thousands to millions of them -- floating in a 3D coordinate space. Fuzzy 3D pixels, essentially. Each carries colour and radiance properties, so it can look different from different angles (e.g. environmental lighting, reflections, etc.)
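To make "each with radiance colour properties" concrete, here's a rough sketch of what one splat stores. The names and layout are my own for illustration, not from any particular implementation:

```python
from dataclasses import dataclass

# Illustrative parameter layout for a single 3D Gaussian splat.
# Field names are made up for readability; real implementations
# pack these into flat tensors.
@dataclass
class Splat:
    position: tuple  # (x, y, z) centre in world space        -> 3 values
    scale: tuple     # per-axis radii of the ellipsoid         -> 3 values
    rotation: tuple  # orientation as a quaternion             -> 4 values
    opacity: float   # how see-through the splat is            -> 1 value
    color: tuple     # base RGB; view-dependent appearance
                     # adds more coefficients on top           -> 3 values

# A single pink blob, Kirby-style:
blob = Splat(
    position=(0.0, 0.0, 0.0),
    scale=(0.5, 0.5, 0.3),          # squashed along one axis = elongated
    rotation=(1.0, 0.0, 0.0, 0.0),  # identity quaternion
    opacity=0.9,
    color=(1.0, 0.71, 0.76),
)

# 3 + 3 + 4 + 1 + 3 = 14 base values per splat; the view-dependent
# colour terms push the count higher.
```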
The magic, obviously, is figuring out how the pixels across a set of photographs correlate once translated into a 3D space filled with splats. Traditionally it took a load of pictures for the process to converge, so doing it with two pictures is pretty amazing.
Note: I've made some simplifying assumptions in the above explanation.
Now think to yourself: “Could I approximate, say, Kirby with 10 splats? And could I get a GPU to home in on the best splats to approximate Kirby, maybe using gradient descent?” Then ask yourself: “Could I get a 4K+, photographic-grade 3D scene, including transmissive and reflective behaviour, using this method?”
If your answer to the second is “obviously!” then you have a good head for this ML stuff. Somewhat surprisingly (shockingly?), the answer is ‘yes’. And also somewhat surprisingly, you can use ML pipelines and autograd-style tech to home in on what the 18 or so implied variables for each splat should be, and when you have millions of them, they produce AMAZING reconstructions of scenes. They're also incredibly quick to render.
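Here's a toy version of the "fit Kirby with a handful of splats" idea, shrunk to 2D so it stays short: render a small greyscale image from isotropic Gaussian splats, then recover the splat parameters by gradient descent on pixel error. Real pipelines use anisotropic 3D Gaussians and autograd; this sketch uses finite-difference gradients just to stay dependency-free beyond NumPy.

```python
import numpy as np

H = W = 32
ys, xs = np.mgrid[0:H, 0:W].astype(float)

def render(params):
    """params is a flat array of (cx, cy, sigma, amplitude) per splat."""
    img = np.zeros((H, W))
    for cx, cy, sigma, amp in params.reshape(-1, 4):
        img += amp * np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return img

def loss(params, target):
    return float(np.mean((render(params) - target) ** 2))

def num_grad(params, target, eps=1e-3):
    # Central-difference gradient; a real pipeline would get this from autograd.
    g = np.zeros_like(params)
    for i in range(len(params)):
        d = np.zeros_like(params); d[i] = eps
        g[i] = (loss(params + d, target) - loss(params - d, target)) / (2 * eps)
    return g

# The "photograph": an image rendered from a known two-splat scene.
truth = np.array([10.0, 12.0, 3.0, 1.0,
                  22.0, 20.0, 4.0, 0.7])
target = render(truth)

# Start from a rough guess and descend toward the target image.
params = np.array([14.0, 14.0, 5.0, 0.5,
                   18.0, 18.0, 5.0, 0.5])
init_loss = loss(params, target)
lr = 20.0
for _ in range(200):
    candidate = params - lr * num_grad(params, target)
    if loss(candidate, target) < loss(params, target):
        params = candidate
    else:
        lr *= 0.5  # crude backtracking: shrink the step if it overshot

final_loss = loss(params, target)
print(f"loss: {init_loss:.5f} -> {final_loss:.5f}")
```

Scale this up to millions of anisotropic 3D splats, swap the pixel loss for a comparison against real photos from known camera poses, and let a GPU autograd framework do the gradients, and you have the basic shape of the training loop.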
Splats are pretty cool.