This should be seen in light of the Great Differentiable Convergence™:
NERFs backpropagating pixels colors into the volume, but also semantic information from the image label, embedded from an LLM reading a multimedia document.
Or something like this. Anyway, wanna buy an NVIDIA GPU ;)?