Modeling data distributions is challenging; DDN adopts a simple yet fundamentally different approach from mainstream generative models (Diffusion, GAN, VAE, autoregressive models):
1. The model generates multiple outputs simultaneously in a single forward pass, rather than just one output (see the training sketch below).
2. It uses these multiple outputs to approximate the target distribution of the training data.
3. Together, these outputs represent a discrete distribution, which is why we named it "Discrete Distribution Networks".
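To make points 1 and 2 concrete, here is a minimal PyTorch sketch of one training step for a single DDN layer. All names and sizes (`DDNLayerSketch`, `K=8`, the channel counts) are illustrative assumptions rather than the paper's implementation: a shared stem is computed once, a cheap 1x1 head emits the K candidates, and the guided sampler keeps the candidate closest to the training target.

```python
import torch
import torch.nn as nn

class DDNLayerSketch(nn.Module):
    """Illustrative sketch (not the official implementation) of one DDN layer:
    a shared stem produces features once; a cheap 1x1 head emits K candidate
    images in the same forward pass."""

    def __init__(self, channels=64, K=8):
        super().__init__()
        self.K = K
        self.stem = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(channels, K * 3, 1)  # all K RGB candidates at once

    def forward(self, feat, target):
        h = self.stem(feat)                          # computed once, shared by all K
        B, _, H, W = h.shape
        cands = self.head(h).view(B, self.K, 3, H, W)
        # Guided sampler (training): choose the candidate nearest the target in L2.
        dists = ((cands - target.unsqueeze(1)) ** 2).flatten(2).mean(2)  # (B, K)
        idx = dists.argmin(dim=1)
        chosen = cands[torch.arange(B), idx]         # (B, 3, H, W)
        loss = ((chosen - target) ** 2).mean()       # gradient flows to the winner only
        return chosen, idx, loss
```

The loss trains only the winning candidate; the split-and-prune optimizer mentioned above (not shown here) rebalances candidates that are chosen too often or never.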
Every generative model has its unique properties, and DDN is no exception. Here, we highlight three characteristics of DDN:
- Zero-Shot Conditional Generation (ZSCG), sketched after this list.
- One-dimensional discrete latent representation organized in a tree structure.
- Fully end-to-end differentiable.
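One way to picture ZSCG: at each level, instead of picking a candidate at random, pick the one that best matches the known part of a condition, such as the unmasked pixels in an inpainting task. The helper below is a hypothetical sketch (names and shapes are assumptions, not the repository's API); it only illustrates that selection alone, with no gradients or fine-tuning, steers generation.

```python
import torch

def zscg_select(cands, condition, mask):
    """Hypothetical ZSCG helper: pick, per sample, the candidate whose
    observed pixels (mask == 1) are closest to the condition in L2.
    cands: (B, K, C, H, W); condition, mask: (B, C, H, W)."""
    err = ((cands - condition.unsqueeze(1)) * mask.unsqueeze(1)) ** 2
    dists = err.flatten(2).sum(2) / mask.flatten(1).sum(1, keepdim=True)  # (B, K)
    return dists.argmin(dim=1)  # index used to descend to the next level
```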
Reviews from ICLR:
> I find the method novel and elegant. The novelty is very strong, and this should not be overlooked. This is a whole new method, very different from any of the existing generative models.

> This is a very good paper that can open a door to new directions in generative modeling.
> The best way I can summarize it is a Mixture-of-Experts combined with an 'x0-target' latent diffusion model. The main innovation is the guided sampler (rather than router) & split-and-prune optimizer; making it easier to train.
(This is mentioned in Q1 in the "Common Questions About DDN" section at the bottom.)
- There are no experts. The K outputs approximate samples drawn from the data distribution.
- There is no latent diffusion involved. The network uses plain convolutions, similar to a GAN generator.
- At inference time, you select the sample index ahead of time, so no computation is discarded (see the sketch after this list).
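At inference, the guided sampler is simply replaced by an index choice per level, and the chosen index sequence is exactly the one-dimensional, tree-structured latent mentioned above. A hypothetical sketch (function name and sizes are assumptions, not from the repo):

```python
import torch

def sample_latent(num_levels=10, K=8, generator=None):
    """Hypothetical sketch: unconditional DDN sampling just picks one of the
    K candidate indices at each level. The index list IS the discrete latent;
    fixing a prefix of it fixes a subtree of possible outputs."""
    return [int(torch.randint(K, (1,), generator=generator))
            for _ in range(num_levels)]

# e.g. sample_latent() -> [3, 0, 7, 2, ...]  (one path down the K-ary tree)
```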
Supplement for @f_devd:
During training, the K outputs share the stem features produced by the NN blocks, so generating all K outputs adds only a small amount of extra computation. After L2-distance sampling, discarding the other K-1 outputs therefore costs almost nothing; this is not comparable to discarding K-1 MoE experts, which would be very expensive.
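A back-of-envelope check using the hypothetical sizes from the layer sketch above (64 stem channels, K=8 RGB candidates):

```python
# Rough weight counts for the sketch above (hypothetical sizes, not the paper's):
stem = 2 * (3 * 3 * 64 * 64)   # two 3x3 convs, 64 -> 64 channels: 73,728 weights
head = 1 * 1 * 64 * (8 * 3)    # one 1x1 conv, 64 -> K*3 channels:  1,536 weights
print(head / (stem + head))    # ~0.02: the K candidates add ~2% of the layer,
                               # so discarding K-1 of them wastes almost nothing.
```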