←back to thread

392 points diyer22 | 6 comments | | HN request time: 0.42s | source | bottom

I invented Discrete Distribution Networks, a novel generative model with simple principles and unique properties, and the paper has been accepted to ICLR2025!

Modeling data distribution is challenging; DDN adopts a simple yet fundamentally different approach compared to mainstream generative models (Diffusion, GAN, VAE, autoregressive model):

1. The model generates multiple outputs simultaneously in a single forward pass, rather than just one output. 2. It uses these multiple outputs to approximate the target distribution of the training data. 3. These outputs together represent a discrete distribution. This is why we named it "Discrete Distribution Networks".

Every generative model has its unique properties, and DDN is no exception. Here, we highlight three characteristics of DDN:

- Zero-Shot Conditional Generation (ZSCG). - One-dimensional discrete latent representation organized in a tree structure. - Fully end-to-end differentiable.

Reviews from ICLR:

> I find the method novel and elegant. The novelty is very strong, and this should not be overlooked. This is a whole new method, very different from any of the existing generative models. > This is a very good paper that can open a door to new directions in generative modeling.

1. f_devd ◴[] No.45537537[source]
Pretty interesting architecture, seems very easy to debug, but as a downside you effectively discard K-1 computations at each layer since it's using a sampler rather than a MoE-style router.

The best way I can summarize it is a Mixture-of-Experts combined with an 'x0-target' latent diffusion model. The main innovation is the guided sampler (rather than router) & split-and-prune optimizer; making it easier to train.

replies(2): >>45537638 #>>45540788 #
2. yorwba ◴[] No.45537638[source]
Since the sampling probability is 1/K independent of the input, you don't need to compute K different intermediate outputs at each layer during inference, you can instead decide ahead of time which of the outputs you want to use and only compute that one.

(This is mentioned in Q1 in the "Common Questions About DDN" section at the bottom.)

replies(2): >>45539642 #>>45541272 #
3. kevmo314 ◴[] No.45539642[source]
This is a very clever insight, nice work!
4. ActivePattern ◴[] No.45540788[source]
I don't think you've understood the paper.

- There are no experts. The outputs are approximating random samples from the distribution.

- There is no latent diffusion going on. It's using convolutions similar to a GAN.

- At inference time, you select ahead-of-time the sample index, so you don't discard any computations.

replies(1): >>45541861 #
5. crondee ◴[] No.45541272[source]
you dont get to do that for conditional generation though. When we have a target then we have to generate multiple, pick closest to target, and discard the rest.
6. diyer22 ◴[] No.45541861[source]
I agree with @ActivePattern and thank you for your help in answering.

Supplement for @f_devd:

During training, the K outputs share the stem feature from the NN blocks, so generating the K outputs costs only a small amount of extra computation. After L2-distance sampling, discarding the other K-1 outputs therefore incurs a negligible cost and is not comparable to discarding K-1 MoE experts (which would be very expensive).