Modeling data distributions is challenging; DDN adopts a simple yet fundamentally different approach from mainstream generative models (Diffusion, GAN, VAE, autoregressive models):
1. The model generates multiple outputs simultaneously in a single forward pass, rather than just one output (see the training sketch below).
2. It uses these multiple outputs to approximate the target distribution of the training data.
3. Together, these outputs represent a discrete distribution, which is why we named it "Discrete Distribution Networks".
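To make points 1 and 2 concrete, here is a minimal PyTorch sketch of one training step for a single DDN layer. All names and sizes (`DDNLayerSketch`, `K=8`, the channel counts) are illustrative assumptions rather than the paper's implementation: a shared stem is computed once, a cheap 1x1 head emits the K candidates, and the guided sampler keeps the candidate closest to the training target.

```python
import torch
import torch.nn as nn

class DDNLayerSketch(nn.Module):
    """Illustrative sketch (not the official implementation) of one DDN layer:
    a shared stem produces features once; a cheap 1x1 head emits K candidate
    images in the same forward pass."""

    def __init__(self, channels=64, K=8):
        super().__init__()
        self.K = K
        self.stem = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(channels, K * 3, 1)  # all K RGB candidates at once

    def forward(self, feat, target):
        h = self.stem(feat)                          # computed once, shared by all K
        B, _, H, W = h.shape
        cands = self.head(h).view(B, self.K, 3, H, W)
        # Guided sampler (training): choose the candidate nearest the target in L2.
        dists = ((cands - target.unsqueeze(1)) ** 2).flatten(2).mean(2)  # (B, K)
        idx = dists.argmin(dim=1)
        chosen = cands[torch.arange(B), idx]         # (B, 3, H, W)
        loss = ((chosen - target) ** 2).mean()       # gradient flows to the winner only
        return chosen, idx, loss
```

The loss trains only the winning candidate; the split-and-prune optimizer mentioned above (not shown here) rebalances candidates that are chosen too often or never.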
Every generative model has its unique properties, and DDN is no exception. Here, we highlight three characteristics of DDN:
- Zero-Shot Conditional Generation (ZSCG), sketched after this list.
- One-dimensional discrete latent representation organized in a tree structure.
- Fully end-to-end differentiable.
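One way to picture ZSCG: at each level, instead of picking a candidate at random, pick the one that best matches the known part of a condition, such as the unmasked pixels in an inpainting task. The helper below is a hypothetical sketch (names and shapes are assumptions, not the repository's API); it only illustrates that selection alone, with no gradients or fine-tuning, steers generation.

```python
import torch

def zscg_select(cands, condition, mask):
    """Hypothetical ZSCG helper: pick, per sample, the candidate whose
    observed pixels (mask == 1) are closest to the condition in L2.
    cands: (B, K, C, H, W); condition, mask: (B, C, H, W)."""
    err = ((cands - condition.unsqueeze(1)) * mask.unsqueeze(1)) ** 2
    dists = err.flatten(2).sum(2) / mask.flatten(1).sum(1, keepdim=True)  # (B, K)
    return dists.argmin(dim=1)  # index used to descend to the next level
```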
Reviews from ICLR:
> I find the method novel and elegant. The novelty is very strong, and this should not be overlooked. This is a whole new method, very different from any of the existing generative models.

> This is a very good paper that can open a door to new directions in generative modeling.
> The best way I can summarize it is a Mixture-of-Experts combined with an 'x0-target' latent diffusion model. The main innovation is the guided sampler (rather than router) & split-and-prune optimizer; making it easier to train.
(This is mentioned in Q1 in the "Common Questions About DDN" section at the bottom.)
- There are no experts. The K outputs approximate samples drawn from the data distribution.
- There is no latent diffusion involved. The network uses plain convolutions, similar to a GAN generator.
- At inference time, you select the sample index ahead of time, so no computation is discarded (see the sketch after this list).
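At inference, the guided sampler is simply replaced by an index choice per level, and the chosen index sequence is exactly the one-dimensional, tree-structured latent mentioned above. A hypothetical sketch (function name and sizes are assumptions, not from the repo):

```python
import torch

def sample_latent(num_levels=10, K=8, generator=None):
    """Hypothetical sketch: unconditional DDN sampling just picks one of the
    K candidate indices at each level. The index list IS the discrete latent;
    fixing a prefix of it fixes a subtree of possible outputs."""
    return [int(torch.randint(K, (1,), generator=generator))
            for _ in range(num_levels)]

# e.g. sample_latent() -> [3, 0, 7, 2, ...]  (one path down the K-ary tree)
```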
Supplement for @f_devd:
During training, the K outputs share the stem features produced by the NN blocks, so generating all K outputs adds only a small amount of extra computation. After L2-distance sampling, discarding the other K-1 outputs therefore costs almost nothing; this is not comparable to discarding K-1 MoE experts, which would be very expensive.
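A back-of-envelope check using the hypothetical sizes from the layer sketch above (64 stem channels, K=8 RGB candidates):

```python
# Rough weight counts for the sketch above (hypothetical sizes, not the paper's):
stem = 2 * (3 * 3 * 64 * 64)   # two 3x3 convs, 64 -> 64 channels: 73,728 weights
head = 1 * 1 * 64 * (8 * 3)    # one 1x1 conv, 64 -> K*3 channels:  1,536 weights
print(head / (stem + head))    # ~0.02: the K candidates add ~2% of the layer,
                               # so discarding K-1 of them wastes almost nothing.
```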