joefourier:
I’ve done a lot of experiments with latent diffusion and discovered a few flaws in the SD VAE’s training and architecture which have received hardly any attention. This is concerning because the VAE is a crucial component when it comes to image quality: it is responsible for many of the artefacts associated with AI-generated imagery, and no amount of training the diffusion model will fix them.

A few I’ve seen are below (rough code sketches for each point follow the list):

- The goal should be latent outputs that closely resemble Gaussian-distributed values between -1 and 1 with a variance of 1, but the encoder’s outputs are unbounded (you could easily clamp them or apply tanh to force them into [-1, 1]), and the KL loss weight is too low. This is why the latents have to be scaled by a magic number to more closely fit the -1 to 1 range before being ingested by the diffusion model.

- To decrease the computational load of the diffusion model, you should reduce the spatial dimensions of the input; keeping the number of channels low is irrelevant. The SD VAE turns each 8x8x3 block into a 1x1x4 block when it could turn it into a 1x1x8 (or even higher) block and preserve much more detail at essentially zero computational cost, since the first operation the diffusion model performs is a convolution that greatly increases the number of channels anyway.

- The discriminator is a tiny PatchGAN, which is an ancient model by modern standards. You can get much better results by applying some of the GAN research of the last few years, or of course by using a diffusion decoder which is then distilled with either consistency or adversarial distillation.

- KL divergence in general is not even the optimal way to achieve the goals of a latent diffusion model’s VAE, which are to reduce the spatial dimensions of the input images and produce a latent space that is robust to noise and local perturbations. I’ve had better results with a vanilla AE: clamping the outputs, adding a variance loss term, and applying various perturbations to the latents before they are ingested by the decoder.
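
For the first point, here is a minimal sketch of bounding the latents rather than relying on a post-hoc scale. The vae_encoder module and its mean/logvar output layout are assumptions for illustration; 0.18215 is the well-known scale constant used for the SD v1 VAE.

    import torch

    def encode_bounded(vae_encoder, images):
        # Assumed layout: the encoder outputs concatenated mean and
        # logvar of the latent posterior (standard VAE convention).
        mean, logvar = vae_encoder(images).chunk(2, dim=1)
        z = mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)
        # What SD actually does: rescale by a magic constant (0.18215
        # for the SD v1 VAE) so latents are roughly unit-variance.
        z_scaled = z * 0.18215
        # What the comment proposes: bound the latents directly, e.g.
        # with tanh, so the diffusion model always sees [-1, 1] values.
        z_bounded = torch.tanh(z)
        return z_scaled, z_bounded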
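
For the second point, a back-of-the-envelope check that the latent channel count barely affects the diffusion model's cost: only the input convolution scales with it, and that layer is tiny. The base width of 320 matches SD v1's UNet; the rest of the numbers are illustrative.

    def first_conv_flops(h, w, in_ch, base_ch=320, k=3):
        # Multiply-accumulates in the UNet's input conv (stride 1).
        return h * w * in_ch * base_ch * k * k

    hw = 64  # a 512px image becomes a 64x64 latent at 8x downsampling
    for in_ch in (4, 8, 16):
        print(in_ch, first_conv_flops(hw, hw, in_ch))
    # Doubling the latent channels doubles only this one small layer;
    # every later layer already runs at base_ch channels, so total cost
    # is nearly unchanged while the VAE keeps twice the detail per block.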
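
For the third point, this is roughly what a small PatchGAN discriminator looks like (a generic reconstruction, not the exact SD code): a short stack of strided convolutions ending in a per-patch real/fake logit map. The point is how little capacity this critic has compared to modern GAN discriminators.

    import torch.nn as nn

    def patchgan(in_ch=3, width=64):
        layers, ch = [], in_ch
        for mult in (1, 2, 4):
            layers += [nn.Conv2d(ch, width * mult, 4, stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch = width * mult
        # Final conv produces one real/fake logit per image patch.
        layers += [nn.Conv2d(ch, 1, 4, stride=1, padding=1)]
        return nn.Sequential(*layers)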
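
And for the last point, a sketch of the KL-free alternative described above: hard-bounded latents, a loss pushing per-channel variance toward 1, and noise injected before decoding so the decoder becomes robust to local perturbations. The encoder/decoder are placeholder modules and the loss weights are made up.

    import torch
    import torch.nn.functional as F

    def ae_step(encoder, decoder, images, noise_std=0.1, var_weight=0.01):
        z = encoder(images).clamp(-1.0, 1.0)  # bounded latents
        # Push each latent channel's variance toward 1 (variance loss term).
        var_loss = (z.var(dim=(0, 2, 3)) - 1.0).pow(2).mean()
        # Perturb the latents before decoding so the decoder tolerates noise.
        z_noisy = z + noise_std * torch.randn_like(z)
        recon = decoder(z_noisy)
        rec_loss = F.mse_loss(recon, images)
        return rec_loss + var_weight * var_loss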

smrtinsert:
How do I get your smarts? I want to understand this stuff desperately.
brynbryn:
These are very fine ways of explaining simple things in an ego-boosting manner. The more you work with ML these days, the more you notice this pattern; it happens with every new technology bubble.

In regular terms, he’s saying the outputs aren’t coming out in a form that the next stages can work with properly: the next stage wants values between -1 and +1, and it isn’t guaranteeing that. Then he’s saying you can make it quicker to process by putting the data into a more compact structure for the next stage.

The discriminator could be improved, i.e. we could get a better training signal on output quality.

KL divergence is not an ideal tool for shaping the data, and we have better options.

A huge part of ML is turning regular computer science and maths into unintelligible papers. If you’d like reassurance, look up something like minmax functions and sigmoids. You’ve likely worked with these since you progressed from HelloWorld.cpp but wouldn’t care to shout about them in public.

adammarples:
I thought it was a very clear explanation and I appreciated it; I didn’t detect any ego-boosting nonsense.