The VAE Used for Stable Diffusion Is Flawed

(old.reddit.com)

268 points BrokenCogs | 2 comments | 01 Feb 24 12:25 UTC | HN request time: 0.41s | source

Show context

joefourier ◴[01 Feb 24 14:00 UTC] No.39215949[source]▶

I’ve done a lot of experiments with latent diffusion and also discovered a few flaws in the SD VAE’s training and architecture, which have hardly no attention brought to them. This is concerning as the VAE is a crucial competent when it comes to image quality and is responsible for many of the artefacts associated with AI generated imagery, and no amount of training the diffusion model will fix them.

A few I’ve seen are:

- The goal should be to have latent outputs as closely resemble gaussian distributed terms between -1 and 1 with a variance of 1, but the outputs are unbounded (you could easily clamp or apply tanh to force them to be between -1 and 1), and the KL loss weight is too low, hence why the latents are weighed by a magic number to more closely fit the -1 to 1 range before being invested by the diffusion model.

- To decrease the computational load of the diffusion model, you should reduce the spatial dimensions of the input - having a low number of channels is irrelevant. The SD VAE turns each 8x8x3 block into a 1x1x4 block when it could be turning it into a 1x1x8 (or even higher) block and preserve much more detail at basically 0 computational cost, since the first operation the diffusion model does is apply a convolution to greatly increase the number of channels.

- The discriminator is based on a tiny PatchGAN, which is an ancient model by modern standards. You can have much better results by applying some of the GAN research of the last few years, or of course using a diffusion decoder which is then distilled either with consistency or adversarial distillation.

- KL divergence in general is not even the most optimal way to achieve the goals of a latent diffusion model’s VAE, which is to decrease the spatial dimensions of the input images and have a latent space that’s robust to noise and local perturbations. I’ve had better results with a vanilla AE, clamping the outputs, having a variance loss term and applying various perturbations to the latents before they are ingested by the decoder.

replies(6): >>39216175 #>>39216367 #>>39216653 #>>39217093 #>>39219506 #>>39316949 #

meowface ◴[01 Feb 24 14:57 UTC] No.39216653[source]▶

>>39215949 #

Is anyone actively working on new models that take these (and the issue raised in the link) into account?

replies(2): >>39217419 #>>39218978 #

refulgentis ◴[01 Feb 24 16:06 UTC] No.39217419[source]▶

>>39216653 #

Yes: from TFA, SD XL released some months ago uses a new VAE.

n.b. clarifying because most of the top comments currently are recommending this person is hired / inquiring if anyones begun work to leverage their insights: they're discussing known issues in a 2 year old model as if it was newly discovered issues in a recent model. (TFA points this out as well)

replies(1): >>39218520 #

joefourier ◴[01 Feb 24 17:25 UTC] No.39218520[source]▶

>>39217419 #

The SD-XL VAE doesn’t take into account any of those insights, it’s the exact same as the SD1/2 one, just trained from scratch with a batch size of 256 instead of 9 and with EMA.

replies(1): >>39218940 #

refulgentis ◴[01 Feb 24 17:55 UTC] No.39218940[source]▶

>>39218520 #

No. Idk where you got this idea.

replies(2): >>39219372 #>>39219401 #

jamilton ◴[01 Feb 24 18:27 UTC] No.39219372[source]▶

>>39218940 #

Can someone provide evidence one way or the other? I don’t know enough to do it myself.

replies(1): >>39220311 #

1. refulgentis ◴[01 Feb 24 19:44 UTC] No.39220311[source]▶

>>39219372 #

c.f. https://news.ycombinator.com/item?id=39220027, or TFA*. They're doing a gish gallop, and I can't really justify burning more karma to poke holes in a stranger's overly erudite tales. I swing about 8 points to the negative when they reply with more.

* multiple sources including OP:

"The SDXL VAE of the same architecture doesn't have this problem,"

"If future models using KL autoencoders do not use the pretrained CompVis checkpoints and use one like SDXL's that is trained properly, they'll be fine."

"SDXL is not subject to this issue because it has its own VAE, which as far as I can tell is trained correctly and does not exhibit the same issues."

replies(1): >>39236156 #

2. joefourier ◴[02 Feb 24 23:56 UTC] No.39236156[source]▶

>>39220311 (TP) #

I think you must have misunderstood me, I didn’t say the SD-XL VAE had the same issue as in OP. What I said was that it didn’t take into account some of my points that came up during my research:

- Bounding the outputs to -1, 1 and optimising the variance directly to make it approach 1

- Increasing the number of channels to 8, as the spatial resolution reduction is most important for latent diffusion

- Using a more modern discriminator architecture instead of PatchGAN’s

- Using a vanilla AE with various perturbations instead of KL divergence

Now SD-XL’s VAE is very good and superior to its predecessor, on account of an improved training procedure, but it didn’t use any of the above tricks. It may even be the case that they would have made no difference in the end - they were useful to me in the context of training models with limited compute.

↑