268 points prashp | 21 comments
1. joefourier ◴[] No.39215949[source]
I’ve done a lot of experiments with latent diffusion and also discovered a few flaws in the SD VAE’s training and architecture, which have received hardly any attention. This is concerning because the VAE is a crucial component when it comes to image quality and is responsible for many of the artefacts associated with AI-generated imagery, and no amount of training the diffusion model will fix them.

A few I’ve seen are:

- The goal should be for the latent outputs to closely resemble Gaussian-distributed values between -1 and 1 with a variance of 1, but the outputs are unbounded (you could easily clamp them or apply tanh to force them between -1 and 1), and the KL loss weight is too low, hence why the latents are scaled by a magic number to more closely fit the -1 to 1 range before being ingested by the diffusion model.

- To decrease the computational load of the diffusion model, you should reduce the spatial dimensions of the input - having a low number of channels is irrelevant. The SD VAE turns each 8x8x3 block into a 1x1x4 block when it could be turning it into a 1x1x8 (or even higher) block and preserve much more detail at basically 0 computational cost, since the first operation the diffusion model does is apply a convolution to greatly increase the number of channels.
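The arithmetic behind this point, sketched for a 512x512 input (shapes follow SD's 8x spatial downsampling; the "wider" variant is the hypothetical 8-channel latent proposed above):

```python
# Compression arithmetic for an 8x-downsampling VAE.
# Shapes are (height, width, channels).

image = (512, 512, 3)          # input image
sd_latent = (64, 64, 4)        # SD VAE: each 8x8x3 block -> 1x1x4
wider_latent = (64, 64, 8)     # proposed: 1x1x8, same spatial size

def numel(shape):
    n = 1
    for d in shape:
        n *= d
    return n

# Doubling the channels halves the compression ratio (more detail kept)
# while leaving the 64x64 spatial grid -- and hence the diffusion
# model's convolution/attention cost -- unchanged.
print(numel(image) / numel(sd_latent))      # 48x compression
print(numel(image) / numel(wider_latent))   # 24x compression
```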

- The discriminator is based on a tiny PatchGAN, which is an ancient model by modern standards. You can have much better results by applying some of the GAN research of the last few years, or of course using a diffusion decoder which is then distilled either with consistency or adversarial distillation.

- KL divergence in general is not even the most optimal way to achieve the goals of a latent diffusion model’s VAE, which is to decrease the spatial dimensions of the input images and have a latent space that’s robust to noise and local perturbations. I’ve had better results with a vanilla AE, clamping the outputs, having a variance loss term and applying various perturbations to the latents before they are ingested by the decoder.
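A hypothetical sketch of that vanilla-AE alternative (the function and its parameters are illustrative, not from any released code): clamp the latents, penalize deviation from unit variance, and perturb before decoding.

```python
import numpy as np

rng = np.random.default_rng(0)

def regularize_latents(z, noise_std=0.1):
    """Vanilla-AE alternative to a KL term, as described above:
    clamp the latents, penalize non-unit variance, add noise."""
    z = np.clip(z, -1.0, 1.0)                          # bound the latents
    var_loss = (z.var() - 1.0) ** 2                    # push variance toward 1
    z_noisy = z + rng.normal(0.0, noise_std, z.shape)  # robustness perturbation
    return z_noisy, var_loss

z = rng.normal(0.0, 0.5, size=(4, 16, 16))
z_noisy, var_loss = regularize_latents(z)
```

The decoder is then trained on `z_noisy`, so it learns to tolerate the kind of local perturbations a diffusion model's imperfect denoising will produce.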

replies(6): >>39216175 #>>39216367 #>>39216653 #>>39217093 #>>39219506 #>>39316949 #
2. exo-pla-net ◴[] No.39216175[source]
Sounds like they ought to hire you.
3. Cacti ◴[] No.39216367[source]
All your points are good ones and were knowable by any researcher at the time who wasn’t, idk, a new grad or new to CV. I always assumed they just threw the VAE in there using the default options from the original VAE paper and never thought about it much again, or never looked into it due to the training cost (for hyperparam search, mainly). I don’t remember most of the points you raised being common knowledge when the VAE paper came out, but they certainly were when the stable diffusion paper came out.
replies(1): >>39216692 #
4. meowface ◴[] No.39216653[source]
Is anyone actively working on new models that take these (and the issue raised in the link) into account?
replies(2): >>39217419 #>>39218978 #
5. michaelt ◴[] No.39216692[source]
> All your points are good ones and were knowable by any researcher at the time who wasn’t, idk, a new grad or new to CV.

I think you are radically overstating how obvious some of these things are.

What you call "just threw the VAE in there using the default options from the original VAE paper" is what another person might call "used a proven reference implementation, with the settings recommended by its creator".

Sure, there are design flaws with SD1.0 which feel obvious today - they've since published SDXL and, having read the paper, I wouldn't even consider going about such a project without "Conditioning the Model on Cropping Parameters". But the truth is this stuff is only obvious to me because someone else figured it out and told me.

replies(1): >>39220850 #
6. nullc ◴[] No.39217093[source]
I would assume there is not much attention because better results come from just dropping the VAE entirely unless you're chasing a small resource budget, but most of the research interest is in state-of-the-art work, which is hardly resource-bounded.
replies(1): >>39218798 #
7. refulgentis ◴[] No.39217419[source]
Yes: from TFA, SD XL released some months ago uses a new VAE.

n.b. clarifying because most of the top comments currently are recommending this person be hired / inquiring if anyone has begun work to leverage their insights: they're discussing known issues in a 2-year-old model as if they were newly discovered issues in a recent model. (TFA points this out as well)

replies(1): >>39218520 #
8. joefourier ◴[] No.39218520{3}[source]
The SD-XL VAE doesn’t take into account any of those insights, it’s the exact same as the SD1/2 one, just trained from scratch with a batch size of 256 instead of 9 and with EMA.
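The EMA mentioned here is the standard exponential moving average of the weights, kept alongside the trained weights and used at inference time. A minimal sketch (the decay value is a typical choice for illustration, not SD-XL's actual setting):

```python
# Keep a slow-moving average of the model weights; sample from the
# averaged copy rather than the raw, noisier optimizer state.
def ema_update(ema_weights, weights, decay=0.9999):
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, weights)]

weights = [1.0, 2.0]
ema = list(weights)          # initialise EMA from the current weights
weights = [2.0, 3.0]         # pretend one optimiser step happened
ema = ema_update(ema, weights, decay=0.9)
print(ema)                   # approximately [1.1, 2.1]
```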
replies(1): >>39218940 #
9. godelski ◴[] No.39218798[source]
There are plenty of works showing diffusion with other backbones, ViT is the easiest to find.
10. refulgentis ◴[] No.39218940{4}[source]
No. Idk where you got this idea.
replies(2): >>39219372 #>>39219401 #
11. topwalktown ◴[] No.39218978[source]
yeah, check out the Emu paper by Meta. They basically do all of what is mentioned in the above comment.
12. jamilton ◴[] No.39219372{5}[source]
Can someone provide evidence one way or the other? I don’t know enough to do it myself.
replies(1): >>39220311 #
13. joefourier ◴[] No.39219401{5}[source]
From the SD-XL paper:

> To this end, we train the same autoencoder architecture used for the original Stable Diffusion at a larger batch-size (256 vs 9) and additionally track the weights with an exponential moving average. The resulting autoencoder outperforms the original model in all evaluated reconstruction metrics

And if you look at the SD-XL VAE config file, it has a scaling factor of 0.13025 while the original SD VAE had one of 0.18215 - meaning it was also trained with an unbounded output. The architecture is also the exact same if you inspect the model file.
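That scaling factor is just the reciprocal of the measured standard deviation of the latents over a training set, which is why its presence implies an unbounded, non-unit-variance output. A sketch with simulated latents (the numbers below are chosen to match SD's published constant; the data itself is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for encoder latents collected over a dataset.
# For SD the measured std works out to 1/0.18215 (about 5.49); for the
# SD-XL VAE it is 1/0.13025 (about 7.68) -- both far from the unit
# variance a well-behaved KL-regularised latent space would have.
latents = rng.normal(0.0, 5.49, size=(1000, 4, 8, 8))

scaling_factor = 1.0 / latents.std()
print(round(float(scaling_factor), 3))   # close to 0.182
```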

But if you have any details about the training procedure of the new VAE that they didn’t include in the paper, feel free to link to them, I’d love to take a look.

14. smrtinsert ◴[] No.39219506[source]
How do I get your smarts! I want to understand this stuff desperately.
replies(2): >>39219643 #>>39220027 #
15. pas ◴[] No.39219643[source]
it takes time, work, lots of trial-and-error to find which learning style works best for you.

https://www.youtube.com/watch?v=vJo7hiMxbQ8 autoencoders

https://www.youtube.com/watch?v=x6T1zMSE4Ts NVAE: A Deep Hierarchical Variational Autoencoder

https://www.youtube.com/watch?v=eyxmSmjmNS0 GAN paper

and then of course you need to check the Stable Diffusion architecture.

oh, also lurking on Reddit to simply see the enormous breadth of ML theory: https://old.reddit.com/r/MachineLearning/search?q=VAE&restri...

and then of course, maybe if someone's nickname has fourier in it, they probably have a sizeable headstart when it comes to math/theory heavy stuff :)

and some hands-on tinkering never hurts! https://towardsdatascience.com/variational-autoencoder-demys...

16. brynbryn ◴[] No.39220027[source]
These are very fine ways of explaining simple things in an ego-boosting manner. The more you work with ML these days, the more you appreciate it. It happens with every new technology bubble.

In regular terms he's saying the outputs aren't coming out in the range that the next stage can work with properly. It wants values between -1 and +1 and it isn't guaranteeing that. Then he's saying you can make it quicker to process by putting the data into a more compact structure for the next stage.

The discriminator could be improved, i.e. we could capture better input.

KL divergence is not an accurate tool for manipulating the data, and we have better ones.

ML is largely a huge pot of turning regular computer science and maths into unintelligible papers. If you'd like reassurance, look up something like min-max functions and sigmoids. You've likely worked with these since you progressed from HelloWorld.cpp but wouldn't care to shout about them in public.

replies(1): >>39228200 #
17. refulgentis ◴[] No.39220311{6}[source]
cf. https://news.ycombinator.com/item?id=39220027, or TFA*. They're doing a gish gallop, and I can't really justify burning more karma to poke holes in a stranger's overly erudite tales. I swing about 8 points to the negative when they reply with more.

* multiple sources including OP:

"The SDXL VAE of the same architecture doesn't have this problem,"

"If future models using KL autoencoders do not use the pretrained CompVis checkpoints and use one like SDXL's that is trained properly, they'll be fine."

"SDXL is not subject to this issue because it has its own VAE, which as far as I can tell is trained correctly and does not exhibit the same issues."

replies(1): >>39236156 #
18. Cacti ◴[] No.39220850{3}[source]
I’m not criticizing them or the approach. That’s what I would have done most likely. But the things you mentioned aren’t particular to stable diffusion, or even VAEs. Yes, the best way to learn is to be told, or to build up applied/implementation experience until you learn them directly. But almost any CV model will run into at least one of those issues, and I would expect someone with, idk, > 1y experience in applied work to know these things. Perhaps I am wrong to do that.
19. adammarples ◴[] No.39228200{3}[source]
I thought that it was a very clear explanation that I appreciated, I didn't detect any ego boosting nonsense
20. joefourier ◴[] No.39236156{7}[source]
I think you must have misunderstood me, I didn’t say the SD-XL VAE had the same issue as in OP. What I said was that it didn’t take into account some of my points that came up during my research:

- Bounding the outputs to -1, 1 and optimising the variance directly to make it approach 1

- Increasing the number of channels to 8, as the spatial resolution reduction is most important for latent diffusion

- Using a more modern discriminator architecture instead of PatchGAN’s

- Using a vanilla AE with various perturbations instead of KL divergence

Now SD-XL’s VAE is very good and superior to its predecessor, on account of an improved training procedure, but it didn’t use any of the above tricks. It may even be the case that they would have made no difference in the end - they were useful to me in the context of training models with limited compute.

21. parlancex ◴[] No.39316949[source]
Everything you've said is _intuitively_ correct, but empirically wrong. I've experimented with training VAEs for audio diffusion for the last few months and here's what I found:

- Although the best results for a stand-alone VAE might require increasing the KL loss weight as high as you can to reach an isotropic gaussian latent space without compromising reconstruction quality, beyond a certain point this actually substantially decreases the ability of the diffusion model to properly interpret the latent space and degrades generation quality. The motivation behind constraining the KL loss weight is to ensure the VAE only provides _perceptual_ compression, which VAEs are quite good at, not _semantic_ compression, for which VAEs are a poor generative model compared to diffusion. This is explained in the original latent diffusion paper on which Stable Diffusion was based: https://arxiv.org/pdf/2112.10752.pdf
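The KL term whose weight is being traded off here has a standard closed form for a diagonal-Gaussian encoder posterior against N(0, I); a small sketch of it:

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ),
    the term whose loss weight is being discussed above."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

# KL is zero exactly when the posterior already is N(0, I)...
print(kl_to_standard_normal(np.zeros(4), np.zeros(4)))  # 0.0

# ...and grows as the latents drift from it; a larger KL weight pushes
# this down at the cost of reconstruction quality.
print(kl_to_standard_normal(np.ones(4), np.zeros(4)))   # 2.0
```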

- You're correct that trading spatial dimensions for channels is a very easy way to increase reconstruction quality of a stand-alone VAE, but it is a very poor choice when the latents are going into a diffusion model. This again makes the latent space harder for the diffusion model to interpret, and again isn't needed if the VAE is strictly operating in the perceptual compression regime as opposed to the semantic compression regime. The underlying reason is that channel-wise degrees of freedom have no inherent structure imposed by the underlying convolutional network; in the limit where you hypothetically compress the spatial dimensions to a single point with a large number of channels, the latent space is completely unstructured and the entropy of the latents is fully maximized; there are no patterns left whatsoever for the diffusion model to work with.

TLDR: Designing VAEs for latent diffusion has a different set of design constraints than designing a VAE as a stand-alone generative model.