The VAE Used for Stable Diffusion Is Flawed

I’ve done a lot of experiments with latent diffusion and discovered a few flaws in the SD VAE’s training and architecture that have had hardly any attention brought to them. This is concerning, as the VAE is a crucial component when it comes to image quality and is responsible for many of the artefacts associated with AI-generated imagery, and no amount of training the diffusion model will fix them.

A few I’ve seen are:

- The goal should be for the latent outputs to closely resemble Gaussian-distributed values with a variance of 1 that sit between -1 and 1, but the outputs are unbounded (you could easily clamp or apply tanh to force them to be between -1 and 1), and the KL loss weight is too low, which is why the latents are scaled by a magic number to fit the -1 to 1 range more closely before being ingested by the diffusion model.

- To decrease the computational load of the diffusion model, you should reduce the spatial dimensions of the input; keeping the number of channels low is irrelevant. The SD VAE turns each 8x8x3 block into a 1x1x4 block when it could be turning it into a 1x1x8 (or even higher) block and preserving much more detail at basically zero computational cost, since the first operation the diffusion model does is apply a convolution to greatly increase the number of channels.

- The discriminator is based on a tiny PatchGAN, which is an ancient model by modern standards. You can have much better results by applying some of the GAN research of the last few years, or of course using a diffusion decoder which is then distilled either with consistency or adversarial distillation.

- KL divergence in general is not even the best way to achieve the goals of a latent diffusion model’s VAE, which are to decrease the spatial dimensions of the input images and to have a latent space that’s robust to noise and local perturbations. I’ve had better results with a vanilla AE, clamping the outputs, adding a variance loss term and applying various perturbations to the latents before they are ingested by the decoder.
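To put rough numbers on the channel trade-off from the second point, here is a minimal back-of-the-envelope sketch. It assumes the diffusion model's first layer is a 3x3 convolution mapping the latent channels to 320 feature channels on a 64x64 latent; those figures and the function names are assumptions for illustration, not something stated above.

```python
# Rough arithmetic for the 8x8x3 -> 1x1xC trade-off.
# Assumption: first diffusion layer is a 3x3 conv to 320 channels on a 64x64 latent.

def compression_ratio(patch=8, image_channels=3, latent_channels=4):
    """Number of input values represented by each latent value."""
    return (patch * patch * image_channels) / latent_channels


def first_conv_params(latent_channels, out_channels=320, kernel=3):
    """Weights plus biases of the first convolution."""
    return latent_channels * out_channels * kernel * kernel + out_channels


def first_conv_macs(latent_channels, height=64, width=64,
                    out_channels=320, kernel=3):
    """Multiply-accumulates for one pass of that convolution."""
    return height * width * latent_channels * out_channels * kernel * kernel


if __name__ == "__main__":
    for channels in (4, 8, 16):
        print(
            channels,
            compression_ratio(latent_channels=channels),
            first_conv_params(channels),
            first_conv_macs(channels),
        )
```

Under these assumptions, going from 4 to 8 latent channels halves the compression ratio (48x to 24x) and roughly doubles the cost of that one input layer, while every later layer is unchanged. This only speaks to compute, not to how easy the resulting latents are for the diffusion model to learn.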

Everything you've said is _intuitively_ correct, but empirically wrong. I've experimented with training VAEs for audio diffusion for the last few months and here's what I found:

- Although the best results for a stand-alone VAE might require increasing the KL loss weight as high as you can to reach an isotropic gaussian latent space without compromising reconstruction quality, beyond a certain point this actually substantially decreases the ability of the diffusion model to properly interpret the latent space and degrades generation quality. The motivation behind constraining the KL loss weight is to ensure the VAE only provides _perceptual_ compression, which VAEs are quite good at, not _semantic_ compression, for which VAEs are a poor generative model compared to diffusion. This is explained in the original latent diffusion paper on which Stable Diffusion was based: https://arxiv.org/pdf/2112.10752.pdf

- You're correct that trading spatial dimensions for channels is a very easy way to increase the reconstruction quality of a stand-alone VAE, but it is a very poor choice when the latents are going into a diffusion model. This again makes the latent space harder for the diffusion model to interpret, and again isn't needed if the VAE is strictly operating in the perceptual compression regime as opposed to the semantic compression regime. The underlying reason is that channel-wise degrees of freedom have no inherent structure imposed by the underlying convolutional network. In the limit where you hypothetically compress the spatial dimensions to a single point with a large number of channels, the latent space is completely unstructured and the entropy of the latents is fully maximized; there are no patterns left whatsoever for the diffusion model to work with.
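For reference, the KL term whose weight is being argued about can be written out as a small sketch. It assumes the usual setup of a diagonal Gaussian posterior and a standard normal prior; the function names and the `kl_weight` default are hypothetical, chosen only to show a "small weight" regime.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, 1) ), summed over latent elements."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar)
    )


def vae_loss(recon_loss, mu, logvar, kl_weight=1e-6):
    """Reconstruction loss plus a weighted KL penalty.

    kl_weight is an illustrative value, not a recommendation.
    """
    return recon_loss + kl_weight * kl_to_standard_normal(mu, logvar)


if __name__ == "__main__":
    mu = [2.0, -1.0]
    logvar = [0.0, 0.0]
    # With a tiny weight the KL term barely moves the total loss,
    # so the encoder is only gently pulled toward N(0, 1).
    print(kl_to_standard_normal(mu, logvar))
    print(vae_loss(0.1, mu, logvar))
    print(vae_loss(0.1, mu, logvar, kl_weight=1.0))
```

Raising `kl_weight` pulls the latents harder toward an isotropic Gaussian; the point made above is that past some threshold this stops helping the downstream diffusion model even if the stand-alone VAE still reconstructs well.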

TLDR: Designing VAEs for latent diffusion has a different set of design constraints than designing a VAE as a stand-alone generative model.