268 points prashp | 5 comments
1. mmastrac ◴[] No.39216929[source]
There seems to be a convincing debunking thread on Twitter, but I definitely don't have the chops to evaluate either claim:

https://twitter.com/Ethan_smith_20/status/175306260429219874...

replies(3): >>39217076 #>>39218189 #>>39218991 #
2. bjornsing ◴[] No.39217076[source]
I only took a quick glance, but it looks like a good debunking to me, especially the part where he points to a section of the original paper clearly stating that the VAE was trained with a 10e-6 weight factor on the KL divergence term.

I think that if this problem were fixed, the VAE would produce less appealing, blurrier images. This is a classic problem with VAEs: more mathematically correct, but less visually appealing.
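
For concreteness, a minimal sketch of the kind of objective being described, with the reconstruction term dominating a heavily down-weighted KL term (illustrative names and default value, not the actual Stable Diffusion training code):

    # Illustrative VAE objective with a tiny KL weight (sketch, not SD's code).
    import torch
    import torch.nn.functional as F

    def vae_loss(recon, target, mu, logvar, kl_weight=10e-6):
        recon_loss = F.mse_loss(recon, target)  # pixel-wise term dominates
        # KL divergence of the posterior N(mu, sigma^2) from the prior N(0, 1)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon_loss + kl_weight * kl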

replies(1): >>39219548 #
3. probably_wrong ◴[] No.39218189[source]
Link for those without Twitter: https://threadreaderapp.com/thread/1753062604292198740.html
4. smusamashah ◴[] No.39218991[source]
Reddit thread by the same person https://www.reddit.com/r/StableDiffusion/comments/1agd5pz/a_...
5. joefourier ◴[] No.39219548[source]
Not necessarily, if the model is trained with an appropriate adversarial loss. The reason VAEs are blurry isn't directly the KL divergence loss term but the L1/L2 reconstruction loss. Since VAEs sample from a Gaussian distribution, a high KL weight makes the latents of different images overlap (pushing them towards a distribution with mean 0 and variance 1, i.e. values mostly in the -1 to 1 range), so the decoder's output tends towards the mean of the possible values in order to minimise the pixel-wise loss.
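
Roughly, the sampling step being described looks like this (a hedged sketch with assumed names, not SD's actual implementation); the stronger the KL weight, the more each image's Gaussian is pulled towards N(0, 1) and the more different images' latents overlap:

    # Illustrative reparameterized sampling of a VAE latent.
    import torch

    def sample_latent(mu, logvar):
        # Each image is encoded as a Gaussian N(mu, sigma^2) in latent space;
        # the KL term pulls every such Gaussian towards N(0, 1).
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)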

With an appropriate GAN loss, you instead get a plausible, sharp image that differs more and more from the original the more you weight the KL loss term. A classic GAN that samples directly from the normal distribution in fact has the best possible KL divergence and none of the blurriness of a VAE's pixel-based loss.
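
A rough sketch of the combined objective being suggested (assumed weights and names, not any particular model's training code): the adversarial term is what lets the decoder commit to one sharp, plausible image instead of the pixel-wise average:

    # Illustrative reconstruction + KL + adversarial generator objective.
    import torch
    import torch.nn.functional as F

    def generator_loss(recon, target, mu, logvar, discriminator,
                       kl_weight=10e-6, adv_weight=0.5):  # weights are assumptions
        pixel_loss = F.l1_loss(recon, target)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        # Non-saturating GAN loss on the reconstruction: realistic-looking
        # outputs are rewarded, so the decoder picks a sharp plausible image.
        adv = F.softplus(-discriminator(recon)).mean()
        return pixel_loss + kl_weight * kl + adv_weight * adv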