
268 points by prashp | 3 comments
wokwokwok No.39215786
> It's a spot where the VAE is trying to smuggle global information about the image through latent space. This is exactly the problem that KL-divergence loss is supposed to prevent.

Is that what KL divergence does?

I thought it was supposed to (when combined with reconstruction loss) “smooth” the latent space out so that you could interpolate over it.

Doesn’t increasing the weight of the KL term just push the latents toward random noise, i.e. what you’d get if you optimized purely for KL divergence?
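
To make concrete what I mean by "the weight of the KL term", here's roughly the usual β-VAE-style objective (a minimal sketch, not the SD VAE's actual training code; the names are just illustrative):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar, beta=1.0):
    # Reconstruction term: how well the decoder reproduces the input.
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # KL term: pushes each latent posterior N(mu, sigma^2) toward N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # beta scales the KL term relative to reconstruction.
    return recon + beta * kl
```

With a large beta the posterior collapses toward the prior (latents look like noise); with a small beta the encoder is freer to pack information into individual latent positions.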

I honestly have no idea at all what the OP has found or what it means, but it doesn't seem that surprising that modifying the latent results in global changes in the output.

Is manually editing latents a thing?

Surely you would interpolate from another latent…? And if the result is chaos, you don't have well-clustered latents? (Which is what happens from too much KL, not too little, right?)
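
By "interpolate from another latent" I mean something like this, rather than poking individual values by hand (a rough sketch; `encode`/`decode` stand in for whatever VAE you're probing):

```python
import torch

def interpolate(encode, decode, x_a, x_b, steps=8):
    # Encode two images, walk linearly between their latents, and decode each step.
    z_a, z_b = encode(x_a), encode(x_b)
    return [decode((1 - t) * z_a + t * z_b) for t in torch.linspace(0.0, 1.0, steps)]
```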

I'd feel a lot more 'across' this if the OP had demonstrated it on a trivial MNIST VAE, showing the issue, the result, and quantitatively what fixing it does.

> What are the implications?

> Somewhat subtle, but significant.

Mm. I have to say I don't really get it.

replies(3): >>39215830 #>>39215897 #>>39215922 #
1. GaggiX No.39215922
>I honestly have no idea at all what the OP has found or what it means, but it doesn't seem that surprising that modifying the latent results in global changes in the output.

It only happens in one specific spot: https://i.imgur.com/8DSJYPP.png and https://i.imgur.com/WJsWG78.png. The fact that a single spot in the latent has such a huge impact on the whole image is not a good thing, because the diffusion model treats that area the same as the rest of the latent, without giving it any extra importance. The diffusion model's loss is applied at the latent level, not the pixel level (so that the VAE decoder's gradient doesn't have to be propagated during diffusion training), which means the diffusion model is unaware of how important that spot is to the resulting image.
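
To spell that out (a rough sketch, not Stable Diffusion's actual training code): the loss the diffusion model sees is a plain per-element MSE in latent space, so every latent position is weighted equally, including the spot the decoder treats as globally important:

```python
import torch

def latent_diffusion_loss(eps_pred, eps_true):
    # Plain MSE over the predicted noise in latent space. Every latent
    # position contributes equally; the loss never sees decoded pixels,
    # so a spot the VAE decoder uses to carry global image information
    # gets no more weight than any other spot.
    return torch.mean((eps_pred - eps_true) ** 2)
```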

replies(1): >>39216116 #
2. wokwokwok No.39216116
Not arguing that; I'm just saying I don't know that the KL divergence term is what's responsible for this, and I haven't seen any compelling argument that increasing the KL weight would fix it.

There's no question the OP found a legit issue. The questions are more like:

1) What caused it?

2) How do you fix it?

3) What result would fixing it actually have?

replies(1): >>39216928 #
3. No.39216928