268 points prashp
wokwokwok ◴[] No.39215786[source]
> It's a spot where the VAE is trying to smuggle global information about the image through latent space. This is exactly the problem that KL-divergence loss is supposed to prevent.

Is that what KL divergence does?

I thought it was supposed to (when combined with reconstruction loss) “smooth” the latent space out so that you could interpolate over it.

Doesn’t increasing the weight of the KL term just result in random output in the latent, i.e. what you’d get if you optimized purely for the KL divergence?

I honestly have no idea at all what the OP has found or what it means, but it doesn’t seem that surprising that modifying the latent results in global changes in the output.

Is manually editing latents a thing?

Surely you would interpolate from another latent…? And if the result is chaos, you don’t have well-clustered latents? (Which is what happens from too much KL, not too little, right?)

I'd feel a lot more 'across' this if the OP had demonstrated it on a trivial MNIST VAE, showing the issue, the fix, and quantitatively what the fix does.
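For reference, the kind of toy experiment I mean is tiny. Something like a beta-weighted VAE loss on MNIST would do; here's a rough PyTorch sketch (the architecture and sizes are just placeholders, not anything from the OP's post), where `beta` is exactly the KL weight being discussed:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyVAE(nn.Module):
        # Minimal MNIST-sized VAE: 784 -> latent_dim -> 784
        def __init__(self, latent_dim=2):
            super().__init__()
            self.enc = nn.Linear(784, 400)
            self.mu = nn.Linear(400, latent_dim)
            self.logvar = nn.Linear(400, latent_dim)
            self.dec = nn.Sequential(
                nn.Linear(latent_dim, 400), nn.ReLU(), nn.Linear(400, 784))

        def forward(self, x):
            h = F.relu(self.enc(x))
            mu, logvar = self.mu(h), self.logvar(h)
            # reparameterisation trick: sample z ~ N(mu, exp(logvar))
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
            return self.dec(z), mu, logvar

    def vae_loss(logits, x, mu, logvar, beta=1.0):
        # reconstruction term + beta * KL(q(z|x) || N(0, I))
        rec = F.binary_cross_entropy_with_logits(logits, x, reduction='sum')
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return rec + beta * kl

Sweeping `beta` between, say, 0.1 and 10 on a setup like that is enough to see both failure modes: too high and the latents collapse toward the prior (the "random output" case above), too low and the encoder is free to stuff arbitrary information into the latent, which is the smuggling the OP is describing.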

> What are the implications?

> Somewhat subtle, but significant.

Mm. I have to say I don't really get it.

replies(3): >>39215830 #>>39215897 #>>39215922 #
ants_everywhere ◴[] No.39215897[source]
I can't comment on what changing the weight of the KL divergence does in this context, but generally

> Is that what KL divergence does?

KL divergence is basically a distance "metric" in the space of probability distributions. If you have two probability distributions A and B, you can ask how similar they are. "Metric" is in scare quotes because it isn't actually a distance function in the usual sense; for example, KL(A, B) != KL(B, A).
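A quick way to see the asymmetry (a throwaway NumPy sketch with two made-up discrete distributions, nothing from the post):

    import numpy as np

    # two arbitrary discrete distributions over 3 outcomes
    p = np.array([0.7, 0.2, 0.1])
    q = np.array([0.4, 0.4, 0.2])

    def kl(a, b):
        # KL(a || b) = sum_i a_i * log(a_i / b_i), in nats
        return np.sum(a * np.log(a / b))

    print(kl(p, q))  # ~0.18 nats
    print(kl(q, p))  # ~0.19 nats -- different, so KL is not symmetric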

If you think of a distribution as giving information about things, then the distance function should say two distributions are close if they provide similar information and distant if one provides more information about something than the other.

The comment claims (and I assume they know what they're talking about) that after training we want the encoder's distribution to be close to a standard Gaussian, i.e. its KL divergence from the standard Gaussian should be small. That would mean our distribution gives roughly the same information as a standard Gaussian. It sounds like this distribution has a whole lot of information concentrated in one heavily localized area, though (or maybe too little information in that area, I'm not sure which way it goes).

replies(2): >>39216054 #>>39216114 #
317070 ◴[] No.39216114[source]
Almost. The KL tells you how much additional information/entropy a random sample from your distribution carries relative to the target distribution.

Here, the target distribution is defined as the unit Gaussian, and that is taken as the point of zero information (the prior). The KL between the output of the encoder and the prior tells us how much information can flow from the encoder to the decoder. You don't want the KL to be zero, but you usually want it fairly close to zero.

You can think of the KL as (roughly) the number of bits you are allowing the image to be compressed into.
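For the usual diagonal-Gaussian encoder output against a unit-Gaussian prior the KL has a closed form, and dividing by ln 2 converts nats to bits, which is where the "bits to compress your image into" reading comes from. Quick sketch with made-up numbers (not from the OP's post):

    import numpy as np

    def kl_to_unit_gaussian_bits(mu, logvar):
        # KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dims, in bits
        kl_nats = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar), axis=-1)
        return kl_nats / np.log(2)

    # e.g. a 4-dim latent whose mean has drifted far from 0 in one spot
    mu = np.array([0.1, -0.2, 8.0, 0.3])        # one "hot" dimension
    logvar = np.zeros(4)                         # unit variance
    print(kl_to_unit_gaussian_bits(mu, logvar))  # ~46 bits, almost all from the 8.0 entry

A single outlier dimension like that dominates the bit budget, which is the kind of hot spot the OP is pointing at.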