The VAE Used for Stable Diffusion Is Flawed

Another example is when people realized that SD v1.5 couldn't generate images that were very dark or very bright. The problem turned out to be that during training even the noisiest step still contains enough signal for the model to detect the mean of the actual image. The schedule is built that way because you can't train an epsilon-objective model on pure Gaussian noise: recovering the image from the predicted noise means dividing by the signal scale, which would be zero. At inference, of course, there is no signal in the first step, so the model reads the mean of its input (zero, since the input is Gaussian noise) and outputs an image with mean 0.
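To get a feel for how much signal is left at the noisiest training step, here is a minimal sketch that computes the terminal signal scale and signal-to-noise ratio of a "scaled linear" beta schedule. The values `beta_start=0.00085`, `beta_end=0.012` and 1000 steps are my assumption of the commonly cited SD v1.x defaults, and the function name `terminal_signal` is made up for this example.

```python
import math

def terminal_signal(beta_start=0.00085, beta_end=0.012, num_steps=1000):
    """Return (sqrt(alpha_bar_T), SNR_T) for a 'scaled linear' beta schedule.

    The default values are assumed SD v1.x settings, not taken from this post.
    """
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    log_alpha_bar = 0.0
    for i in range(num_steps):
        # Betas are linearly spaced in sqrt space, then squared.
        beta = (lo + (hi - lo) * i / (num_steps - 1)) ** 2
        # alpha_bar_T is the product of (1 - beta_t); sum the logs for stability.
        log_alpha_bar += math.log1p(-beta)
    alpha_bar_T = math.exp(log_alpha_bar)
    # x_T = sqrt(alpha_bar_T) * x_0 + sqrt(1 - alpha_bar_T) * noise
    return math.sqrt(alpha_bar_T), alpha_bar_T / (1.0 - alpha_bar_T)

signal_scale, snr = terminal_signal()
print(f"sqrt(alpha_bar_T) = {signal_scale:.4f}, SNR_T = {snr:.5f}")
```

Under these assumptions the signal scale at the last step is small but not zero, so a few percent of the clean image (including its mean) is still mixed into what the model sees as "pure noise" during training.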

It's not uncommon to find major problems with these systems. I remember inspecting the VQGAN used by Dalle Mega (the largest version of Dalle Mini) and discovering that the vast majority of the entries in its codebook had a magnitude very close to zero, which made them completely unusable by the model.
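That kind of check is easy to run on any codebook. Here is a rough sketch, assuming the codebook is available as a list of embedding vectors; `dead_code_fraction`, the threshold, and the toy data are all made up for illustration and are not the actual VQGAN weights.

```python
import math

def dead_code_fraction(codebook, threshold=1e-3):
    """Fraction of codebook rows whose L2 norm falls below `threshold`."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in codebook]
    return sum(n < threshold for n in norms) / len(norms)

# Toy stand-in, not real model weights: 3 near-zero rows and 1 usable row.
toy_codebook = [
    [1e-5, -2e-5, 0.0],
    [0.0, 1e-6, 1e-6],
    [-3e-5, 0.0, 2e-5],
    [0.8, -0.4, 0.3],
]
print(dead_code_fraction(toy_codebook))
```

The threshold is arbitrary here. On a real codebook it's probably better to look at the full histogram of norms before picking a cutoff.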