>We introduce a new Autoencoder (AE) that aggressively increases the scaling factor to 32. Compared with AE-F8, our AE-F32 outputs 16× fewer latent tokens,
Basically, they compress the images more aggressively before generation and decompress them afterwards, so the generation step needs less computation. On the flip side, a smaller latent should mean less variability in the output.
Isn't this more of a design trade-off than an optimization?
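For reference, the 16× figure in the quote follows from the scaling factors alone: going from 8 to 32 shrinks each spatial side by 4×, so the token count drops by 4² = 16×. A rough sketch, assuming a square image and one token per latent position (both my assumptions, as are the helper name `latent_tokens` and the 1024px input):

```python
def latent_tokens(image_size: int, downsample: int) -> int:
    """Number of latent positions for a square image (hypothetical helper).

    Assumes one token per latent position, which the quote does not state.
    """
    side = image_size // downsample  # latent side length after downsampling
    return side * side


# 1024 px is an illustrative input size, not taken from the quote.
tokens_f8 = latent_tokens(1024, 8)    # AE-F8
tokens_f32 = latent_tokens(1024, 32)  # AE-F32
print(tokens_f8, tokens_f32, tokens_f8 // tokens_f32)
```

The ratio does not depend on the image size, only on the two scaling factors.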
replies(1):