(nvlabs.github.io)

221 points Vt71fcAqt7 | 2 comments | 16 Oct 24 14:56 UTC

1. cpldcpu [16 Oct 24 22:59 UTC] No.41864743

>We introduce a new Autoencoder (AE) that aggressively increases the scaling factor to 32. Compared with AE-F8, our AE-F32 outputs 16× fewer latent tokens,

Basically they compress/decompress the images more, which means they need less computation during generation. But on the flip side this should mean less variability.

Isn't this more of a design trade-off than an optimization?
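To make the "16× fewer latent tokens" figure concrete, here is a minimal sketch of the arithmetic. It assumes one token per latent position and a hypothetical 1024×1024 input; neither detail comes from the quoted text, and the function name is made up for illustration.

```python
def latent_tokens(height: int, width: int, factor: int) -> int:
    """Count latent positions when each image side is downsampled by `factor`.

    Assumption: one token per latent position (no extra patching step).
    """
    return (height // factor) * (width // factor)


# Hypothetical 1024x1024 input, chosen only for illustration.
f8 = latent_tokens(1024, 1024, 8)    # 128 * 128 positions
f32 = latent_tokens(1024, 1024, 32)  # 32 * 32 positions
ratio = f8 // f32                    # (32 / 8) ** 2

print(f8, f32, ratio)
```

Going from a factor of 8 to 32 shrinks each side by 4×, so the token count drops by 4² = 16×, which matches the quoted number.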

replies(1): >>41865236 #

2. Lerc [17 Oct 24 00:23 UTC] No.41865236

>>41864743 (TP) #

It might not be compressing more (haven't yet looked at the paper). You can have fewer but larger tokens for the same amount of data.

It would decrease the workload by having fewer things to compare, balanced against the workload per comparison. For normal N² attention that makes sense, but the page says:

>We introduce a new linear DiT, replacing vanilla quadratic attention and reducing complexity from O(N²) to O(N) Mix-FFN

So not sure what's up there.
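A rough sketch of why the token count matters differently under the two complexity classes. The cost model is deliberately crude (it counts only the N² or N term and ignores token width and constants), and the token counts are hypothetical values for a 1024×1024 image at 8× and 32× downsampling, not numbers from the page.

```python
def attention_cost(n_tokens: int, kind: str) -> int:
    """Toy cost model: only the dependence on token count is kept."""
    if kind == "quadratic":
        return n_tokens ** 2  # every token compared with every other token
    if kind == "linear":
        return n_tokens       # cost proportional to the number of tokens
    raise ValueError(f"unknown attention kind: {kind}")


# Hypothetical token counts (assumed, for illustration only).
n_f8, n_f32 = 16384, 1024

quad_saving = attention_cost(n_f8, "quadratic") // attention_cost(n_f32, "quadratic")
lin_saving = attention_cost(n_f8, "linear") // attention_cost(n_f32, "linear")

print(quad_saving, lin_saving)
```

Under this toy model, 16× fewer tokens cuts quadratic attention cost by 256× but linear attention cost by only 16×, so the two changes overlap rather than multiply in the way they would with vanilla attention.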


Efficient high-resolution image synthesis with linear diffusion transformer