Efficient high-resolution image synthesis with linear diffusion transformer

(nvlabs.github.io)

221 points Vt71fcAqt7 | 1 comments | 16 Oct 24 14:56 UTC

cube2222 ◴[16 Oct 24 17:58 UTC] No.41861846[source]▶

This looks like quite a huge breakthrough, unless I'm missing something?

~25x faster performance than Flux-dev, while offering comparable quality in benchmarks. And visually the examples (surely cherry-picked, but still) look great!

Especially since with GenAI the best way to get good results is to just generate a large amount of them and pick the best (imo). Performance like this will make that much easier/faster/cheaper.

Code is unfortunately "(Coming soon)" for now. Can't wait to play with it!

replies(4): >>41861942 #>>41863225 #>>41864501 #>>41865018 #

Archit3ch ◴[16 Oct 24 20:02 UTC] No.41863225[source]▶

>>41861846 #

If you generate 25x more images, you can afford to cherry-pick.

replies(2): >>41863739 #>>41864455 #

cube2222 ◴[16 Oct 24 20:50 UTC] No.41863739[source]▶

>>41863225 #

It would be interesting to have benchmarks that take this into account (maybe they already do or I’m misunderstanding how those benchmarks work). I.e. when comparing quality between two different models of vastly different performance, you could be doing best-of-n in the faster model.

replies(1): >>41863919 #

Vt71fcAqt7 ◴[16 Oct 24 21:14 UTC] No.41863919[source]▶

>>41863739 #

That sounds like it could be an interesting metric. Worth noting that there is a difference between an algorithmic "best of n" selection (via e.g. an FID score) vs. manual cherry-picking, which takes more factors into account, such as user preference, and also takes time to evaluate, which is what GP was suggesting.

replies(2): >>41864044 #>>41869835 #

psb217 ◴[17 Oct 24 14:08 UTC] No.41869835[source]▶

>>41863919 #

This is a bit pedantic, but FID score wouldn't really be a viable metric for best of n selection since it's a metric that's only computable for distributions of samples. FID score is also pretty high variance for small sample sizes, so you need a lot of samples to compute a meaningful FID score.
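For reference, the usual definition of FID makes this concrete. It compares the mean and covariance of Inception features over a set of real images and a set of generated images, so a single image gives no covariance to estimate. The symbols here are the conventional ones, not taken from the linked paper: \mu_r, \Sigma_r are the feature mean and covariance of the real set, and \mu_g, \Sigma_g those of the generated set.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```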

Better metrics (assuming goal is text->image) would be some sort of inception score or CLIP-based text matching score. These metrics are computable on single samples.
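A minimal sketch of what best-of-n selection with a per-sample score could look like. Everything here is a hypothetical stand-in: `toy_generate` plays the role of the image model and `toy_score` plays the role of a text-image matching score (such as a CLIP-based one); neither calls a real model or library.

```python
import random
from typing import Callable, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str, random.Random], str],
    score: Callable[[str, str], float],
    n: int,
    seed: int = 0,
) -> Tuple[float, str]:
    """Draw n candidates and return (score, sample) for the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])


# Hypothetical stand-in for a generator: emits 4 random words instead of an image.
VOCAB = ["a", "red", "blue", "cat", "dog", "on", "the", "mat", "moon"]


def toy_generate(prompt: str, rng: random.Random) -> str:
    return " ".join(rng.sample(VOCAB, 4))


# Hypothetical stand-in for a per-sample text-matching score:
# Jaccard overlap between prompt words and sample words, in [0, 1].
def toy_score(prompt: str, sample: str) -> float:
    p, s = set(prompt.split()), set(sample.split())
    return len(p & s) / len(p | s)


if __name__ == "__main__":
    prompt = "a red cat on the mat"
    # With the same seed, the n=25 run includes the n=1 candidate,
    # so its best score can never be lower.
    print(best_of_n(prompt, toy_generate, toy_score, n=1))
    print(best_of_n(prompt, toy_generate, toy_score, n=25))
```

For a compute-matched comparison of the kind discussed upthread, n for the faster model would be set to roughly the speed ratio (e.g. 25), while the slower model is scored on a single sample.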
