220 points by Vt71fcAqt7 | 19 comments
1. cube2222 ◴[] No.41861846[source]
This looks like quite a huge breakthrough, unless I'm missing something?

~25x faster performance than Flux-dev, while offering comparable quality in benchmarks. And visually the examples (surely cherry-picked, but still) look great!

Especially since with GenAI the best way to get good results is to just generate a large number of images and pick the best (imo). Performance like this will make that much easier/faster/cheaper.

Code is unfortunately "(Coming soon)" for now. Can't wait to play with it!

replies(4): >>41861942 #>>41863225 #>>41864501 #>>41865018 #
2. liuliu ◴[] No.41861942[source]
If you read the benchmarks more closely, it seems to be slightly worse than FLUX [dev] on prompt adherence and quality. However, it's best to evaluate the results oneself, and the track record of PixArt Sigma (from the same authors?) is pretty good!
3. Archit3ch ◴[] No.41863225[source]
If you generate 25x more images, you can afford to cherry-pick.
replies(2): >>41863739 #>>41864455 #
4. cube2222 ◴[] No.41863739[source]
It would be interesting to have benchmarks that take this into account (maybe they already do, or I'm misunderstanding how those benchmarks work). That is, when comparing quality between two models of vastly different performance, you could do best-of-n with the faster model.
replies(1): >>41863919 #
5. Vt71fcAqt7 ◴[] No.41863919{3}[source]
That sounds like it could be an interesting metric. Worth noting that there is a difference between algorithmic "best of n" selection (e.g. via an FID score) and manual cherry-picking, which takes more factors into account, such as user preference, and also takes time to evaluate; the latter is what GP was suggesting.
replies(2): >>41864044 #>>41869835 #
6. cube2222 ◴[] No.41864044{4}[source]
Yeah, I'd likely just pick the best-scoring one (that is, the pick is made by the evaluation tool, not the model) to simulate "whatever the receiver deemed best for what they wanted".
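
For illustration, a minimal sketch of what that could look like; generate and score here are hypothetical stand-ins for the model and the evaluation tool, not anything from the paper:

  # Hypothetical best-of-n benchmarking: the scoring tool, not the
  # model, picks the winner among n candidates from the faster model.
  def best_of_n(prompt, generate, score, n=25):
      candidates = [generate(prompt) for _ in range(n)]
      return max(candidates, key=lambda img: score(prompt, img))
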
7. Lerc ◴[] No.41864455[source]
That transfers computer time to user time. It's great when you want variations, less so when you want precision and consistency. Picking the best image tires the brain quite quickly; you have to weigh at-a-glance quality without letting it override detail quality.

I'd be curious to see how a vision model would do if it were finetuned to select the image that best matches given criteria.

It's possible that you could do O1-style training to build a final-stage auto-cherry-picker.

8. Lerc ◴[] No.41864501[source]
>This looks like quite a huge breakthrough, unless I'm missing something?

Looking at their methodology, it seems like it's more of an accumulation of existing good ideas into one model.

If it performs as well as they say, perhaps you can say the breakthrough is discovering just how much can be gained by combining recent advances.

It's sitting on just the edge of sounding too good to be true to me. I will certainly be pleased if it holds up to scrutiny.

9. godelski ◴[] No.41865018[source]

  > surely cherry-picked
As someone who works in generative vision, this is one of the most frustrating aspects (especially for those with fewer GPU resources). There's been a silent competition to show the best images rather than random results (and even when "random" results are shown, they may come from a selected batch). So it's hard to judge actual quality until you can play around with the model yourself.

Also, I'm not sure what laptop that is, but they say 0.37s to generate a 1024x1024 image on a 4090, and they mention that it requires 16GB of VRAM. That laptop looks like an MSI Titan, which has a mobile 4090, and correct me if I'm wrong, but I think the 4090 is the only mobile card with 16GB?[0] (I know most desktop cards have 16GB.) The laptop demo takes 4s to generate a 1024x1024 image, but the laptop 4090 is chopped down quite a bit compared to the desktop card.[1]

I wonder if that's with or without TensorRT.

[0] https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_proces...

[1] https://gpu.userbenchmark.com/Compare/Nvidia-RTX-4090-Laptop...

replies(3): >>41865131 #>>41867104 #>>41868207 #
10. zamadatix ◴[] No.41865131[source]
The GeForce RTX 3080 Mobile and GeForce RTX 3080 Ti Mobile also have 16 GB versions as noted directly above the linked section on [0].
replies(1): >>41865443 #
11. godelski ◴[] No.41865443{3}[source]
Thanks! I forgot about that (usually mobile cards have less VRAM, not more, lol). I don't necessarily doubt the paper's generation claim, but of course there are many unstated details that would clarify what that number actually represents.
12. bemmu ◴[] No.41867104[source]
0.37s is only ~11x away from realtime 30fps (a frame every 1/30 ≈ 0.033s, and 0.37/0.033 ≈ 11). I wonder if that will enable some cool new popular application for it besides batch image generation.
replies(1): >>41867207 #
13. godelski ◴[] No.41867207{3}[source]
You can do much, much better with GANs at that resolution. I'm sure you could combine the two for upsampling.
14. noduerme ◴[] No.41868207[source]
Truthfully, I've had astonishing results from Stable Diffusion 1.4 on an M1 Mac, given the right inputs... enough to throw my hands up and declare it a sort of magic (except for the Getty Images watermarks randomly scattered around my results).

Nonetheless, as an art director, there's nothing here I'd put into production. I guess that's because what I'm focused on is tickling the client base with something original.

replies(2): >>41870367 #>>41871491 #
15. psb217 ◴[] No.41869835{4}[source]
This is a bit pedantic, but FID wouldn't really be a viable metric for best-of-n selection, since it's only computable over distributions of samples. FID is also pretty high-variance for small sample sizes, so you need a lot of samples to compute a meaningful score.

Better metrics (assuming the goal is text->image) would be some sort of inception score or a CLIP-based text-matching score. These metrics are computable on single samples.
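
As a rough sketch of the CLIP-based approach (model choice and exact preprocessing are assumptions on my part, using the Hugging Face transformers API):

  # Per-sample CLIP text-image matching score, usable for best-of-n.
  import torch
  from transformers import CLIPModel, CLIPProcessor

  model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
  processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

  def clip_score(prompt, image):  # image is a PIL.Image
      inputs = processor(text=[prompt], images=image,
                         return_tensors="pt", padding=True)
      with torch.no_grad():
          out = model(**inputs)
      # Scaled cosine similarity between text and image embeddings.
      return out.logits_per_image.item()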

16. sdenton4 ◴[] No.41870367{3}[source]
Maybe add 'watermark' to the negative prompt?
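
For example, with the diffusers library something like this should work (the model id and prompts are illustrative, not the commenter's actual setup):

  # Sketch: suppressing watermarks via a negative prompt in diffusers.
  from diffusers import StableDiffusionPipeline

  pipe = StableDiffusionPipeline.from_pretrained(
      "CompVis/stable-diffusion-v1-4")
  image = pipe(prompt="a lifelike 3D-rendered scene",
               negative_prompt="watermark, text, logo").images[0]
  image.save("out.png")
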
replies(1): >>41876730 #
17. godelski ◴[] No.41871491{3}[source]
Magic in what way? They sure are impressive tools, but like all AI, they don't have an eye for finer detail. Oddly, I'm really not sure most ML researchers have an eye for this either. But then again, most people I know who work in generative vision have no artistic hobby, so I'm not sure how they feel they can properly evaluate works. It's the subtle details that matter.
replies(1): >>41876702 #
18. noduerme ◴[] No.41876702{4}[source]
I should've been clearer, really. What made me feel the "magic" was not prompting Stable Diffusion. It was letting it iterate on art I had already done.

I did a lot of 3D-rendered illustration back in the 1990s and early 2000s, necessarily low-polygon stuff, but things that were supposed to be life-like scenes, with tons of textures, that took a very long time to render, including what may have been the first and only children's book illustrated with Infini-D on a Mac IIsi.

So, feeding these old renderings into Stable Diffusion with a 75% bias toward the original image and a couple of basic prompts produced results that blew my mind. It was like seeing what my illustrations could have been if I'd had a team at Pixar Studios refining them: still my character art and my creation, totally recognizable, but polished and refined to a level that would have been unimaginable in 1997.
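
In diffusers terms, that workflow is roughly img2img with a low denoising strength; a sketch, where the file names and the reading of "75% bias" as strength=0.25 are my assumptions, not the commenter's actual tool or settings:

  # Sketch: img2img refinement that stays ~75% faithful to the input.
  # strength controls how much of the image is re-noised and redrawn.
  from diffusers import StableDiffusionImg2ImgPipeline
  from PIL import Image

  pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
      "CompVis/stable-diffusion-v1-4")
  init = Image.open("old_render.png").convert("RGB")
  refined = pipe(prompt="polished, detailed character illustration",
                 image=init, strength=0.25).images[0]
  refined.save("refined.png")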

19. noduerme ◴[] No.41876730{4}[source]
I prefer to know if I've stolen someone's art...? Taking the un-watermarked parts of a stolen image is kinda sticking your head in the sand. It's recycled, stolen art. I'd never use it in production, even if only to spare my clients potential copyright liability issues down the road. But it's still fantastic as a visualization tool.