Most active commenters

vunderba(6)
echelon(3)
(3)
kouteiheika(3)

Popular/hot comments

>>46175104 #
>>46177028 #

←back to thread

Z-Image: Powerful and highly efficient image generation model with 6B parameters

(github.com)

1. vunderba ◴[06 Dec 25 17:36 UTC] No.46175068[source]▶

>>46095817 (OP) #

I've done some preliminary testing with Z-Image Turbo in the past week.

Thoughts

- It's fast (~3 seconds on my RTX 4090)

- Surprisingly capable of maintaining image integrity even at high resolutions (1536x1024, sometimes 2048x2048)

- The adherence is impressive for a 6B parameter model

Some tests (2 / 4 passed):

https://imgpb.com/exMoQ

Personally I find it works better as a refiner model downstream of Qwen-Image 20b which has significantly better prompt understanding but has an unnatural "smoothness" to its generated images.

replies(6): >>46175104 #>>46175331 #>>46177028 #>>46177043 #>>46177543 #>>46178707 #

2. echelon ◴[06 Dec 25 17:41 UTC] No.46175104[source]▶

>>46175068 (TP) #

So does this finally replace SDXL?

Is Flux 1/2/Kontext left in the dust by the Z Image and Qwen combo?

replies(3): >>46175236 #>>46175387 #>>46178341 #

3. tripplyons ◴[06 Dec 25 18:00 UTC] No.46175236[source]▶

>>46175104 #

SDXL has been outclassed for a while, especially since Flux came out.

replies(2): >>46175257 #>>46177243 #

4. aeon_ai ◴[06 Dec 25 18:04 UTC] No.46175257{3}[source]▶

>>46175236 #

Subjective. Most in creative industries regularly still use SDXL.

Once Z-image base comes out and some real tuning can be done, I think it has a chance of replacing it for the function SDXL has

replies(2): >>46176045 #>>46178832 #

5. amrrs ◴[06 Dec 25 18:15 UTC] No.46175331[source]▶

>>46175068 (TP) #

On fal, it takes less than a second many times.

https://fal.ai/models/fal-ai/z-image/turbo/api

Couple that with the LoRA, in about 3 seconds you can generate completely personalized images.

The speed alone is a big factor but if you put the model side by side with seedream and nanobanana and other models it's definitely in the top 5 and that's killer combo imho.

replies(1): >>46177047 #

6. vunderba ◴[06 Dec 25 18:21 UTC] No.46175387[source]▶

>>46175104 #

Yeah, I've definitely switched largely away from Flux. Much as I do like Flux (for prompt adherency), BFL's baffling licensing structure along with its excessive censorship makes it a noop.

For ref, the Porcupine-cone creature that ZiT couldn't handle by itself in my aforementioned test was easily handled using a Qwen20b + ZiT refiner workflow and even with two separate models STILL runs faster than Flux2 [dev].

https://imgur.com/a/5qYP0Vc

7. Scrapemist ◴[06 Dec 25 19:43 UTC] No.46176045{4}[source]▶

>>46175257 #

Source?

replies(1): >>46177285 #

8. tarruda ◴[06 Dec 25 22:12 UTC] No.46177028[source]▶

>>46175068 (TP) #

> It's fast (~3 seconds on my RTX 4090)

It is amazing how far behind Apple Silicon is when it comes to use non- language models.

Using the reference code from Z-image on my M1 ultra, it takes 8 seconds per step. Over a minute for the default of 9 steps.

replies(3): >>46177803 #>>46180602 #>>46183922 #

9. nialv7 ◴[06 Dec 25 22:15 UTC] No.46177043[source]▶

>>46175068 (TP) #

China really is keeping the open weight/source AI scene alive. If in five years a consumer GPU market still exists it would be because of them.

replies(1): >>46177814 #

10. venusenvy47 ◴[06 Dec 25 22:15 UTC] No.46177047[source]▶

>>46175331 #

I don't know anything about paying for these services, and as a beginner, I worry about running up a huge bill. Do they let you set a limit on how much you pay? I see their pricing examples, but I've never tried one of these.

https://fal.ai/pricing

replies(2): >>46177422 #>>46179998 #

11. ◴[06 Dec 25 22:41 UTC] No.46177243{3}[source]▶

>>46175236 #

12. echelon ◴[06 Dec 25 22:46 UTC] No.46177285{5}[source]▶

>>46176045 #

Most of the people I know doing local AI prefer SDXL to Flux. Lots of people are still using SDXL, even today.

Flux has largely been met with a collective yawn.

The only thing Flux had going for it was photorealism and prompt adherence. But the skin and jaws of the humans it generated looked weird, it was difficult to fine tune, and the licensing was weird. Furthermore, Flux never had good aesthetics. It always felt plain.

Nobody doing anime or cartoons used Flux. SDXL continues to shine here. People doing photoreal kept using Midjourney.

replies(1): >>46178815 #

13. tethys ◴[06 Dec 25 23:07 UTC] No.46177422{3}[source]▶

>>46177047 #

It works with prepaid credits, so there should be no risk. Minimum credit amount is $10, though.

replies(1): >>46177695 #

14. soontimes ◴[06 Dec 25 23:21 UTC] No.46177543[source]▶

>>46175068 (TP) #

If that’s your website please check GitHub link - it has a typo (gitub) and goes to a malicious site

replies(2): >>46177568 #>>46177584 #

15. ◴[06 Dec 25 23:25 UTC] No.46177568[source]▶

>>46177543 #

16. vunderba ◴[06 Dec 25 23:28 UTC] No.46177584[source]▶

>>46177543 #

Thanks for the heads up. I just checked the site through several browsers and proxying through a VPN. There's no typo and it properly links to:

https://github.com/Tongyi-MAI/Z-Image

Screenshot of site with network tools open to indicate link

https://imgur.com/a/FZDz0K2

EDIT: It's possible that this issue might have existed in an old cached version. I'll purge the cache just to make sure.

replies(1): >>46177640 #

17. rprwhite ◴[06 Dec 25 23:37 UTC] No.46177640{3}[source]▶

>>46177584 #

The link with the typo is in the footer.

replies(1): >>46177659 #

18. vunderba ◴[06 Dec 25 23:40 UTC] No.46177659{4}[source]▶

>>46177640 #

Well holy crap - that's been there for about forever! I need a "domain name" spellchecker built into my Gulp CI/CD flow.

EDIT: Fixed! Thanks soontimes and rprwhite!

19. vunderba ◴[06 Dec 25 23:48 UTC] No.46177695{4}[source]▶

>>46177422 #

This. You can also run most (if not all) of the models that Fal.ai directly from the playground tab including Z-Image Turbo.

https://fal.ai/models/fal-ai/z-image/turbo

20. p-e-w ◴[06 Dec 25 23:57 UTC] No.46177803[source]▶

>>46177028 #

The diffusion process is usually compute-bound, while transformer inference is memory-bound.

Apple Silicon is comparable in memory bandwidth to mid-range GPUs, but it’s light years behind on compute.

replies(1): >>46178177 #

21. p-e-w ◴[06 Dec 25 23:59 UTC] No.46177814[source]▶

>>46177043 #

Pretty sure the consumer GPU market mostly exists because of games, which has nothing to do with China or AI.

replies(2): >>46178443 #>>46178968 #

22. tarruda ◴[07 Dec 25 00:47 UTC] No.46178177{3}[source]▶

>>46177803 #

> but it’s light years behind on compute.

Is that the only factor though? I wonder if pytorch is lacking optimization for the MPS backend.

replies(1): >>46180929 #

23. mythz ◴[07 Dec 25 01:16 UTC] No.46178341[source]▶

>>46175104 #

SDXL has long been surpassed, it's primary redeeming feature is fine tuned variants for different focus and image styles.

IMO HiDream had the best quality OSS generations, Flux Schnell is decent as well. Will try out Z-Image soon.

24. ◴[07 Dec 25 01:41 UTC] No.46178443{3}[source]▶

>>46177814 #

25. rendaw ◴[07 Dec 25 02:45 UTC] No.46178707[source]▶

>>46175068 (TP) #

That's 2/4? The kitkat bars look nothing like kitkat bars for the most part (logo? splits? white cream filling?). The DNA armor is made from normal metal links.

replies(1): >>46178934 #

26. kouteiheika ◴[07 Dec 25 03:09 UTC] No.46178815{6}[source]▶

>>46177285 #

> it was difficult to fine tune

Yep. It's pretty difficult to fine tune, mostly because it's a distilled model. You can fine tune it a little bit, but it will quickly collapse and start producing garbage, even though fundamentally it should have been an easier architecture to fine-tune compared to SDXL (since it uses the much more modern flow matching paradigm).

I think that's probably the reason why we never really got any good anime Flux models (at least not as good as they were for SDXL). You just don't have enough leeway to be able to train the model for long enough to make the model great for a domain it's currently suboptimal for without completely collapsing it.

replies(2): >>46180111 #>>46186850 #

27. CuriouslyC ◴[07 Dec 25 03:13 UTC] No.46178832{4}[source]▶

>>46175257 #

I don't think that's fair. SDXL is crap at composition. It's really good with LoRAs to stylize/inpaint though.

28. vunderba ◴[07 Dec 25 03:36 UTC] No.46178934[source]▶

>>46178707 #

Fair. Nobody said it was going to surpass Flux.1 Dev (a 12B parameter model) or Qwen-Image (a 20B parameter model) where prompt adherence is strictly concerned.

It's the reason I'm holding off until the Z-Image Base version is released before adding to the official GenAI model comparisons.

But for a 6B model that can generate an image in under 5 seconds, it punches far above its weight class.

As to the passing images, there is white chocolate kit-kat (I know, blasphemy, right?).

29. samus ◴[07 Dec 25 03:42 UTC] No.46178968{3}[source]▶

>>46177814 #

The consumer GPU market is not treated as a primary market by GPU makers anymore. Similar to how Micron went B2B-only.

replies(1): >>46181236 #

30. Bombthecat ◴[07 Dec 25 07:48 UTC] No.46179998{3}[source]▶

>>46177047 #

For images I like them: https://runware.ai/ super cheap and super fast, they also support Loras and you can upload your own models.

And you work with credits

replies(1): >>46183028 #

31. magicalhippo ◴[07 Dec 25 08:20 UTC] No.46180111{7}[source]▶

>>46178815 #

> It's pretty difficult to fine tune, mostly because it's a distilled model.

What about being distilled makes it harder to fine-tune?

replies(1): >>46195317 #

32. tails4e ◴[07 Dec 25 10:09 UTC] No.46180602[source]▶

>>46177028 #

I heard last year the potential future of gaming is not rendering but fully AI generated frames. 3 seconds per 'frame' now, it's not hard to believe it could do 60fps in a few short years. It makes it seem more likely such a game could exist. I'm not sure I like the idea, but it seems like it could happen

replies(2): >>46180853 #>>46181090 #

33. wcoenen ◴[07 Dec 25 11:02 UTC] No.46180853{3}[source]▶

>>46180602 #

Increasing the framerate by rendering at a lower resolution + upscaling, or outright generation of extra frames has already been a thing for a few years now. NVidia calls it Deep Learning Super Sampling (DLSS)[1]. AMD's equivalent is called FSR[2].

[1] https://en.wikipedia.org/wiki/Deep_Learning_Super_Sampling

[2] https://en.wikipedia.org/wiki/GPUOpen#FidelityFX_Super_Resol...

34. rfoo ◴[07 Dec 25 11:20 UTC] No.46180929{4}[source]▶

>>46178177 #

This is the only factor. People sometimes perceive Apple's NPU as "fast" and "amazing" which is simply false.

It's just that NVIDIA GPU sucks (relatively) at *single-user* LLM inference and it makes people feel like Apple not so bad.

35. snek_case ◴[07 Dec 25 11:53 UTC] No.46181090{3}[source]▶

>>46180602 #

The problem is going to be how to control those models to produce a universe that's temporally and spatially consistent. Also think of other issues such as networked games, how would you even begin to approach that in this new paradigm? You need multiple models to have a shared representation that includes other players. You need to be able to sync data efficiently across the network.

I get that it's tempting to say "we no longer have to program game engines, hurray", but at the same time, we've already done the work, we already have game engines that are relatively very computationally efficient and predictable. We understand graphics and simulation quite well.

Personally: I think there's an obvious future in using AI tools to generate game content. 3D modelling and animation can be very time consuming. If you could get an AI model to generate animated characters, you could save a lot of time. You could also empower a lot of indie devs who don't have 3D modelers to help them. AI tools to generate large maps, also super valuable. Replacing the game engine itself, I think it's a taller order than people realize, and maybe not actually desirable.

replies(1): >>46181275 #

36. adventured ◴[07 Dec 25 12:24 UTC] No.46181236{4}[source]▶

>>46178968 #

The parent comment of course understands that. Nvidia views the gaming market as an entry threat, a vector from which a competitor can come after their AI GPU market. That's the reason Nvidia won't be looking to exit the gaming scene no matter how large their AI business gets. If done correctly, staying in the gaming GPU market helps to suppress competition.

Exiting the consumer market is likely a mistake by Micron. If China takes that market segment, they'll eventually take the rest, eliminating most of Micron's value. Holding consumer is about keeping entry attacks covered.

replies(1): >>46183145 #

37. adventured ◴[07 Dec 25 12:35 UTC] No.46181275{4}[source]▶

>>46181090 #

20 years out, what will everybody be using routine 10gbps pipes in our homes for?

I'm paying $43 / month for 500mbps at present and there's nothing special about that at all (in the US or globally). What might we finally use 1gbps+ for? Pulling down massive AI-built worlds of entertainment. Movies & TV streaming sure isn't going to challenge our future bandwidth capabilities.

The worlds are built and shared so quickly in the background that with some slight limitations you never notice the world building going on behind the scenes.

The world building doesn't happen locally. Multiple players connect to the same built world that is remote. There will be smaller hobbyist segments that will still world-build locally for numerous reasons (privacy for one).

The worlds can be constructed entirely before they're downloaded. There are good arguments for both approaches (build the entire world then allow it to be accessed, or attempt to world-build as you play). Both will likely be used over the coming decades, for different reasons and at different times (changes in capabilities will unlock new arguments for either as time goes on, with a likely back and forth where one pulls ahead then the other pulls ahead).

38. Bombthecat ◴[07 Dec 25 16:45 UTC] No.46183028{4}[source]▶

>>46179998 #

Why the down vote? Are they scam?

39. CamperBob2 ◴[07 Dec 25 17:03 UTC] No.46183145{5}[source]▶

>>46181236 #

Exiting the consumer market is likely a mistake by Micron.

I actually think their move to shut down the Crucial channel will prove to be a good one. Why? Because we're heading toward a bimodal distribution of outcomes: either the AI bubble won't pop, and it will pay to prioritize the data center customers, or it will pop. In the latter case a consumer/business-facing RAM manufacturer will have to compete with its own surplus/unused product on scales never seen before.

Worst case scenario for Micron/Crucial, all those warehouses full of wafers that Altman has reserved are going to end up back in the normal RAM marketplace anyway. So why not let him foot the bill for fabbing and storing them in the meantime? Seems that the RAM manufacturers are just trying to make the best of a perilous situation.

replies(1): >>46184232 #

40. liuliu ◴[07 Dec 25 18:38 UTC] No.46183922[source]▶

>>46177028 #

Not saying M1 Ultra is great. But you should only get ~8x slow down with proper implementation (such as Draw Things upcoming implementation for Z Image). It should be 2~3 sec per step. On M5 iPad, it is ~6s per step.

41. gunalx ◴[07 Dec 25 19:17 UTC] No.46184232{6}[source]▶

>>46183145 #

But why not just keep the consumer brand until stockpiles empty and blame supply issues until things possibly cool down, or people have forgotten the brand at all.

replies(1): >>46184824 #

42. CamperBob2 ◴[07 Dec 25 20:26 UTC] No.46184824{7}[source]▶

>>46184232 #

I imagine the strategy would get out anyway as soon as retailers tried to place their next round of orders. Might as well get out in front of it with a public announcement. AI make line go up, at least for now.

43. echelon ◴[08 Dec 25 00:15 UTC] No.46186850{7}[source]▶

>>46178815 #

How much would it cost the community to pretrain something with a more modern architecture?

Assuming it was carefully done in stages (more compute) to make sure no mistakes are made?

I suppose we won't need to with the Chinese gifting so much open source recently?

replies(1): >>46195403 #

44. kouteiheika ◴[08 Dec 25 17:47 UTC] No.46195317{8}[source]▶

>>46180111 #

AFAIK a big part of it is that they distilled the guidance into the model.

I'm going to simplify all of this a lot so please bear with me, but normally the equation to denoise an image would look something like this:

    pos = model(latent, positive_prompt_emb)
    neg = model(latent, negative_prompt_emb)
    next_latent = latent + dt * (neg + cfg_scale * (pos - neg))

So what this does - you trigger the model once with a negative prompt (which can be empty) to get the "starting point" for the prediction, and then you run the model again with a positive prompt to get the direction in which you want to go, and then you combine them.

So, for example, let's assume your positive prompt is "dog", and your negative prompt is empty. So triggering the model with your empty prompt with generate a "neutral" latent, and then you nudge it into the direction of your positive prompt, in the direction of a "dog". And you do this for 20 steps, and you get an image of a dog.

Now, for Flux the equation looks like this:

    next_latent = latent + dt * model(latent, positive_prompt_emb)

The guidance here was distilled into the model. It's cheaper to do inference with, but now we can't really train the model too much without destroying this embedded guidance (the model will just forget it and collapse).

There's also an issue of training dynamics. We don't know exactly how they trained their models, so it's impossible for us to jerry rig our training runs in a similar way. And if you don't match the original training dynamics when finetuning it also negatively affects the model.

So you might ask here - what if we just train the model for a really long time - will it be able to recover? And the answer is - yes, but at this point the most of the original model will essentially be overwritten. People actually did this for Flux Schnell, but you need way more resources to pull it off and the results can be disappointing: https://huggingface.co/lodestones/Chroma

replies(1): >>46199925 #

45. kouteiheika ◴[08 Dec 25 17:53 UTC] No.46195403{8}[source]▶

>>46186850 #

> How much would it cost the community to pretrain something with a more modern architecture?

Quite a lot. Search for "Chroma" (which was a partial-ish retraining of Flux Schnell) or Pony (which was a partial-ish retraining of SDXL). You're probably looking at a cost of at least tens of thousands or even hundred of thousands of dollars. Even bigger SDXL community finetunes like bigASP cost thousands.

And it's not only the compute that's the issue. You also need a ton of data. You need a big dataset, with millions of images, and you need it cleaned, filtered, and labeled.

And of course you need someone who knows what they're doing. Training these state-of-art models takes quite a bit of skill, especially since a lot of it is pretty much a black art.

46. magicalhippo ◴[09 Dec 25 00:49 UTC] No.46199925{9}[source]▶

>>46195317 #

Thanks for the extended reply, very illuminating. So the core issue is how they distilled it, ie that they "baked in the offset" so to speak.

I did try Chroma and I was quite disappointed, what I got out looked nowhere near as good as what was advertised. Now I have a better understanding why.

↑