Z-Image: Powerful and highly efficient image generation model with 6B parameters

(github.com)

396 points | doener | 2 comments | 30 Nov 25 11:36 UTC


vunderba ◴[06 Dec 25 17:36 UTC] No.46175068[source]▶

>>46095817 (OP) #

I've done some preliminary testing with Z-Image Turbo in the past week.

Thoughts

- It's fast (~3 seconds on my RTX 4090)

- Surprisingly capable of maintaining image integrity even at high resolutions (1536x1024, sometimes 2048x2048)

- The adherence is impressive for a 6B parameter model

Some tests (2 / 4 passed):

https://imgpb.com/exMoQ

Personally, I find it works better as a refiner model downstream of Qwen-Image 20b, which has significantly better prompt understanding but gives its generated images an unnatural "smoothness".

replies(6): >>46175104 #>>46175331 #>>46177028 #>>46177043 #>>46177543 #>>46178707 #

echelon ◴[06 Dec 25 17:41 UTC] No.46175104[source]▶

>>46175068 #

So does this finally replace SDXL?

Is Flux 1/2/Kontext left in the dust by the Z Image and Qwen combo?

replies(3): >>46175236 #>>46175387 #>>46178341 #

tripplyons ◴[06 Dec 25 18:00 UTC] No.46175236[source]▶

>>46175104 #

SDXL has been outclassed for a while, especially since Flux came out.

replies(2): >>46175257 #>>46177243 #

aeon_ai ◴[06 Dec 25 18:04 UTC] No.46175257{3}[source]▶

>>46175236 #

Subjective. Most in creative industries regularly still use SDXL.

Once Z-image base comes out and some real tuning can be done, I think it has a chance of replacing SDXL in the role SDXL currently fills.

replies(2): >>46176045 #>>46178832 #

Scrapemist ◴[06 Dec 25 19:43 UTC] No.46176045{4}[source]▶

>>46175257 #

Source?

replies(1): >>46177285 #

echelon ◴[06 Dec 25 22:46 UTC] No.46177285{5}[source]▶

>>46176045 #

Most of the people I know doing local AI prefer SDXL to Flux. Lots of people are still using SDXL, even today.

Flux has largely been met with a collective yawn.

The only thing Flux had going for it was photorealism and prompt adherence. But the skin and jaws of the humans it generated looked weird, it was difficult to fine tune, and the licensing was weird. Furthermore, Flux never had good aesthetics. It always felt plain.

Nobody doing anime or cartoons used Flux. SDXL continues to shine here. People doing photoreal kept using Midjourney.

replies(1): >>46178815 #

kouteiheika ◴[07 Dec 25 03:09 UTC] No.46178815{6}[source]▶

>>46177285 #

> it was difficult to fine tune

Yep. It's pretty difficult to fine tune, mostly because it's a distilled model. You can fine tune it a little bit, but it will quickly collapse and start producing garbage, even though fundamentally it should have been an easier architecture to fine-tune compared to SDXL (since it uses the much more modern flow matching paradigm).

I think that's probably the reason why we never really got any good anime Flux models (at least not as good as they were for SDXL). You just don't have enough leeway to be able to train the model for long enough to make the model great for a domain it's currently suboptimal for without completely collapsing it.

replies(2): >>46180111 #>>46186850 #

magicalhippo ◴[07 Dec 25 08:20 UTC] No.46180111{7}[source]▶

>>46178815 #

> It's pretty difficult to fine tune, mostly because it's a distilled model.

What about being distilled makes it harder to fine-tune?

replies(1): >>46195317 #

1. kouteiheika ◴[08 Dec 25 17:47 UTC] No.46195317{8}[source]▶

>>46180111 #

AFAIK a big part of it is that they distilled the guidance into the model.

I'm going to simplify all of this a lot so please bear with me, but normally the equation to denoise an image would look something like this:

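    # Two forward passes per step: one conditioned on each prompt.
    pos = model(latent, positive_prompt_emb)
    neg = model(latent, negative_prompt_emb)
    # Classifier-free guidance: start from neg and push toward pos by cfg_scale.
    next_latent = latent + dt * (neg + cfg_scale * (pos - neg))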

So what this does - you trigger the model once with a negative prompt (which can be empty) to get the "starting point" for the prediction, and then you run the model again with a positive prompt to get the direction in which you want to go, and then you combine them.

So, for example, let's assume your positive prompt is "dog", and your negative prompt is empty. Triggering the model with your empty prompt will generate a "neutral" prediction, and then you nudge it in the direction of your positive prompt, in the direction of a "dog". You do this for 20 steps, and you get an image of a dog.
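To make the loop above concrete, here is a minimal runnable sketch of that guided update. Everything in it is an assumption for illustration: `toy_model` is a hypothetical stand-in for the real network (it just points from the latent toward the prompt embedding), the latents are plain lists of floats, and `cfg_step` / `sample_cfg` are made-up helper names, not any library's API.

```python
def toy_model(latent, prompt_emb):
    # Hypothetical stand-in for the real network: predicts a "velocity"
    # pointing from the current latent toward the prompt embedding.
    return [p - x for x, p in zip(latent, prompt_emb)]


def cfg_step(model, latent, pos_emb, neg_emb, cfg_scale, dt):
    # Two forward passes per step, one per prompt.
    pos = model(latent, pos_emb)
    neg = model(latent, neg_emb)
    # Start from the negative prediction and push toward the positive one.
    guided = [n + cfg_scale * (p - n) for p, n in zip(pos, neg)]
    return [x + dt * g for x, g in zip(latent, guided)]


def sample_cfg(model, latent, pos_emb, neg_emb, cfg_scale=3.0, steps=20):
    # Plain Euler integration over `steps` equal steps.
    dt = 1.0 / steps
    for _ in range(steps):
        latent = cfg_step(model, latent, pos_emb, neg_emb, cfg_scale, dt)
    return latent


if __name__ == "__main__":
    # "dog" embedding vs. an empty (all-zero) negative embedding, both made up.
    print(sample_cfg(toy_model, [0.0, 0.0], [1.0, 2.0], [0.0, 0.0]))
```

Note that with `cfg_scale = 1` the negative prediction cancels out and the update reduces to `latent + dt * pos`; larger values exaggerate the difference between the two predictions.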

Now, for Flux the equation looks like this:

    next_latent = latent + dt * model(latent, positive_prompt_emb)

The guidance here was distilled into the model. It's cheaper to do inference with, but now we can't really train the model too much without destroying this embedded guidance (the model will just forget it and collapse).
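A rough way to see the "cheaper to do inference with" part is to count forward passes. The sketch below is only an illustration under assumptions: `CountingModel` is a hypothetical toy network, and `cfg_step`, `distilled_step`, and `forward_passes` are made-up names. It says nothing about how Flux was actually trained, only that the distilled update calls the model once per step instead of twice.

```python
class CountingModel:
    """Hypothetical toy network that counts its own forward passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, latent, prompt_emb):
        self.calls += 1
        # Toy prediction: point from the latent toward the prompt embedding.
        return [p - x for x, p in zip(latent, prompt_emb)]


def cfg_step(model, latent, pos_emb, neg_emb, cfg_scale, dt):
    # Regular guidance: two model calls, combined outside the model.
    pos = model(latent, pos_emb)
    neg = model(latent, neg_emb)
    return [x + dt * (n + cfg_scale * (p - n))
            for x, p, n in zip(latent, pos, neg)]


def distilled_step(model, latent, pos_emb, dt):
    # Guidance-distilled update: one call, the guidance is assumed to live
    # inside the weights already.
    velocity = model(latent, pos_emb)
    return [x + dt * v for x, v in zip(latent, velocity)]


def forward_passes(use_cfg, steps=20):
    model = CountingModel()
    latent = [0.0, 0.0]
    pos_emb, neg_emb = [1.0, 2.0], [0.0, 0.0]  # made-up embeddings
    dt = 1.0 / steps
    for _ in range(steps):
        if use_cfg:
            latent = cfg_step(model, latent, pos_emb, neg_emb, 3.0, dt)
        else:
            latent = distilled_step(model, latent, pos_emb, dt)
    return model.calls


if __name__ == "__main__":
    print("with CFG:", forward_passes(True))
    print("distilled:", forward_passes(False))
```

The flip side is the one described above: since the combination step no longer exists outside the model, fine-tuning has to preserve whatever the weights learned to do in its place.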

There's also an issue of training dynamics. We don't know exactly how they trained their models, so it's impossible for us to jury-rig our training runs in a similar way. And if you don't match the original training dynamics when finetuning, that also negatively affects the model.

So you might ask here - what if we just train the model for a really long time - will it be able to recover? And the answer is - yes, but at this point most of the original model will essentially be overwritten. People actually did this for Flux Schnell, but you need way more resources to pull it off and the results can be disappointing: https://huggingface.co/lodestones/Chroma

replies(1): >>46199925 #

2. magicalhippo ◴[09 Dec 25 00:49 UTC] No.46199925[source]▶

>>46195317 (TP) #

Thanks for the extended reply, very illuminating. So the core issue is how they distilled it, i.e. that they "baked in the offset", so to speak.

I did try Chroma and I was quite disappointed: what I got out looked nowhere near as good as what was advertised. Now I have a better understanding of why.
