Thoughts
- It's fast (~3 seconds on my RTX 4090)
- Surprisingly capable of maintaining image integrity even at high resolutions (1536x1024, sometimes 2048x2048)
- The prompt adherence is impressive for a 6B-parameter model
Some tests (2 / 4 passed):
Personally, I find it works better as a refiner model downstream of Qwen-Image 20b, which has significantly better prompt understanding but an unnatural "smoothness" to its generated images.
Local AI will eventually be booming. It'll be more configurable, adaptable, hackable. "Free". And private.
Crude APIs can only get you so far.
I'm in favor of intelligent models like Nano Banana over ComfyUI messes (the future is the model, not the node graph).
I still think we need the ability to inject control layers and have full access to the model, because we lose too much utility by not having it.
I think we'll eventually get Nano Banana Pro smarts slimmed down and running on a local machine.
- Illustrating blog posts, articles, etc.
- A creativity tool for kids (and adults; consider memes).
- Generating ads. (Consider artisan production and specialized venues.)
- Generating assets for games and similar, such as backdrops and textures.
Like any tool, it takes a certain skill to use, and the ability to make sense of the results.
It's incredibly clear who the devs assume the target market is.
overall it's fun and impressive. decent results using LoRA. you can achieve good-looking results with as few as 8 inference steps, which takes 15-20 seconds on a Strix Halo. i also created a llama.cpp inference custom node for prompt enhancement which has been helping with overall output quality.
https://fal.ai/models/fal-ai/z-image/turbo/api
Couple that with the LoRA, in about 3 seconds you can generate completely personalized images.
The speed alone is a big factor, but if you put the model side by side with Seedream, Nano Banana, and other models, it's definitely in the top 5, and that's a killer combo imho.
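For anyone who hasn't tried the hosted endpoint, here is a rough sketch of what a call looks like in Python, assuming fal's official client and a typical text-to-image argument schema; the endpoint id comes from the URL above, and the linked API page is authoritative for the exact parameters (including how LoRAs are attached):

# Sketch only: assumes `pip install fal-client` and a FAL_KEY env var for auth.
import fal_client

result = fal_client.subscribe(
    "fal-ai/z-image/turbo",
    arguments={
        "prompt": "candid photo of a hiker on a ridge at golden hour",
        # size / seed / LoRA options vary per endpoint -- check the API page
    },
)

# Most fal image endpoints return a list of hosted image URLs in the payload.
print(result["images"][0]["url"])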
For ref, the Porcupine-cone creature that ZiT couldn't handle by itself in my aforementioned test was easily handled by a Qwen20b + ZiT refiner workflow, and even with two separate models it STILL runs faster than Flux2 [dev].
Roughly speaking the art seems to have three main functions:
1. promote the story to outsiders: this only works with human-made art
2. enhance the story for existing readers: AI helps here, but is contentious
3. motivate and inspire the author: works great with AI. The ease of exploration and pseudo-random permutations in the results are very useful properties here that you don't get from regular art
By now the author even has an agreement with an artist he frequently commissions: he can use that artist's style in AI art in return for a small "royalty" payment for every such image that gets published in one of his stories. A solution driven both by the author's conscience and by the demands of the readers.
Supports MPS (Metal Performance Shaders). Using something that skips Python entirely, along with an MLX- or GGUF-converted model file (if one exists), will likely be even faster.
For various reasons, I doubt there are any large scale SaaS-style providers operating this in production today.
It's not clear to me what you mean either, especially since female models are overwhelmingly more popular in general[1].
[1]: "Female models make up about 70% of the modeling industry workforce worldwide" https://zipdo.co/modeling-industry-statistics/
The people with the time and desire to do something are the ones most likely to do it; this is no brilliant observation.
[1]: https://github.com/Tongyi-MAI/Z-Image?tab=readme-ov-file#-qu...
I'm still curious whether this would run on a MacBook and how long it would take to generate an image. What machine are you using?
If I say “A man”, it’s fine. A black man, no problem. It’s when I add context and instructions that it just seems to want to go with some Chinese man. Which is fine, but I would like to see more variety in the people it’s trained on, to create more diverse images. For non-people it’s amazingly good.
On replicate.com a single image takes 1.5s at a price of 1000 images per $1. Would be interesting to see how quick it is on ComfyUI Cloud.
Overall, running generative models locally on Macs seems like a very poor time investment.
It is amazing how far behind Apple Silicon is when it comes to running non-language models.
Using the reference code from Z-Image on my M1 Ultra, it takes 8 seconds per step. Over a minute for the default of 9 steps.
The bang:buck ratio of Z-Image Turbo is just bonkers.
With simplistic prompts, you quickly conclude that the small model size is the only limitation. Once you realize how good it is with detailed prompts, though, you find that you can get a lot more diversity out of it than you initially thought you could.
Absolute game-changer of a model IMO. It is competitive with Nano Banana Pro in some respects, and that's saying something.
Spending all your time on dates and wives and kids means you're not spending all your time building houses.
- 1.5s to generate an image at 512x512
- 3.5s to generate an image at 1024x1024
- 26s to generate an image at 2048x2048
It uses almost all of the 32 GB of VRAM, with GPU usage close to 100%. I'm using the script from the HF post: https://huggingface.co/Tongyi-MAI/Z-Image-Turbo
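For reference, the script boils down to the standard diffusers text-to-image pattern. A minimal sketch, assuming the repo loads through the generic DiffusionPipeline with remote code enabled; the model card's script is authoritative for the exact class, dtype, and call arguments:

# Minimal sketch, not the exact HF script -- check the model card for the real one.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
# Pick whatever accelerator is available (CUDA on a 4090, MPS on Apple Silicon).
device = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")
pipe = pipe.to(device)

image = pipe(
    prompt="a rusty bridge over a river at dusk, cinematic lighting",
    height=1024,
    width=1024,
    num_inference_steps=8,  # turbo/distilled checkpoints need very few steps
).images[0]
image.save("z_image_turbo.png")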
Z-Image is getting traction because it fits on their tiny GPUs and does porn, sure, but even with more compute Flux 2 [dev] has no place.
Weak world knowledge, worse licensing, and its post-training for JSON prompts ruins the #1 benefit of having a larger LLM backbone.
LLMs already understand JSON, so additional training for JSON feels like a cheaper way to juice prompt adherence than more robust post-training.
And honestly even "full fat" Flux 2 has no great spot: Nano Banana Pro is better if you need strong editing, Seedream 4.5 is better if you need strong generation.
Flux has largely been met with a collective yawn.
The only thing Flux had going for it was photorealism and prompt adherence. But the skin and jaws of the humans it generated looked weird, it was difficult to fine tune, and the licensing was weird. Furthermore, Flux never had good aesthetics. It always felt plain.
Nobody doing anime or cartoons used Flux. SDXL continues to shine here. People doing photoreal kept using Midjourney.
https://github.com/Tongyi-MAI/Z-Image
Screenshot of site with network tools open to indicate link
EDIT: It's possible that this issue might have existed in an old cached version. I'll purge the cache just to make sure.
OTOH these are open-weight models released to the public. We don't get to use more advanced models for free; the free models are likely a byproduct of producing more advanced models anyway. These models can be the freemium tier, or gateway drugs, or a way of torpedoing the competition, if you don't want to believe in the goodwill of their producers.
(I'm not used to using Windows and I don't know how to do anything complicated on that OS. Unfortunately, the computer with the big GPU also runs Windows.)
I do nerdy computer things and I actually build things too; for example, I busted up the limestone in my backyard and put in a patio and a raised garden. Working 16 hours a day coding or otherwise computering isn't that hard, even if your brain is melted at the end of the day. 8-10 hours of physically hard labor and your body starts taking damage if you keep it up too long.
And really, building houses is a terrible example! In the US we've been chronically behind on building millions of units of housing. People complain the processes are terribly slow and there is tons of downtime.
So yea, I don't think your analogy works at all.
If a disproportionate share of users are using image generation for generating attractive women, why is it out of place to put commensurate focus on that use case in demos and other promotional material?
Download the release here
* https://github.com/LostRuins/koboldcpp/releases/tag/v1.103
Download the config file here
* https://huggingface.co/koboldcpp/kcppt/resolve/main/z-image-...
Set the executable bit (+x) on the koboldcpp binary and launch it, select 'Load config' and point it at the config file, then hit 'Launch'.
Wait until the model weights have downloaded and loaded, then open a browser and go to:
* http://localhost:5001/sdui
EDIT: This will work for Linux, Windows and Mac
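Once it's running, you don't have to use the web UI at all; koboldcpp also exposes an A1111-compatible image API on the same port. A rough sketch, assuming the /sdapi/v1/txt2img route and the usual A1111 payload/response shape (double-check against your build's docs):

# Sketch: call the local image endpoint instead of opening /sdui in a browser.
import base64, json, urllib.request

payload = {"prompt": "a lighthouse in a thunderstorm, oil painting", "steps": 8, "width": 1024, "height": 1024}
req = urllib.request.Request(
    "http://localhost:5001/sdapi/v1/txt2img",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    images = json.loads(resp.read())["images"]  # base64-encoded images, per the A1111 schema
with open("out.png", "wb") as f:
    f.write(base64.b64decode(images[0]))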
I tried this prompt on my username: "A painted UFO abducts the graffiti text "Accrual" painted on the side of a rusty bridge."
Fixed that for you: (and adults; consider porn).
I don't think you realize the extent of the “underground” nsfw genai community, which has to rely on open-weight models since API models all have prude filters.
Yep. It's pretty difficult to fine-tune, mostly because it's a distilled model. You can fine-tune it a little bit, but it will quickly collapse and start producing garbage, even though fundamentally it should have been an easier architecture to fine-tune compared to SDXL (since it uses the much more modern flow-matching paradigm).
I think that's probably the reason why we never really got any good anime Flux models (at least not as good as they were for SDXL). You just don't have enough leeway to be able to train the model for long enough to make the model great for a domain it's currently suboptimal for without completely collapsing it.
Not "assume". That's what the target market is. Take a look at civitai and see what kind of images people generate and what LoRAs they train (just be sure to be logged in and disable all of the NSFW filters in the options).
It's the reason I'm holding off until the Z-Image Base version is released before adding it to the official GenAI model comparisons.
But for a 6B model that can generate an image in under 5 seconds, it punches far above its weight class.
As to the passing images, there is a white chocolate Kit Kat (I know, blasphemy, right?).
Some interesting takeaways imo:
- Uses existing model backbones for text encoding & semantic tokens (why reinvent the wheel if you don't need to?)
- Trains on a whole lot of synthetic captions of different lengths, ostensibly generated using some existing vision LLM
- Solid text generation support is facilitated by training on all OCR'd text from the ground truth image. This seems to match how Nano Banana Pro got so good as well; I've seen its thinking tokens sketch out exactly what text to say in the image before it renders.
Agreed, but let’s not confuse what it is. Talking about safety is just "WE WON'T EMBARRASS YOU IF YOU INVEST IN US".
Clearly you have, but while on the topic, it is amazing to me that it only came out 2.5 years ago.
Anything with "most cultures" came out as manga-influenced comic strips with kanji. Useless.
It means it respects nationality choices; if you don't mention one, that's your bad prompting, not the model's failure to default to the nationality you would prefer.
And you work with credits
What about being distilled makes it harder to fine-tune?
[1] https://en.wikipedia.org/wiki/Deep_Learning_Super_Sampling
[2] https://en.wikipedia.org/wiki/GPUOpen#FidelityFX_Super_Resol...
It's just that NVIDIA GPUs suck (relatively) at *single-user* LLM inference, and that makes people feel like Apple isn't so bad.
I get that it's tempting to say "we no longer have to program game engines, hurray", but at the same time, we've already done the work, we already have game engines that are relatively very computationally efficient and predictable. We understand graphics and simulation quite well.
Personally: I think there's an obvious future in using AI tools to generate game content. 3D modelling and animation can be very time consuming. If you could get an AI model to generate animated characters, you could save a lot of time. You could also empower a lot of indie devs who don't have 3D modelers to help them. AI tools to generate large maps, also super valuable. Replacing the game engine itself, I think it's a taller order than people realize, and maybe not actually desirable.
Exiting the consumer market is likely a mistake by Micron. If China takes that market segment, they'll eventually take the rest, eliminating most of Micron's value. Holding consumer is about keeping entry attacks covered.
I'm paying $43/month for 500 Mbps at present and there's nothing special about that at all (in the US or globally). What might we finally use 1 Gbps+ for? Pulling down massive AI-built worlds of entertainment. Movies & TV streaming sure isn't going to challenge our future bandwidth capabilities.
The worlds are built and shared so quickly in the background that with some slight limitations you never notice the world building going on behind the scenes.
The world building doesn't happen locally. Multiple players connect to the same built world that is remote. There will be smaller hobbyist segments that will still world-build locally for numerous reasons (privacy for one).
The worlds can be constructed entirely before they're downloaded. There are good arguments for both approaches (build the entire world then allow it to be accessed, or attempt to world-build as you play). Both will likely be used over the coming decades, for different reasons and at different times (changes in capabilities will unlock new arguments for either as time goes on, with a likely back and forth where one pulls ahead then the other pulls ahead).
And it isn't even relevant. "Most cultures" cannot read any of it. So what's the nitpicking about?
I actually think their move to shut down the Crucial channel will prove to be a good one. Why? Because we're heading toward a bimodal distribution of outcomes: either the AI bubble won't pop, and it will pay to prioritize the data center customers, or it will pop. In the latter case a consumer/business-facing RAM manufacturer will have to compete with its own surplus/unused product on scales never seen before.
Worst case scenario for Micron/Crucial, all those warehouses full of wafers that Altman has reserved are going to end up back in the normal RAM marketplace anyway. So why not let him foot the bill for fabbing and storing them in the meantime? Seems that the RAM manufacturers are just trying to make the best of a perilous situation.
Censoring open-source models really doesn't make a lot of sense for China, which could also be why local DeepSeek instances are relatively easy to jailbreak.
Idk, I just thought it was funny to read the ignorant comment that called the Chinese model useless because it rendered Chinese text while calling it Japanese. The model is trained to render English or Chinese text.
Assuming it was carefully done in stages (more compute) to make sure no mistakes are made?
I suppose we won't need to with the Chinese gifting so much open source recently?
I'm going to simplify all of this a lot so please bear with me, but normally the equation to denoise an image would look something like this:
pos = model(latent, positive_prompt_emb)
neg = model(latent, negative_prompt_emb)
next_latent = latent + dt * (neg + cfg_scale * (pos - neg))
So what this does: you run the model once with a negative prompt (which can be empty) to get the "starting point" for the prediction, then you run it again with a positive prompt to get the direction in which you want to go, and then you combine them. For example, let's assume your positive prompt is "dog" and your negative prompt is empty. Running the model with the empty prompt will generate a "neutral" prediction, and then you nudge it in the direction of your positive prompt, in the direction of a "dog". Do this for 20 steps and you get an image of a dog.
Now, for Flux the equation looks like this:
next_latent = latent + dt * model(latent, positive_prompt_emb)
The guidance here was distilled into the model. It's cheaper to do inference with, but now we can't really train the model much without destroying this embedded guidance (the model will just forget it and collapse). There's also an issue of training dynamics: we don't know exactly how they trained their models, so it's impossible for us to jerry-rig our training runs in a similar way, and if you don't match the original training dynamics when fine-tuning, it also negatively affects the model.
So you might ask here - what if we just train the model for a really long time - will it be able to recover? And the answer is yes, but at that point most of the original model will essentially be overwritten. People actually did this for Flux Schnell, but you need way more resources to pull it off and the results can be disappointing: https://huggingface.co/lodestones/Chroma
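To make the difference concrete, here's the same per-step math as toy Python (the names are illustrative, not any real library's API): classic CFG pays for two forward passes per step and computes the guidance explicitly, while the guidance-distilled model does one pass with the guidance baked into its weights, and that baked-in behavior is exactly what further training tends to erase.

# Toy sketch of the two sampling-step rules described above.
# `model` stands for the diffusion backbone returning a velocity/noise prediction.

def cfg_step(model, latent, pos_emb, neg_emb, dt, cfg_scale):
    # Classic classifier-free guidance: two forward passes per step.
    pos = model(latent, pos_emb)            # prediction conditioned on the prompt
    neg = model(latent, neg_emb)            # "neutral" prediction (empty/negative prompt)
    guided = neg + cfg_scale * (pos - neg)  # push away from neutral, toward the prompt
    return latent + dt * guided

def distilled_step(model, latent, pos_emb, dt):
    # Guidance-distilled (Flux-style): one forward pass, guidance baked into the weights.
    # Cheaper at inference, but fine-tuning too hard erases that baked-in guidance.
    return latent + dt * model(latent, pos_emb)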
Quite a lot. Search for "Chroma" (which was a partial-ish retraining of Flux Schnell) or Pony (which was a partial-ish retraining of SDXL). You're probably looking at a cost of at least tens of thousands or even hundreds of thousands of dollars. Even bigger SDXL community finetunes like bigASP cost thousands.
And it's not only the compute that's the issue. You also need a ton of data. You need a big dataset, with millions of images, and you need it cleaned, filtered, and labeled.
And of course you need someone who knows what they're doing. Training these state-of-the-art models takes quite a bit of skill, especially since a lot of it is pretty much a black art.
I did try Chroma and I was quite disappointed; what I got out looked nowhere near as good as what was advertised. Now I have a better understanding of why.