Thoughts
- It's fast (~3 seconds on my RTX 4090)
- Surprisingly capable of maintaining image integrity even at high resolutions (1536x1024, sometimes 2048x2048)
- The prompt adherence is impressive for a 6B-parameter model
Some tests (2 / 4 passed):
Personally, I find it works better as a refiner model downstream of Qwen-Image 20b, which has significantly better prompt understanding but an unnatural "smoothness" to its generated images.
Local AI will eventually be booming. It'll be more configurable, adaptable, hackable. "Free". And private.
Crude APIs can only get you so far.
I'm in favor of intelligent models like Nano Banana over ComfyUI messes (the future is the model, not the node graph).
I still think we need the ability to inject control layers and have full access to the model, because we lose too much utility by not having it.
I think we'll eventually get Nano Banana Pro smarts slimmed down and running on a local machine.
- Illustrating blog posts, articles, etc.
- A creativity tool for kids (and adults; consider memes).
- Generating ads. (Consider artisan production and specialized venues.)
- Generating assets for games and similar, such as backdrops and textures.
Like any tool, it takes a certain skill to use, and the ability to make sense of the results.
It's incredibly clear who the devs assume the target market is.
overall it's fun and impressive. decent results using LoRA. you can achieve good-looking results with as few as 8 inference steps, which takes 15-20 seconds on a Strix Halo. i also created a llama.cpp inference custom node for prompt enhancement which has been helping with overall output quality.
https://fal.ai/models/fal-ai/z-image/turbo/api
Couple that with the LoRA, in about 3 seconds you can generate completely personalized images.
The speed alone is a big factor, but if you put the model side by side with Seedream, Nano Banana, and other models, it's definitely in the top 5, and that's a killer combo imho.
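For anyone who hasn't tried the hosted endpoint, here is a rough sketch of what a call looks like in Python, assuming fal's official client and a typical text-to-image argument schema; the endpoint id comes from the URL above, and the linked API page is authoritative for the exact parameters (including how LoRAs are attached):

# Sketch only: assumes `pip install fal-client` and a FAL_KEY env var for auth.
import fal_client

result = fal_client.subscribe(
    "fal-ai/z-image/turbo",
    arguments={
        "prompt": "candid photo of a hiker on a ridge at golden hour",
        # size / seed / LoRA options vary per endpoint -- check the API page
    },
)

# Most fal image endpoints return a list of hosted image URLs in the payload.
print(result["images"][0]["url"])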
For ref, the Porcupine-cone creature that ZiT couldn't handle by itself in my aforementioned test was easily handled by a Qwen20b + ZiT refiner workflow, and even with two separate models it STILL runs faster than Flux2 [dev].
Roughly speaking the art seems to have three main functions:
1. promote the story to outsiders: this only works with human-made art
2. enhance the story for existing readers: AI helps here, but is contentious
3. motivate and inspire the author: works great with AI. The ease of exploration and pseudo-random permutations in the results are very useful properties here that you don't get from regular art
By now the author even has an agreement with an artist he frequently commissions: he can use that artist's style in AI art in return for a small "royalty" payment for every such image that gets published in one of his stories. A solution driven both by the author's conscience and by the demands of the readers.
Supports MPS (Metal Performance Shaders). Using something that skips Python entirely, along with an MLX- or GGUF-converted model file (if one exists), will likely be even faster.
For various reasons, I doubt there are any large scale SaaS-style providers operating this in production today.
It's not clear to me what you mean either, especially since female models are overwhelmingly more popular in general[1].
[1]: "Female models make up about 70% of the modeling industry workforce worldwide" https://zipdo.co/modeling-industry-statistics/
The people with the time and desire to do something are the ones most likely to do it; this is no brilliant observation.
[1]: https://github.com/Tongyi-MAI/Z-Image?tab=readme-ov-file#-qu...
I'm still curious whether this would run on a MacBook and how long it would take to generate an image. What machine are you using?
If I say “A man”, it’s fine. A black man, no problem. It’s when I add context and instructions that it just seems to want to go with some Chinese man. Which is fine, but I would like to see more variety in the people it’s trained on, to create more diverse images. For non-people it’s amazingly good.
On replicate.com a single image takes 1.5s at a price of 1000 images per $1. Would be interesting to see how quick it is on ComfyUI Cloud.
Overall, running generative models locally on Macs seems like a very poor time investment.
It is amazing how far behind Apple Silicon is when it comes to running non-language models.
Using the reference code from Z-Image on my M1 Ultra, it takes 8 seconds per step. Over a minute for the default of 9 steps.
The bang:buck ratio of Z-Image Turbo is just bonkers.
With simplistic prompts, you quickly conclude that the small model size is the only limitation. Once you realize how good it is with detailed prompts, though, you find that you can get a lot more diversity out of it than you initially thought you could.
Absolute game-changer of a model IMO. It is competitive with Nano Banana Pro in some respects, and that's saying something.
Spending all your time on dates and wives and kids means you're not spending all your time building houses.
- 1.5s to generate an image at 512x512
- 3.5s to generate an image at 1024x1024
- 26s to generate an image at 2048x2048
It uses almost all of the 32 GB of VRAM, with GPU usage close to 100%. I'm using the script from the HF post: https://huggingface.co/Tongyi-MAI/Z-Image-Turbo
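For reference, the script boils down to the standard diffusers text-to-image pattern. A minimal sketch, assuming the repo loads through the generic DiffusionPipeline with remote code enabled; the model card's script is authoritative for the exact class, dtype, and call arguments:

# Minimal sketch, not the exact HF script -- check the model card for the real one.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
# Pick whatever accelerator is available (CUDA on a 4090, MPS on Apple Silicon).
device = "cuda" if torch.cuda.is_available() else ("mps" if torch.backends.mps.is_available() else "cpu")
pipe = pipe.to(device)

image = pipe(
    prompt="a rusty bridge over a river at dusk, cinematic lighting",
    height=1024,
    width=1024,
    num_inference_steps=8,  # turbo/distilled checkpoints need very few steps
).images[0]
image.save("z_image_turbo.png")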
Z-Image is getting traction because it fits on their tiny GPUs and does porn, sure, but even with more compute Flux 2 [dev] has no place.
Weak world knowledge, worse licensing, and its post-training for JSON prompts ruins the #1 benefit of having a larger LLM backbone.
LLMs already understand JSON, so additional training for JSON feels like a cheaper way to juice prompt adherence than more robust post-training.
And honestly even "full fat" Flux 2 has no great spot: Nano Banana Pro is better if you need strong editing, Seedream 4.5 is better if you need strong generation.
Flux has largely been met with a collective yawn.
The only thing Flux had going for it was photorealism and prompt adherence. But the skin and jaws of the humans it generated looked weird, it was difficult to fine tune, and the licensing was weird. Furthermore, Flux never had good aesthetics. It always felt plain.
Nobody doing anime or cartoons used Flux. SDXL continues to shine here. People doing photoreal kept using Midjourney.
https://github.com/Tongyi-MAI/Z-Image
Screenshot of site with network tools open to indicate link
EDIT: It's possible that this issue might have existed in an old cached version. I'll purge the cache just to make sure.
OTOH these are open-weight models released to the public. We don't get to use more advanced models for free; the free models are likely a byproduct of producing more advanced models anyway. These models can be the freemium tier, or gateway drugs, or a way of torpedoing the competition, if you don't want to believe in the goodwill of their producers.
(I'm not used to using Windows and I don't know how to do anything complicated on that OS. Unfortunately, the computer with the big GPU also runs Windows.)
I do nerdy computer things and I actually build things too; for example, I busted up the limestone in my backyard and put in a patio and a raised garden. Working 16 hours a day coding or otherwise computering isn't that hard, even if your brain is melted at the end of the day. 8-10 hours of physically hard labor and your body starts taking damage if you keep it up too long.
And really, building houses is a terrible example! In the US we've been chronically behind on building millions of units of housing. People complain the processes are terribly slow and there is tons of downtime.
So yea, I don't think your analogy works at all.
If a disproportionate share of users are using image generation for generating attractive women, why is it out of place to put commensurate focus on that use case in demos and other promotional material?
Download the release here
* https://github.com/LostRuins/koboldcpp/releases/tag/v1.103
Download the config file here
* https://huggingface.co/koboldcpp/kcppt/resolve/main/z-image-...
Set the executable bit (+x) on the koboldcpp binary and launch it, select 'Load config' and point it at the config file, then hit 'Launch'.
Wait until the model weights have downloaded and loaded, then open a browser and go to:
* http://localhost:5001/sdui
EDIT: This will work for Linux, Windows and Mac
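Once it's running, you don't have to use the web UI at all; koboldcpp also exposes an A1111-compatible image API on the same port. A rough sketch, assuming the /sdapi/v1/txt2img route and the usual A1111 payload/response shape (double-check against your build's docs):

# Sketch: call the local image endpoint instead of opening /sdui in a browser.
import base64, json, urllib.request

payload = {"prompt": "a lighthouse in a thunderstorm, oil painting", "steps": 8, "width": 1024, "height": 1024}
req = urllib.request.Request(
    "http://localhost:5001/sdapi/v1/txt2img",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    images = json.loads(resp.read())["images"]  # base64-encoded images, per the A1111 schema
with open("out.png", "wb") as f:
    f.write(base64.b64decode(images[0]))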
I tried this prompt on my username: "A painted UFO abducts the graffiti text "Accrual" painted on the side of a rusty bridge."
Fixed that for you: (and adults; consider porn).
I don't think you realize the extent of the “underground” nsfw genai community, which has to rely on open-weight models since API models all have prude filters.
Yep. It's pretty difficult to fine-tune, mostly because it's a distilled model. You can fine-tune it a little bit, but it will quickly collapse and start producing garbage, even though fundamentally it should have been an easier architecture to fine-tune compared to SDXL (since it uses the much more modern flow-matching paradigm).
I think that's probably the reason why we never really got any good anime Flux models (at least not as good as they were for SDXL). You just don't have enough leeway to be able to train the model for long enough to make the model great for a domain it's currently suboptimal for without completely collapsing it.
Not "assume". That's what the target market is. Take a look at civitai and see what kind of images people generate and what LoRAs they train (just be sure to be logged in and disable all of the NSFW filters in the options).
It's the reason I'm holding off until the Z-Image Base version is released before adding it to the official GenAI model comparisons.
But for a 6B model that can generate an image in under 5 seconds, it punches far above its weight class.
As to the passing images, there is a white chocolate Kit Kat (I know, blasphemy, right?).
Some interesting takeaways imo:
- Uses existing model backbones for text encoding & semantic tokens (why reinvent the wheel if you don't need to?)
- Trains on a whole lot of synthetic captions of different lengths, ostensibly generated using some existing vision LLM
- Solid text generation support is facilitated by training on all OCR'd text from the ground truth image. This seems to match how Nano Banana Pro got so good as well; I've seen its thinking tokens sketch out exactly what text to say in the image before it renders.
Agreed, but let’s not confuse what it is. Talking about safety is just "WE WON'T EMBARRASS YOU IF YOU INVEST IN US".
Clearly you have, but while on the topic, it is amazing to me that it only came out 2.5 years ago.
Anything with "most cultures" came out as manga-influenced comic strips with kanji. Useless.
It means it respects nationality choices; if you don't mention one, that's your bad prompting, not the model's failure to default to the nationality you would prefer.
And you work with credits
What about being distilled makes it harder to fine-tune?
[1] https://en.wikipedia.org/wiki/Deep_Learning_Super_Sampling
[2] https://en.wikipedia.org/wiki/GPUOpen#FidelityFX_Super_Resol...
It's just that NVIDIA GPUs suck (relatively) at *single-user* LLM inference, and that makes people feel like Apple isn't so bad.
I get that it's tempting to say "we no longer have to program game engines, hurray", but at the same time, we've already done the work, we already have game engines that are relatively very computationally efficient and predictable. We understand graphics and simulation quite well.
Personally: I think there's an obvious future in using AI tools to generate game content. 3D modelling and animation can be very time consuming. If you could get an AI model to generate animated characters, you could save a lot of time. You could also empower a lot of indie devs who don't have 3D modelers to help them. AI tools to generate large maps, also super valuable. Replacing the game engine itself, I think it's a taller order than people realize, and maybe not actually desirable.
Exiting the consumer market is likely a mistake by Micron. If China takes that market segment, they'll eventually take the rest, eliminating most of Micron's value. Holding consumer is about keeping entry attacks covered.
I'm paying $43/month for 500 Mbps at present and there's nothing special about that at all (in the US or globally). What might we finally use 1 Gbps+ for? Pulling down massive AI-built worlds of entertainment. Movies & TV streaming sure isn't going to challenge our future bandwidth capabilities.
The worlds are built and shared so quickly in the background that with some slight limitations you never notice the world building going on behind the scenes.
The world building doesn't happen locally. Multiple players connect to the same built world that is remote. There will be smaller hobbyist segments that will still world-build locally for numerous reasons (privacy for one).
The worlds can be constructed entirely before they're downloaded. There are good arguments for both approaches (build the entire world then allow it to be accessed, or attempt to world-build as you play). Both will likely be used over the coming decades, for different reasons and at different times (changes in capabilities will unlock new arguments for either as time goes on, with a likely back and forth where one pulls ahead then the other pulls ahead).
And it isn't even relevant. "Most cultures" cannot read any of it. So what's the nitpicking about?
I actually think their move to shut down the Crucial channel will prove to be a good one. Why? Because we're heading toward a bimodal distribution of outcomes: either the AI bubble won't pop, and it will pay to prioritize the data center customers, or it will pop. In the latter case a consumer/business-facing RAM manufacturer will have to compete with its own surplus/unused product on scales never seen before.
Worst case scenario for Micron/Crucial, all those warehouses full of wafers that Altman has reserved are going to end up back in the normal RAM marketplace anyway. So why not let him foot the bill for fabbing and storing them in the meantime? Seems that the RAM manufacturers are just trying to make the best of a perilous situation.
Censoring open-source models really doesn't make a lot of sense for China, which could also be why local DeepSeek instances are relatively easy to jailbreak.
Idk, I just thought it was funny to read the ignorant comment that called the Chinese model useless because it rendered Chinese text while calling it Japanese. The model is trained to render English or Chinese text.
Assuming it was carefully done in stages (more compute) to make sure no mistakes are made?
I suppose we won't need to with the Chinese gifting so much open source recently?
I'm going to simplify all of this a lot so please bear with me, but normally the equation to denoise an image would look something like this:
pos = model(latent, positive_prompt_emb)
neg = model(latent, negative_prompt_emb)
next_latent = latent + dt * (neg + cfg_scale * (pos - neg))
So what this does: you run the model once with a negative prompt (which can be empty) to get the "starting point" for the prediction, then you run it again with a positive prompt to get the direction in which you want to go, and then you combine them. For example, let's assume your positive prompt is "dog" and your negative prompt is empty. Running the model with the empty prompt will generate a "neutral" prediction, and then you nudge it in the direction of your positive prompt, in the direction of a "dog". Do this for 20 steps and you get an image of a dog.
Now, for Flux the equation looks like this:
next_latent = latent + dt * model(latent, positive_prompt_emb)
The guidance here was distilled into the model. It's cheaper to do inference with, but now we can't really train the model much without destroying this embedded guidance (the model will just forget it and collapse). There's also an issue of training dynamics: we don't know exactly how they trained their models, so it's impossible for us to jerry-rig our training runs in a similar way, and if you don't match the original training dynamics when fine-tuning, it also negatively affects the model.
So you might ask here - what if we just train the model for a really long time - will it be able to recover? And the answer is yes, but at that point most of the original model will essentially be overwritten. People actually did this for Flux Schnell, but you need way more resources to pull it off and the results can be disappointing: https://huggingface.co/lodestones/Chroma
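To make the difference concrete, here's the same per-step math as toy Python (the names are illustrative, not any real library's API): classic CFG pays for two forward passes per step and computes the guidance explicitly, while the guidance-distilled model does one pass with the guidance baked into its weights, and that baked-in behavior is exactly what further training tends to erase.

# Toy sketch of the two sampling-step rules described above.
# `model` stands for the diffusion backbone returning a velocity/noise prediction.

def cfg_step(model, latent, pos_emb, neg_emb, dt, cfg_scale):
    # Classic classifier-free guidance: two forward passes per step.
    pos = model(latent, pos_emb)            # prediction conditioned on the prompt
    neg = model(latent, neg_emb)            # "neutral" prediction (empty/negative prompt)
    guided = neg + cfg_scale * (pos - neg)  # push away from neutral, toward the prompt
    return latent + dt * guided

def distilled_step(model, latent, pos_emb, dt):
    # Guidance-distilled (Flux-style): one forward pass, guidance baked into the weights.
    # Cheaper at inference, but fine-tuning too hard erases that baked-in guidance.
    return latent + dt * model(latent, pos_emb)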
Quite a lot. Search for "Chroma" (which was a partial-ish retraining of Flux Schnell) or Pony (which was a partial-ish retraining of SDXL). You're probably looking at a cost of at least tens of thousands or even hundreds of thousands of dollars. Even bigger SDXL community finetunes like bigASP cost thousands.
And it's not only the compute that's the issue. You also need a ton of data. You need a big dataset, with millions of images, and you need it cleaned, filtered, and labeled.
And of course you need someone who knows what they're doing. Training these state-of-the-art models takes quite a bit of skill, especially since a lot of it is pretty much a black art.
I did try Chroma and I was quite disappointed; what I got out looked nowhere near as good as what was advertised. Now I have a better understanding of why.