
221 points lnyan | 3 comments
1. b0a04gl No.44397783
The image gets compressed into 256 tokens before the language model sees it. Ask it to add a hat and it redraws the whole face, because objects aren't stored as separate things: there's no persistent bear in memory. Everything lives inside one fused latent soup, and outputs are fresh samples under new constraints. Every prompt tweak rebalances the whole embedding, which is why even small changes ripple across the image. It behaves like single-shot scene synthesis, which is good for different use cases.
replies(1): >>44398092
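
To make the "fused latent soup" concrete, here is a minimal sketch of VQ-style image tokenization, assuming a 16x16 patch grid, a random matrix standing in for the learned encoder, and illustrative sizes; none of this is gpt-image-1's actual internals:

    import numpy as np

    rng = np.random.default_rng(0)

    CODEBOOK_SIZE = 8192   # assumed visual vocabulary size
    EMBED_DIM = 64         # assumed embedding width per token
    GRID = 16              # 16 x 16 patches -> 256 tokens per image
    PATCH = 16             # patch side in pixels, so images are 256x256

    codebook = rng.normal(size=(CODEBOOK_SIZE, EMBED_DIM))
    # Stand-in for a learned patch encoder: a fixed random projection.
    proj = rng.normal(size=(PATCH * PATCH * 3, EMBED_DIM))

    def tokenize(image):
        """Compress a 256x256 RGB image into 256 discrete token ids.

        Each patch embedding is snapped to its nearest codebook entry.
        There is no slot for "the bear" or "the hat": every token mixes
        whatever falls inside its patch, and the scene only exists as
        the fused 256-token sequence.
        """
        tokens = np.empty(GRID * GRID, dtype=np.int64)
        for idx in range(GRID * GRID):
            i, j = divmod(idx, GRID)
            patch = image[i*PATCH:(i+1)*PATCH, j*PATCH:(j+1)*PATCH].reshape(-1)
            emb = patch @ proj                              # project patch to embedding space
            dists = np.linalg.norm(codebook - emb, axis=1)  # nearest-neighbour quantization
            tokens[idx] = np.argmin(dists)
        return tokens

    image = rng.random((256, 256, 3))
    print(tokenize(image).shape)  # (256,) -- the whole scene as one token sequence

Since an edit conditions a fresh sample of all 256 tokens on the new prompt, there is no mechanism to pin the "face" tokens in place, which is the ripple effect described above.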
2. leodriesch No.44398092
That's what I really like about Flux Kontext: it has editing capabilities similar to the multimodal models, but doesn't mess up the details. Editing with gpt-image-1 only really works for complete style changes like "make this Ghibli", not for adding glasses to a photorealistic image while retaining all the other details.
replies(1): >>44399346
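
A minimal sketch of the kind of instruction-based local edit being described, assuming diffusers' FluxKontextPipeline and the FLUX.1-Kontext-dev checkpoint are available (exact class and model names depend on your diffusers version; the file name and guidance value are illustrative):

    import torch
    from diffusers import FluxKontextPipeline
    from diffusers.utils import load_image

    pipe = FluxKontextPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
    ).to("cuda")

    source = load_image("portrait.png")  # hypothetical photorealistic input

    # A local edit: the instruction names one change, and the model is
    # expected to keep lighting, skin texture, and background intact.
    edited = pipe(
        image=source,
        prompt="add a pair of thin wire-frame glasses, change nothing else",
        guidance_scale=2.5,
    ).images[0]
    edited.save("portrait_glasses.png")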
3. vunderba No.44399346
Agreed. Kontext's ability to basically do the equivalent of img2img inpainting is hugely impressive.

Even when used to add new details, it sticks very strongly to the existing image's overall aesthetic.

https://specularrealms.com/ai-transcripts/experiments-with-f...
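
For contrast with Kontext's mask-free editing, classic img2img inpainting needs an explicit mask saying which pixels the model may touch. A minimal sketch using diffusers' StableDiffusionInpaintPipeline (the file names are hypothetical):

    import torch
    from diffusers import StableDiffusionInpaintPipeline
    from diffusers.utils import load_image

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
    ).to("cuda")

    image = load_image("scene.png")      # hypothetical source image
    mask = load_image("scene_mask.png")  # white = repaint, black = keep

    # Only the masked region is regenerated; the rest is carried through
    # unchanged, which is the behavior Kontext approximates without a mask.
    result = pipe(prompt="a red knitted hat", image=image, mask_image=mask).images[0]
    result.save("scene_edited.png")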