
221 points lnyan | 3 comments
1. b0a04gl No.44397783
The image gets compressed into 256 tokens before the language model sees it. Ask it to add a hat and it redraws the whole face, because objects aren't stored as separate things: there's no persistent bear in memory. Everything lives inside one fused latent soup, and outputs are fresh samples under new constraints. Every prompt tweak rebalances the whole embedding, which is why even small changes ripple across the image. It behaves like single-shot scene synthesis, which is good for different use cases.
replies(1): >>44398092
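
To make the "fused latent soup" concrete, here is a minimal sketch of VQ-style image tokenization, assuming a 16x16 patch grid, a random matrix standing in for the learned encoder, and illustrative sizes; none of this is gpt-image-1's actual internals:

    import numpy as np

    rng = np.random.default_rng(0)

    CODEBOOK_SIZE = 8192   # assumed visual vocabulary size
    EMBED_DIM = 64         # assumed embedding width per token
    GRID = 16              # 16 x 16 patches -> 256 tokens per image
    PATCH = 16             # patch side in pixels, so images are 256x256

    codebook = rng.normal(size=(CODEBOOK_SIZE, EMBED_DIM))
    # Stand-in for a learned patch encoder: a fixed random projection.
    proj = rng.normal(size=(PATCH * PATCH * 3, EMBED_DIM))

    def tokenize(image):
        """Compress a 256x256 RGB image into 256 discrete token ids.

        Each patch embedding is snapped to its nearest codebook entry.
        There is no slot for "the bear" or "the hat": every token mixes
        whatever falls inside its patch, and the scene only exists as
        the fused 256-token sequence.
        """
        tokens = np.empty(GRID * GRID, dtype=np.int64)
        for idx in range(GRID * GRID):
            i, j = divmod(idx, GRID)
            patch = image[i*PATCH:(i+1)*PATCH, j*PATCH:(j+1)*PATCH].reshape(-1)
            emb = patch @ proj                              # project patch to embedding space
            dists = np.linalg.norm(codebook - emb, axis=1)  # nearest-neighbour quantization
            tokens[idx] = np.argmin(dists)
        return tokens

    image = rng.random((256, 256, 3))
    print(tokenize(image).shape)  # (256,) -- the whole scene as one token sequence

Since an edit conditions a fresh sample of all 256 tokens on the new prompt, there is no mechanism to pin the "face" tokens in place, which is the ripple effect described above.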
2. leodriesch No.44398092
That's what I really like about Flux Kontext: it has editing capabilities similar to the multimodal models, but doesn't mess up the details. Editing with gpt-image-1 only really works for complete style changes like "make this Ghibli", not for adding glasses to a photorealistic image while retaining all the other details.
replies(1): >>44399346
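
A minimal sketch of the kind of instruction-based local edit being described, assuming diffusers' FluxKontextPipeline and the FLUX.1-Kontext-dev checkpoint are available (exact class and model names depend on your diffusers version; the file name and guidance value are illustrative):

    import torch
    from diffusers import FluxKontextPipeline
    from diffusers.utils import load_image

    pipe = FluxKontextPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
    ).to("cuda")

    source = load_image("portrait.png")  # hypothetical photorealistic input

    # A local edit: the instruction names one change, and the model is
    # expected to keep lighting, skin texture, and background intact.
    edited = pipe(
        image=source,
        prompt="add a pair of thin wire-frame glasses, change nothing else",
        guidance_scale=2.5,
    ).images[0]
    edited.save("portrait_glasses.png")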
3. vunderba No.44399346
Agreed. Kontext's ability to basically do the equivalent of img2img inpainting is hugely impressive.

Even when used to add new details, it sticks very strongly to the existing image's overall aesthetic.

https://specularrealms.com/ai-transcripts/experiments-with-f...
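
For contrast with Kontext's mask-free editing, classic img2img inpainting needs an explicit mask saying which pixels the model may touch. A minimal sketch using diffusers' StableDiffusionInpaintPipeline (the file names are hypothetical):

    import torch
    from diffusers import StableDiffusionInpaintPipeline
    from diffusers.utils import load_image

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
    ).to("cuda")

    image = load_image("scene.png")      # hypothetical source image
    mask = load_image("scene_mask.png")  # white = repaint, black = keep

    # Only the masked region is regenerated; the rest is carried through
    # unchanged, which is the behavior Kontext approximates without a mask.
    result = pipe(prompt="a red knitted hat", image=image, mask_image=mask).images[0]
    result.save("scene_edited.png")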