> Typing "make the bar a bit more like a fizzbuzz" in some textbox is awful UX compared to, say, clicking on the "bar" and selecting "fizzbuzz" or drag-and-dropping "fizzbuzz" on the "bar" or really anything that takes advantage of the fact we're interacting with a graphical environment to do work on graphics.
That assumes you have a UX capable of determining what you're clicking on in the generated image (which we could call a given if we assume a sufficiently capable AI model, since we're already instructing it to alter the thing). It also assumes the UX can determine from your click that you intended to click on the "bar", not the "greeble" sitting on the bar, or the shadow covering that part of the bar, or anything else that might be in the same Z stack as your intended target. Pixel-bitching adventure games come to mind as an example of how badly this can go for us. And yes, this is solvable: Squeak has a UI where repeatedly clicking in the same spot iterates through the list of possibilities in that Z stack. But it could also get really messy really quickly.
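For what it's worth, the cycling trick itself is simple to implement. Here's a minimal sketch in TypeScript of repeated-click disambiguation, with all names invented for illustration (this is not Squeak's actual Morphic API): each click hit-tests everything under the cursor, and clicking the same spot again advances the selection one step down the Z stack.

```typescript
// Hypothetical sketch of click-to-cycle disambiguation in a Z stack.
interface SceneObject {
  id: string;
  z: number;                      // stacking order, higher = closer to viewer
  contains(x: number, y: number): boolean;
}

class HitCycler {
  private lastPoint: { x: number; y: number } | null = null;
  private cycleIndex = 0;

  constructor(private objects: SceneObject[], private tolerance = 3) {}

  // Returns the object to select for this click. Repeated clicks near the
  // same point walk down the Z stack instead of always picking the top item.
  pick(x: number, y: number): SceneObject | null {
    // All objects under the cursor, topmost first.
    const stack = this.objects
      .filter(o => o.contains(x, y))
      .sort((a, b) => b.z - a.z);
    if (stack.length === 0) return null;

    const samePlace =
      this.lastPoint !== null &&
      Math.abs(x - this.lastPoint.x) <= this.tolerance &&
      Math.abs(y - this.lastPoint.y) <= this.tolerance;

    this.cycleIndex = samePlace ? (this.cycleIndex + 1) % stack.length : 0;
    this.lastPoint = { x, y };
    return stack[this.cycleIndex];
  }
}
```

The messiness shows up exactly in the `tolerance` and cycle-reset rules: whether a click two pixels away counts as "the same spot" is a policy decision with no obviously right answer.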
Then we have to assume your UX can generate an entire list of possible things you might want to do with the thing you've clicked: adding to it, removing it, removing part of it, moving it, transforming its dimensions, altering its colors, altering the material surface, and on and on. And that list of possibilities needs to be navigable and searchable in a way that's faster than just typing "make the bar more like a fizzbuzz" into a context-aware chat box.
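For comparison, the usual way to make such a list navigable is a fuzzy-filtered command palette. A minimal sketch under the same caveat (hypothetical names, not any real toolkit's API): filter the action list by a typed subsequence, so "chc" finds "Change color".

```typescript
// Hypothetical action palette: every operation applicable to the selected
// object, narrowed as the user types.
interface Action {
  label: string;                  // e.g. "Change color", "Move", "Resize"
  run(): void;                    // apply the action to the current selection
}

function filterActions(actions: Action[], query: string): Action[] {
  const q = query.trim().toLowerCase();
  if (q === "") return actions;
  // Simple subsequence match ("chc" matches "Change color"); a real
  // palette would also rank results rather than just filter them.
  return actions.filter(a => {
    let i = 0;
    for (const ch of a.label.toLowerCase()) {
      if (ch === q[i]) i++;
      if (i === q.length) return true;
    }
    return false;
  });
}
```

Note that by the time you're typing into the palette, you're typing either way, which is part of why the comparison with a chat box is so close.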
Again, I'm not arguing the chat interface should be the only interface. In fact, as you point out, we're using a graphical system, so it would be great if you could click on things or select them and have the system work on them too. It should accept more input than just chat. But I still think that for iterating on a fuzzy idea, a chat UI is a useful tool.