The right way to handle this is not to build grids and whatnot for it, all of which get blown away by the embedding encoding, but to instruct it to build image processing tools of its own and to mandate their use for constructing the required coordinates, computing the eccentricity of the pattern, and so on, in code and language space. Done this way, you can even get it to write assertion tests comparing the original layout to the final one across various image processing metrics. This would assuredly work better, take far less time, be more stable across iterations, and fit neatly into how a multimodal agentic programming tool actually functions.
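As a rough illustration of the kind of tool and assertion test I mean (the function names, the threshold, and the 0.05 tolerance are mine, purely illustrative), a PIL/numpy sketch that measures a pattern's eccentricity from image moments and checks that the regenerated layout matches the original:

```python
# Hypothetical sketch: measure the eccentricity of the dark pattern in an
# image via second-order moments, then assert the final render matches the
# original within a tolerance. Names and values are illustrative only.
import numpy as np
from PIL import Image

def eccentricity(path, threshold=128):
    """Eccentricity of the dark foreground pattern, from central moments."""
    gray = np.asarray(Image.open(path).convert("L"), dtype=float)
    ys, xs = np.nonzero(gray < threshold)       # foreground = dark pixels
    cx, cy = xs.mean(), ys.mean()               # centroid
    mu20 = ((xs - cx) ** 2).mean()              # central second moments
    mu02 = ((ys - cy) ** 2).mean()
    mu11 = ((xs - cx) * (ys - cy)).mean()
    # eigenvalues of the 2x2 covariance matrix give the axis lengths
    spread = np.sqrt(4 * mu11 ** 2 + (mu20 - mu02) ** 2)
    lam1 = (mu20 + mu02 + spread) / 2
    lam2 = (mu20 + mu02 - spread) / 2
    return float(np.sqrt(1 - lam2 / lam1))

def test_layout_preserved(original="original.png", final="final.png", tol=0.05):
    assert abs(eccentricity(original) - eccentricity(final)) < tol
```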
But this isn’t hugely different from your own vision. You don’t see the pixel grid either; you have to use tools to measure things. You can iteratively interact with the image over time, perhaps by counting grid lines, but the LLM cannot: it gets a one-shot inference against a highly transformed representation of the image. Models have gotten better at complex visual tasks, including some kinds of counting, but they can’t examine the image in any analytical way, or even in its original representation. It’s just not possible.
It can, however, make tools that can. It’s very good at working with PIL and other image processing libraries, or even writing image processing code de novo, and then using those tools to ground itself. Likewise, it cannot do math itself, but it can write a calculator that does highly complex mathematics on its behalf.
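The grid-line case above is a good example of the kind of throwaway measurement code it can write and then trust instead of eyeballing. A minimal sketch, assuming dark lines on a light background; the thresholds and the column-projection trick are my choices, not the only way to do it:

```python
# Hypothetical sketch: count vertical grid lines by thresholding and
# finding columns that are mostly dark. Threshold values are illustrative.
import numpy as np
from PIL import Image

def count_vertical_lines(path, dark=128, coverage=0.5):
    gray = np.asarray(Image.open(path).convert("L"), dtype=float)
    col_darkness = (gray < dark).mean(axis=0)   # fraction of dark pixels per column
    is_line = col_darkness > coverage
    # count runs of consecutive "line" columns as single lines (rising edges)
    return int(np.count_nonzero(is_line[1:] & ~is_line[:-1]) + int(is_line[0]))
```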