Interesting - these models are all trained to do pixel-level(ish) measurement now, for bounding boxes and such. I wonder if you could railroad it into being accurate with the right prompt.
Feels like the "right" approach would be to have it write some code to measure how far off the elements are in the original vs recreated image, and then iterate using the numerical output of the program...