The protocol goes roughly like this:
1. I'd like you to display this PNG. Here's the data: ...
2. Ok I've got the data.
3. Ok now display it at this position.
4. Ok now remove it from the screen.
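For the curious, here's a rough Python sketch of what those four steps look like on the wire, going by my reading of the Kitty graphics protocol docs. The function names are made up for illustration, and it only builds the escape strings rather than talking to a real terminal, so treat it as a sketch rather than a working client.

```python
import base64

ESC = "\x1b"
ST = "\x1b\\"  # string terminator that closes the escape sequence


def _graphics_cmd(control: str, payload: str = "") -> str:
    """Wrap control keys (and optional payload) as ESC _G ... ESC \\ ."""
    body = f"{control};{payload}" if payload else control
    return f"{ESC}_G{body}{ST}"


def transmit_png(image_id: int, png_bytes: bytes, chunk_size: int = 4096) -> list[str]:
    """Step 1: send the PNG as base64, split into chunks of at most 4096 bytes.

    a=t means transmit only, f=100 means the payload is PNG,
    m=1 means more chunks follow, m=0 marks the last one.
    """
    data = base64.standard_b64encode(png_bytes).decode("ascii")
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [""]
    out = []
    for n, chunk in enumerate(chunks):
        more = 0 if n == len(chunks) - 1 else 1
        # Only the first chunk carries the full set of control keys.
        control = f"a=t,f=100,i={image_id},m={more}" if n == 0 else f"m={more}"
        out.append(_graphics_cmd(control, chunk))
    return out


# Step 2 is the terminal's reply, which (as I understand it) looks like
# ESC _G i=<id>;OK ESC \ and has to be read back from the tty.


def display_at(image_id: int, row: int, col: int) -> str:
    """Step 3: move the cursor, then place the already-transmitted image there."""
    return f"{ESC}[{row};{col}H" + _graphics_cmd(f"a=p,i={image_id}")


def remove(image_id: int) -> str:
    """Step 4: delete the image's placements from the screen."""
    return _graphics_cmd(f"a=d,d=i,i={image_id}")
```

To actually try it you'd write these strings to stdout in a terminal that supports the protocol and read the reply in step 2 from the tty, which this sketch skips.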
We're talking motion-PNG here. Just think about how awful that is.
I wish someone would add some kind of AV1-over-terminal protocol. That would actually be useful.
The other thing I was going to try was a custom GUI that used normal terminal text for the text of widgets, but Kitty images for the rest. It's quite a hard problem though.