It might be that our current tokenization is inefficient compared to how well an image pipeline can compress the same information. Language already does a lot of compression, but there might be an even better way to represent it in latent space.
replies(3):
But in the DeepSeek-OCR paper, compressing more text into the same number of visual input tokens leads to progressively worse output precision, so it's not a free lunch but a speed-quality tradeoff. More fine-grained KV-cache compression methods might deliver better speedups without degrading the output as much.
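To make "fine-grained KV-cache compression" concrete, here is a minimal sketch of one common family of such methods: evicting cached key/value entries that have attracted the least attention so far (in the spirit of "heavy hitter" eviction), rather than squeezing the input into fewer visual tokens up front. This is not the paper's method; the function name, shapes, and keep_ratio parameter are all assumptions chosen for illustration.

```python
# Illustrative sketch only: score-based KV-cache eviction for one attention head.
# Not DeepSeek-OCR's approach; shapes and policy are assumed for clarity.
import numpy as np

def compress_kv_cache(keys, values, attn_weights, keep_ratio=0.5):
    """Keep the cache entries that have received the most attention so far.

    keys, values:  (seq_len, head_dim) cached tensors for one attention head
    attn_weights:  (num_queries, seq_len) attention probabilities already
                   computed for recent queries
    keep_ratio:    fraction of cache entries to retain (hypothetical knob)
    """
    seq_len = keys.shape[0]
    n_keep = max(1, int(seq_len * keep_ratio))

    # Score each cached position by the total attention it attracted.
    scores = attn_weights.sum(axis=0)                # (seq_len,)

    # Keep the highest-scoring positions, preserving their original order.
    keep_idx = np.sort(np.argsort(scores)[-n_keep:])

    return keys[keep_idx], values[keep_idx], keep_idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq_len, head_dim, num_queries = 16, 8, 4

    keys = rng.normal(size=(seq_len, head_dim))
    values = rng.normal(size=(seq_len, head_dim))

    # Fake attention weights for a few recent queries (rows sum to 1).
    logits = rng.normal(size=(num_queries, seq_len))
    attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

    k_c, v_c, kept = compress_kv_cache(keys, values, attn, keep_ratio=0.5)
    print(f"cache compressed from {seq_len} to {len(kept)} entries")
    print("kept positions:", kept)
```

The point of the contrast: this kind of compression happens per layer and per head after the text has been read at full fidelity, so the quality hit can be tuned independently of how the input was tokenized, whereas packing more text into a fixed budget of visual tokens loses detail before the model ever sees it.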