If we had a million times the compute? We might have brute forced our way to AGI by now.
"Why aren't we doing X?", where X is a thing that sounds sensible, seems like it would help, and does indeed help, and there's even a paper here proving that it helps.
The answer is: check the paper; it says on page 12, in a throwaway line, that they used 3x the compute for the new method compared to the controls. And the gain was +4%.
A lot of promising things are resource hogs, and there are too many better things to burn the GPU-hours on.
But in the DeepSeek-OCR paper, compressing more text into the same number of visual input tokens leads to progressively worse output precision. It's not a free lunch but a speed-quality tradeoff, and more fine-grained KV-cache compression methods might deliver better speedups without degrading the output as much.
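To make the KV-cache angle concrete, here's a toy sketch (my own illustration, not DeepSeek-OCR's method or any particular paper's) of the crudest possible cache compression: mean-pooling every block of 4 cache entries, which cuts cache memory 4x at the cost of blurring what the model can attend back to. The "fine-grained" methods I mean are smarter variants of this basic idea.

```python
import numpy as np

def compress_kv(keys: np.ndarray, values: np.ndarray, block: int = 4):
    """Shrink a (seq_len, d) KV cache by mean-pooling each block of entries."""
    seq_len, d = keys.shape
    usable = (seq_len // block) * block  # drop the ragged tail for simplicity
    k = keys[:usable].reshape(-1, block, d).mean(axis=1)  # (seq_len // block, d)
    v = values[:usable].reshape(-1, block, d).mean(axis=1)
    return k, v

keys = np.random.randn(1000, 64)
values = np.random.randn(1000, 64)
ck, cv = compress_kv(keys, values, block=4)
print(keys.shape, "->", ck.shape)  # (1000, 64) -> (250, 64)
```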
Removing the tokenizer would quarter the context and 4x the compute and memory, assuming an average token length of 4 characters.
Also, you would probably need to 4x the parameters, since the model has to learn relationships between individual characters as well as words, sentences, etc.
There have been a few studies on small models; even then, those only show a tiny percentage gain over tokenized models.
So essentially you would need 4x compute, 1/4 the context, and 4x the parameters to squeeze 2-4% more performance out of it.
And that falls apart when you need more than 1/4 of the context. So realistically you need to support the same context, and your compute goes up another 4x, to 16x.
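Spelling out the arithmetic (my own back-of-envelope, assuming ~4 characters per token, attention cost growing quadratically with sequence length, and FFN cost growing linearly; the 4x parameter figure is the same assumption as above):

```python
# Rough scaling of dropping the tokenizer and feeding raw characters.
AVG_TOKEN_LEN = 4  # assumed characters per token

# Same window measured in tokens -> character-level input covers 1/4 the text.
context_fraction = 1 / AVG_TOKEN_LEN

# To cover the SAME amount of text, the character-level sequence is 4x longer.
seq_len_multiplier = AVG_TOKEN_LEN

linear_cost = seq_len_multiplier          # per-token / FFN cost: ~4x
quadratic_cost = seq_len_multiplier ** 2  # attention cost: ~16x
param_multiplier = 4                      # assumed extra capacity for character-level modeling

print(f"context per fixed token budget: x{context_fraction}")
print(f"compute for same text: x{linear_cost} (linear terms) to x{quadratic_cost} (attention)")
print(f"parameters: x{param_multiplier}")
```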
That's why it doesn't get done.