
233 points JnBrymn | 2 comments
1. ianbutler No.45676963
https://arxiv.org/abs/2510.17800 (Glyph: Scaling Context Windows via Visual-Text Compression)

You can also see this paper from the GLM team, where they explicitly test this assumption with some pretty good results.

replies(1): >>45677374

2. scotty79 No.45677374
I couldn't imagine how rendering text tokens to images could bring any savings, but then I remembered that each token is converted into hundreds of floating-point numbers before being fed to the neural network. So in a way it's already rendered into a multidimensional pixel (or hundreds of arbitrary 2-dimensional pixels). This paper shows that you don't need that many numbers to keep the accuracy, and that using numbers that represent the text visually (which is pretty chaotic) works just as well as the way we currently do it.
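
Here's a rough back-of-envelope sketch in Python of where the savings could come from. Every constant below (characters per BPE token, rendered glyph size in pixels, ViT patch size) is an illustrative assumption, not a number from the paper:

    # Toy comparison: cost of a passage as text tokens vs. as
    # vision-encoder patches when the text is rendered to an image.
    # All constants are illustrative assumptions, not Glyph's settings.

    TEXT_TOKENS_PER_CHAR = 1 / 4        # rough BPE average: ~4 chars/token
    CHARS_PER_LINE = 80                 # assumed 80-column monospace render
    PX_PER_CHAR_W, PX_PER_CHAR_H = 4, 8 # assumed glyph size in pixels
    PATCH_SIZE = 16                     # typical ViT patch edge (16x16 px)

    def text_token_count(n_chars: int) -> float:
        # Tokens if the passage is fed to the model as plain text.
        return n_chars * TEXT_TOKENS_PER_CHAR

    def image_patch_count(n_chars: int) -> float:
        # Vision tokens if the same passage is rendered to an image
        # and cut into PATCH_SIZE x PATCH_SIZE pixel patches.
        n_lines = n_chars / CHARS_PER_LINE
        width_px = CHARS_PER_LINE * PX_PER_CHAR_W
        height_px = n_lines * PX_PER_CHAR_H
        return (width_px / PATCH_SIZE) * (height_px / PATCH_SIZE)

    n = 100_000  # a 100k-character document
    print(f"text tokens:   {text_token_count(n):,.0f}")   # 25,000
    print(f"image patches: {image_patch_count(n):,.0f}")  # 12,500

Under these toy numbers, a 16x16 patch covers about 8 characters versus roughly 4 per BPE token, so you get about a 2x reduction; the achievable ratio comes down entirely to how small you can render the text while the vision encoder can still read it.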