
233 points | JnBrymn | 1 comment
1. orliesaurus No.45677945
One of the most interesting aspects of the recent discussion on this topic is how it underscores our reliance on lossy abstractions when representing language for machines. Tokenization is one such abstraction, but it's not the only one; using raw pixels or speech signals is a different kind of approximation.

What excites me about experiments like this is not so much that we'll all be handing images to language models tomorrow, but that researchers are pressure-testing the design assumptions of current architectures. Approaches that learn to align multiple modalities might reveal better latent structures or training regimes, and that could trickle back into more efficient text encoders without throwing away a century of orthography.

But there's also a rich vein to mine in scripts and languages that don't segment neatly into words: alternative encodings might help models handle those better.
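
To make that last point concrete, here's a minimal sketch (my own illustration, with made-up example strings, not anything from the article) of why whitespace-based word tokenization is a lossy assumption for scripts without word boundaries, and how a byte-level view sidesteps segmentation at the cost of longer sequences:

    # Minimal sketch: a whitespace "word" tokenizer works for English but
    # collapses a Chinese sentence into a single token, while a byte-level
    # encoding needs no segmentation assumptions at all.
    text_en = "language models love tokens"
    text_zh = "语言模型喜欢词元"  # Chinese: no spaces between words

    print(text_en.split())  # ['language', 'models', 'love', 'tokens']
    print(text_zh.split())  # ['语言模型喜欢词元'] -- one undivided "word"

    # Byte-level view: no word boundaries required, but longer sequences
    # (each CJK character here is 3 UTF-8 bytes).
    print(len(text_en), "chars ->", len(text_en.encode("utf-8")), "bytes")
    print(len(text_zh), "chars ->", len(text_zh.encode("utf-8")), "bytes")

That trade-off, fewer segmentation assumptions versus longer sequences, is roughly the design pressure the pixel- and byte-level experiments are probing.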