Karpathy on DeepSeek-OCR paper: Are pixels better inputs to LLMs than text?

(twitter.com)

369 points JnBrymn | 1 comment | 21 Oct 25 17:43 UTC

https://xcancel.com/karpathy/status/1980397031542989305

tcdent ◴[23 Oct 25 02:16 UTC] No.45677440[source]▶

"Kill the tokenizer" is such a wild proposition but is also founded in fundamentals.

Tokenizing text is such a hack even though it works pretty well. The state-of-the-art comes out of the gate with an approximation for quantifying language that's wrong on so many levels.

It's difficult to wrap my head around pixels being a more powerful representation of information, but someone's gotta come up with something other than a tokenizer.

replies(5): >>45677780 #>>45678765 #>>45680186 #>>45680335 #>>45681755 #

dgently7 ◴[23 Oct 25 03:10 UTC] No.45677780[source]▶

>>45677440 #

As a vision-capable person, I consume all text as images when I read, so it kinda passes the "evolution does it that way" test, and maybe we shouldn’t be that surprised that vision is a great input method?

Actually, thinking more about that, I consume “text” as images and also as sounds… I kinda wonder: if, instead of render-and-OCR like this suggests, we did TTS and just encoded the MP3 sample of the vocalization of each word, would that be fewer bytes than the rendered-pixels version? Probably depends on the resolution / sample rate.
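
A rough back-of-envelope sketch of that comparison, for anyone who wants to plug in their own numbers. Every parameter here is an assumption chosen for illustration, not something from the paper or the thread: an uncompressed 1-bit 8x16 glyph cell per character, about 400 ms of speech per word, and a constant 32 kbit/s MP3 stream with container and frame overhead ignored.

```python
def text_bytes(word: str) -> int:
    """Size of the word as raw UTF-8 text."""
    return len(word.encode("utf-8"))


def pixel_bytes(word: str, char_w: int = 8, char_h: int = 16,
                bits_per_pixel: int = 1) -> int:
    """Size of an uncompressed bitmap with one fixed-size cell per character.

    The 8x16 1-bit cell is an assumed rendering, not a real OCR input format.
    """
    bits = len(word) * char_w * char_h * bits_per_pixel
    return (bits + 7) // 8  # round up to whole bytes


def audio_bytes(duration_ms: int = 400, bitrate_bps: int = 32_000) -> int:
    """Size of a constant-bitrate audio clip, ignoring header/frame overhead.

    Both the spoken duration and the bitrate are assumed values.
    """
    return duration_ms * bitrate_bps // 8_000  # ms * bit/s -> bytes


word = "tokenizer"
print("text :", text_bytes(word), "bytes")
print("image:", pixel_bytes(word), "bytes")
print("audio:", audio_bytes(), "bytes")
```

Under these particular assumptions the audio clip comes out largest and the raw text smallest, but the ordering of image versus audio flips easily with a higher-resolution render or a lower bitrate, which is the "depends on the resolution / sample rate" point.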

replies(2): >>45678180 #>>45678258 #

visarga ◴[23 Oct 25 04:30 UTC] No.45678180[source]▶

>>45677780 #

Funny, I habitually read while running TTS on the same text. I have even made a Chrome extension for web reading; it highlights text and reads it, while keeping the current position in the viewport. I find using 2 modalities at the same time improves my concentration. TTS is sped up to 1.5x to match reading speed. Maybe it is just because I want to reduce visual strain. Since I consume a lot of text every day, it can be tiring.

replies(4): >>45678346 #>>45678903 #>>45680225 #>>45681161 #

1. gavinray ◴[23 Oct 25 10:13 UTC] No.45680225[source]▶

>>45678180 #

Any chance you could share the source?

I found that I can read better if individual words or chunks are highlighted in alternating pastel colors while I scan them with my eyes.
