
233 points JnBrymn | 3 comments | | HN request time: 0.001s | source
varispeed ◴[] No.45676118[source]
Text is linear, whereas an image is parallel. I mean that when people read, they often don't scan text from left to right (or in a different direction, depending on the language), but rather take the text in all at once or non-linearly: they first lock onto keywords, then read the adjacent words to get the meaning, often even skipping filler sentences unconsciously.

Sequential reading of text is very inefficient.

replies(4): >>45676232 #>>45676919 #>>45677443 #>>45677649 #
spiralcoaster ◴[] No.45676919[source]
What people do you know that do this? I absolutely read in a linear fashion unless I'm deliberately skimming something to get the gist of it. Who can read the text "all at once"?!
replies(2): >>45677117 #>>45677476 #
1. numpad0 ◴[] No.45677117[source]
I don't know how common it is, but I tend to read novels in a buttered heterogeneous multithreading mode: image, logical, and emotional readings all proceed at their own paces, rather than a singular OCR engine feeding them all with 1D text.

Is that crazy? I'm not buying that it is.

replies(2): >>45677217 #>>45677762 #
2. bigbluedots ◴[] No.45677217[source]
Don't know — probably? I'm a linear reader.
3. alwa ◴[] No.45677762[source]
That description feels relatable to me. Maybe buffered more than buttered, in my case ;)

It seems to me that would be a tick in the "pro" column for this idea of using pixels (or contours, à la JPEG) as the models' fundamental stimulus to train against, as opposed to textual tokens. Isn't there a comparison to be drawn between the "threads" you describe here and the multi-headed attention mechanisms (or whatever it is) that LLMs use to weigh associations at various distances between tokens?
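For readers who want to see the mechanism this comment gestures at: in multi-head attention, each head computes its own weighting over the entire sequence simultaneously, so different heads can attend to associations at different distances in parallel — loosely analogous to the independent "threads" described above. A minimal numpy sketch (all dimensions and weight names here are illustrative, not taken from any particular model):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, wq, wk, wv, wo, n_heads):
    """Minimal multi-head self-attention over a sequence x of shape (seq, d_model)."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    # Project into queries, keys, values, then split the channels into heads.
    q = (x @ wq).reshape(seq, n_heads, d_head).transpose(1, 0, 2)  # (heads, seq, d_head)
    k = (x @ wk).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    v = (x @ wv).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    # Every head sees the whole sequence at once; each gets its own
    # (seq, seq) attention pattern, computed in parallel with the others.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    attn = softmax(scores, axis=-1)
    out = attn @ v  # (heads, seq, d_head)
    # Concatenate the heads back together and apply the output projection.
    out = out.transpose(1, 0, 2).reshape(seq, d_model)
    return out @ wo, attn

# Illustrative toy sizes: 6 tokens, 16-dim embeddings, 4 heads.
rng = np.random.default_rng(0)
seq, d_model, n_heads = 6, 16, 4
x = rng.normal(size=(seq, d_model))
wq, wk, wv, wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
y, attn = multi_head_attention(x, wq, wk, wv, wo, n_heads)
```

The point of the analogy: nothing forces the four heads to agree — one head's attention pattern can concentrate on nearby tokens while another attends to distant ones, which is the "parallel readings over the same text" intuition in a concrete form.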