Karpathy on DeepSeek-OCR paper: Are pixels better inputs to LLMs than text?

(twitter.com)

https://xcancel.com/karpathy/status/1980397031542989305

varispeed ◴[22 Oct 25 22:46 UTC] No.45676118[source]▶

Text is linear, whereas an image is parallel. When people read, they often don't scan the text from left to right (or in whatever direction the language uses), but rather take it in all at once or non-linearly: first locking onto keywords, then reading the adjacent words to get the meaning, often unconsciously skipping filler sentences.

Sequential reading of text is very inefficient.

replies(4): >>45676232 #>>45676919 #>>45677443 #>>45677649 #

1. spiralcoaster ◴[23 Oct 25 00:40 UTC] No.45676919[source]▶

>>45676118 #

What people do you know that do this? I absolutely read in a linear fashion unless I'm deliberately skimming something to get the gist of it. Who can read the text "all at once"?!

replies(2): >>45677117 #>>45677476 #

2. numpad0 ◴[23 Oct 25 01:14 UTC] No.45677117[source]▶

>>45676919 (TP) #

I don't know how common it is, but I tend to read novels in a buttered heterogeneous multithreading mode - visual, logical, and emotional readings each go at their own pace, rather than a single OCR engine feeding them all 1D text.

Is that crazy? I'm not buying that it is.

replies(2): >>45677217 #>>45677762 #

3. bigbluedots ◴[23 Oct 25 01:32 UTC] No.45677217[source]▶

>>45677117 #

Don't know, probably? I'm a linear reader.

4. ants_everywhere ◴[23 Oct 25 02:21 UTC] No.45677476[source]▶

>>45676919 (TP) #

I do this. I'm autistic and have ADHD so I'm not representative of the normal person. However, I don't think this is entirely uncommon.

The relevant technical term is "saccade"

> ADHD: Studies have shown a consistent reduction in ability to suppress unwanted saccades, suggesting an impaired functioning of areas like the dorsolateral prefrontal cortex.

> Autism: An elevated number of antisaccade errors has been consistently reported, which may be due to disturbances in frontal cortical areas.

https://eyewiki.org/Saccade

Also see https://en.wikipedia.org/wiki/Eye_movement_in_reading

replies(1): >>45677656 #

5. alwa ◴[23 Oct 25 02:48 UTC] No.45677656[source]▶

>>45677476 #

I do this too. I suspect it may involve a subtly different mechanism from the saccade itself though? If the saccade is the behavior, and per the eyewiki link skimming is a voluntary type of saccade, there’s still the question of what leads me to use that behavior when I read (and others to read more linearly). Although you could certainly watch my eyes “saccade” around as I move nonlinearly through a passage, I’m not sure it’s out of a lack of control.

Rather, I feel like I absorb written meaning in units closer to paragraphs than to words or sentences. I’d describe my rapid up-and-down, back-and-forth eye motions as something closer to going back to soak up more, if that makes sense. To reinterpret it in the context of what came after it. The analogy that comes to mind is to a Progressive JPEG getting crisper as more loads.

That eyewiki entry was really cool. Among the unexpectedly interesting bits:

> The initiation of a saccade takes about 200 milliseconds[4]. Saccades are said to be ballistic because the movements are predetermined at initiation, and the saccade generating system cannot respond to subsequent changes in the position of the target after saccade initiation[4].

6. alwa ◴[23 Oct 25 03:07 UTC] No.45677762[source]▶

>>45677117 #

That description feels relatable to me. Maybe buffered more than buttered, in my case ;)

It seems to me that would be a tick in the “pro” column for this idea of using pixels (or contours, a la JPEG) as the models’ fundamental stimulus to train against (as opposed to textual tokens). Isn’t there a comparison to be drawn between the “threads” you describe here and the multi-headed attention mechanisms (or whatever they are) that LLMs use to weigh associations at various distances between tokens?
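For anyone who wants the multi-head idea made concrete, here is a rough sketch in plain Python. It is a toy under stated assumptions, not how any real LLM is implemented: there are no learned query/key/value projections (each head simply takes its own slice of the embedding), and the names `attention_weights` and `multi_head_weights` are made up for illustration. The only point is that each head computes its own separate weighting over all tokens in parallel, loosely like the separate "threads" described above.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(vectors):
    """Scaled dot-product self-attention weights for a single head.

    Assumption: queries and keys are the input vectors themselves
    (a real model would apply learned projections first).
    """
    dim = len(vectors[0])
    weights = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights.append(softmax(scores))
    return weights


def multi_head_weights(embeddings, num_heads):
    """Split each embedding into num_heads slices and attend per slice."""
    d_model = len(embeddings[0])
    if d_model % num_heads != 0:
        raise ValueError("embedding size must be divisible by num_heads")
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        # Each head sees only its own slice of every token's embedding.
        head_slice = [e[h * d_head:(h + 1) * d_head] for e in embeddings]
        heads.append(attention_weights(head_slice))
    return heads


# Toy input: 3 "tokens" with 4-dimensional embeddings, split across 2 heads.
toy_embeddings = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
]
for index, head in enumerate(multi_head_weights(toy_embeddings, num_heads=2)):
    print("head", index)
    for row in head:
        print("  ", [round(w, 3) for w in row])
```

Each row printed is one token's weighting over all tokens, and the two heads generally disagree, which is the property the "threads" comparison leans on.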
