Karpathy on DeepSeek-OCR paper: Are pixels better inputs to LLMs than text?

1. varispeed ◴[22 Oct 25 22:46 UTC] No.45676118[source]▶

Text is linear, whereas an image is parallel. When people read, they often don't scan the text from left to right (or in another direction, depending on the language), but rather take it in all at once or non-linearly. They first lock on to keywords and then read the adjacent words to get the meaning, often unconsciously skipping filler sentences.

Sequential reading of text is very inefficient.

replies(4): >>45676232 #>>45676919 #>>45677443 #>>45677649 #

2. sosodev ◴[22 Oct 25 23:00 UTC] No.45676232[source]▶

>>45676118 (TP) #

LLMs don't "read" text sequentially, right?

replies(1): >>45676349 #

3. olliepro ◴[22 Oct 25 23:14 UTC] No.45676349[source]▶

>>45676232 #

The causal masking means future tokens don’t affect previous tokens’ embeddings as they evolve throughout the model, but all tokens are processed in parallel… so, yes and no. See this previous HN post (https://news.ycombinator.com/item?id=45644328) about how bidirectional encoders are similar to diffusion’s non-linear way of generating text. Vision transformers use bidirectional encoding b/c of the non-causal nature of image pixels.
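A rough sketch of what the mask does to attention weights, in plain Python. The score matrix is made up for illustration and `attention_weights` is a hypothetical helper, not any library's API. Every row is computed independently of the others (so it can run in parallel), but under the causal mask row i only gets weight on positions 0..i, whereas the bidirectional version spreads weight over the whole sequence.

```python
import math

def attention_weights(scores, causal):
    """Row-wise softmax over a square score matrix, optionally causally masked."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Under a causal mask, position i may only attend to positions 0..i.
        allowed = [j for j in range(n) if (j <= i or not causal)]
        # Subtract the row maximum before exponentiating for numerical stability.
        peak = max(scores[i][j] for j in allowed)
        exps = {j: math.exp(scores[i][j] - peak) for j in allowed}
        total = sum(exps.values())
        # Masked-out positions get exactly zero weight.
        weights.append([exps.get(j, 0.0) / total for j in range(n)])
    return weights

# Hypothetical raw attention scores for a 4-token sequence.
scores = [
    [0.2, 1.0, 0.5, 0.1],
    [0.7, 0.3, 0.9, 0.4],
    [0.1, 0.8, 0.6, 1.2],
    [0.5, 0.5, 0.2, 0.9],
]

causal_w = attention_weights(scores, causal=True)   # decoder-style (LLM)
bidir_w = attention_weights(scores, causal=False)   # encoder-style (e.g. ViT)

for row in causal_w:
    print([round(w, 2) for w in row])
```

The causal matrix comes out lower-triangular; the bidirectional one has no zeros.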

replies(1): >>45676819 #

4. Merik ◴[23 Oct 25 00:22 UTC] No.45676819{3}[source]▶

>>45676349 #

Didn’t Anthropic show that the models engage in a form of planning, such that they predict possible future tokens that then affect the prediction of the next token? https://transformer-circuits.pub/2025/attribution-graphs/bio...

replies(1): >>45677066 #

5. spiralcoaster ◴[23 Oct 25 00:40 UTC] No.45676919[source]▶

>>45676118 (TP) #

What people do you know that do this? I absolutely read in a linear fashion unless I'm deliberately skimming something to get the gist of it. Who can read the text "all at once"?!

replies(2): >>45677117 #>>45677476 #

6. ACCount37 ◴[23 Oct 25 01:06 UTC] No.45677066{4}[source]▶

>>45676819 #

Sure, an LLM can start "preparing" for token N+4 at token N. But that doesn't change the fact that token N can't "see" token N+1.

Causality is enforced in LLMs - past tokens can affect future tokens, but not the other way around.
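To make that concrete, here is a toy check in plain Python. It is a sketch under heavy assumptions: a single attention head, identity query/key/value projections, and made-up 2-d embeddings; `self_attention` is a hypothetical helper, not a real model's code. Changing only the last token leaves every earlier position's output untouched under the causal mask, while the unmasked (bidirectional) version changes them.

```python
import math

def self_attention(embeddings, causal):
    """Single-head self-attention with identity projections (toy simplification)."""
    n = len(embeddings)
    dim = len(embeddings[0])
    outputs = []
    for i in range(n):
        # Causal: position i sees positions 0..i. Bidirectional: it sees everything.
        visible = range(i + 1) if causal else range(n)
        scores = [
            sum(a * b for a, b in zip(embeddings[i], embeddings[j])) / math.sqrt(dim)
            for j in visible
        ]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        # Output is the attention-weighted average of the visible embeddings.
        out = [0.0] * dim
        for e, j in zip(exps, visible):
            for d in range(dim):
                out[d] += (e / total) * embeddings[j][d]
        outputs.append(out)
    return outputs

# Made-up embeddings for a 4-token sequence.
original = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
edited = [row[:] for row in original]
edited[3] = [-2.0, 3.0]  # change only the last ("future") token

# Compare the outputs at the first three positions before and after the edit.
causal_same = self_attention(original, True)[:3] == self_attention(edited, True)[:3]
bidir_same = self_attention(original, False)[:3] == self_attention(edited, False)[:3]

print("causal, earlier outputs unchanged:", causal_same)
print("bidirectional, earlier outputs unchanged:", bidir_same)
```

The last position's own output does change in both modes, since it always attends to itself.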

7. numpad0 ◴[23 Oct 25 01:14 UTC] No.45677117[source]▶

>>45676919 #

I don't know how common it is, but I tend to read novels in a buttered heterogeneous multithreading mode: image, logical, and emotional readings each go at their own pace, rather than a singular OCR engine feeding them all with 1D text.

Is that crazy? I'm not buying that it is.

replies(2): >>45677217 #>>45677762 #

8. bigbluedots ◴[23 Oct 25 01:32 UTC] No.45677217{3}[source]▶

>>45677117 #

Don't know, probably? I'm a linear reader

9. ants_everywhere ◴[23 Oct 25 02:16 UTC] No.45677443[source]▶

>>45676118 (TP) #

some of us with ADHD just kind of read all the words at once

10. ants_everywhere ◴[23 Oct 25 02:21 UTC] No.45677476[source]▶

>>45676919 #

I do this. I'm autistic and have ADHD so I'm not representative of the normal person. However, I don't think this is entirely uncommon.

The relevant technical term is "saccade"

> ADHD: Studies have shown a consistent reduction in ability to suppress unwanted saccades, suggesting an impaired functioning of areas like the dorsolateral prefrontal cortex.

> Autism: An elevated number of antisaccade errors has been consistently reported, which may be due to disturbances in frontal cortical areas.

https://eyewiki.org/Saccade

Also see https://en.wikipedia.org/wiki/Eye_movement_in_reading

replies(1): >>45677656 #

11. jb1991 ◴[23 Oct 25 02:47 UTC] No.45677649[source]▶

>>45676118 (TP) #

I think you’re making a lot of assumptions about how people read.

replies(1): >>45678093 #

12. alwa ◴[23 Oct 25 02:48 UTC] No.45677656{3}[source]▶

>>45677476 #

I do this too. I suspect it may involve a subtly different mechanism from the saccade itself though? If the saccade is the behavior, and per the eyewiki link skimming is a voluntary type of saccade, there’s still the question of what leads me to use that behavior when I read (and others to read more linearly). Although you could certainly watch my eyes “saccade” around as I move nonlinearly through a passage, I’m not sure it’s out of a lack of control.

Rather, I feel like I absorb written meaning in units closer to paragraphs than to words or sentences. I’d describe my rapid up-and-down, back-and-forth eye motions as something closer to going back to soak up more, if that makes sense. To reinterpret it in the context of what came after it. The analogy that comes to mind is to a Progressive JPEG getting crisper as more loads.

That eyewiki entry was really cool. Among the unexpectedly interesting bits:

> The initiation of a saccade takes about 200 milliseconds[4]. Saccades are said to be ballistic because the movements are predetermined at initiation, and the saccade generating system cannot respond to subsequent changes in the position of the target after saccade initiation[4].

13. alwa ◴[23 Oct 25 03:07 UTC] No.45677762{3}[source]▶

>>45677117 #

That description feels relatable to me. Maybe buffered more than buttered, in my case ;)

It seems to me that would be a tick in the “pro” column for this idea of using pixels (or contours, a la JPEG) as the models’ fundamental stimulus to train against (as opposed to textual tokens). Isn’t there a comparison to be drawn between the “threads” you describe here and the multi-head attention mechanisms (or whatever they are) that LLMs use to weigh associations at various distances between tokens?

14. com2kid ◴[23 Oct 25 04:11 UTC] No.45678093[source]▶

>>45677649 #

He isn't; plenty of studies have been done on the topic. Eyes dart around a lot when reading.

replies(1): >>45678260 #

15. jb1991 ◴[23 Oct 25 04:52 UTC] No.45678260{3}[source]▶

>>45678093 #

People do skip words or scan for key phrases, but reading still happens in sequence. The brain depends on word order and syntax to make sense of text, so you cannot truly read it all at once. Skimming just means you sample parts of a linear structure, not that reading itself is non-linear. Eye-tracking studies confirm this sequential processing (check out the Rayner study in Psychological Bulletin if you are interested).