> Maybe it makes more sense that all inputs to LLMs should only ever be images.
So, what, every time I want to ask an LLM a question I paint a picture? I mean at that point why not just say "all input to LLMs should be embeddings"?
Getting DeepSeek-OCR working on an Nvidia Spark via brute force with Claude Code - https://news.ycombinator.com/item?id=45646559 - Oct 2025 (43 comments)
DeepSeek OCR - https://news.ycombinator.com/item?id=45640594 - Oct 2025 (238 comments)
> Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in:
Italicized emphasis mine.
So he's suggesting (or at least wondering whether) the vision encoder should be the only input path to the LLM, with the model reading all text that way. There would then be a rasterization step on the text input to generate the image.
Thus you don't need to draw a picture; you'd just rasterize the text and feed the resulting image to the vision encoder.
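To make that concrete, here's a minimal sketch of that rasterization step, assuming Pillow is available; the font, canvas size, and layout are arbitrary illustrative choices, not anything taken from DeepSeek-OCR:

```python
from PIL import Image, ImageDraw, ImageFont

def rasterize_text(text: str, width: int = 1024, line_height: int = 20) -> Image.Image:
    """Render plain text onto a white canvas so a vision encoder can consume it."""
    lines = text.splitlines() or [""]
    height = line_height * (len(lines) + 1)
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # stand-in; a real pipeline would fix font, size, and DPI
    for i, line in enumerate(lines):
        draw.text((10, 10 + i * line_height), line, fill="black", font=font)
    return img

# e.g. rasterize_text("What is the capital of France?").save("prompt.png")
# and hand prompt.png to the vision encoder instead of tokenized text
```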
Sequential reading of text is very inefficient.
If we had a million times the compute? We might have brute forced our way to AGI by now.
You can also see this paper from the GLM team, where they explicitly test this assumption with some pretty good results.
"Why aren't we doing X?", where X is a thing that sounds sensible, seems like it would help, and does indeed help, and there's even a paper here proving that it helps.
The answer is: check the paper; it says on page 12, in a throwaway line, that they used three times the compute for the new method as for the controls. And the gain was +4%.
A lot of promising things are resource hogs, and there are too many better things to burn the GPU-hours on.
Is that crazy? I'm not buying that it is.
- https://addons.mozilla.org/en-US/firefox/addon/toxcancel/
- https://chromewebstore.google.com/detail/xcancelcom-redirect...
One thing I like about text tokens, though, is that the model learns some understanding of the text input method (particularly the QWERTY keyboard).
"Hello" and "Hwllo" are closer in semantic space than you'd think because "w" and "e" are next to each other.
This is much easier to see in hand-coded spelling models, where you can get better results by including a "keyboard distance" metric along with a string distance metric.
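A minimal sketch of that idea (hypothetical code, not from any particular spell checker): the substitution cost in a standard edit-distance recursion shrinks when the two keys are physically close on a QWERTY layout.

```python
from functools import lru_cache

# Row/column positions on a simplified QWERTY layout (illustrative only)
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {c: (r, col) for r, row in enumerate(QWERTY_ROWS) for col, c in enumerate(row)}

def key_distance(a: str, b: str) -> float:
    """Euclidean distance between two keys; unknown characters get a fixed penalty."""
    if a == b:
        return 0.0
    if a not in KEY_POS or b not in KEY_POS:
        return 2.0
    (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
    return ((r1 - r2) ** 2 + (c1 - c2) ** 2) ** 0.5

def keyboard_edit_distance(s: str, t: str) -> float:
    """Levenshtein distance where substitution cost is scaled by key proximity."""
    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> float:
        if i == 0:
            return float(j)
        if j == 0:
            return float(i)
        sub = min(1.0, key_distance(s[i - 1], t[j - 1]) / 2.0)  # adjacent keys cost < 1
        return min(d(i - 1, j) + 1.0,      # deletion
                   d(i, j - 1) + 1.0,      # insertion
                   d(i - 1, j - 1) + sub)  # substitution
    return d(len(s), len(t))

# "hwllo" (adjacent-key typo) scores closer to "hello" than "hxllo" does:
print(keyboard_edit_distance("hello", "hwllo"), keyboard_edit_distance("hello", "hxllo"))
```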
Tokenizing text is such a hack even though it works pretty well. The state-of-the-art comes out of the gate with an approximation for quantifying language that's wrong on so many levels.
It's difficult to wrap my head around pixels being a more powerful representation of information, but someone's gotta come up with something other than tokenizers.
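For a concrete sense of that approximation, here's a small sketch using the cl100k_base BPE vocabulary via tiktoken (assuming the library is installed; the example words are arbitrary). Small surface changes, like the adjacent-key typo mentioned upthread, can produce completely different token sequences:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a common BPE vocabulary

for word in ["Hello", "Hwllo", " unhappiness", "antidisestablishmentarianism"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r:>32} -> {len(ids)} token(s): {pieces}")
```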
The relevant technical term is "saccade"
> ADHD: Studies have shown a consistent reduction in ability to suppress unwanted saccades, suggesting an impaired functioning of areas like the dorsolateral prefrontal cortex.
> Autism: An elevated number of antisaccade errors has been consistently reported, which may be due to disturbances in frontal cortical areas.
Also see https://en.wikipedia.org/wiki/Eye_movement_in_reading
But I think in this case you can still generate typos in the rendered images and it'd be learnable, so it's not a hard issue for the OP's approach.
Rather, I feel like I absorb written meaning in units closer to paragraphs than to words or sentences. I’d describe my rapid up-and-down, back-and-forth eye motions as something closer to going back to soak up more, if that makes sense. To reinterpret it in the context of what came after it. The analogy that comes to mind is to a Progressive JPEG getting crisper as more loads.
That eyewiki entry was really cool. Among the unexpectedly interesting bits:
> The initiation of a saccade takes about 200 milliseconds[4]. Saccades are said to be ballistic because the movements are predetermined at initiation, and the saccade generating system cannot respond to subsequent changes in the position of the target after saccade initiation[4].
It seems to me that would be a tick in the “pro” column for this idea of using pixels (or contours, a la JPEG) as the models’ fundamental stimulus to train against (as opposed to textual tokens). Isn’t there a comparison to be drawn between the “threads” you describe here, and the multi-headed attention mechanisms (or whatever it is) that the LLM models use to weigh associations at various distances between tokens?
Actually, thinking more about that: I consume "text" as images and also as sounds. Instead of the render-and-OCR approach this suggests, I kinda wonder whether doing TTS and encoding something like an MP3 sample of the vocalization of the words would be fewer bytes than the rendered-pixels version. Probably depends on the resolution / sample rate.
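A rough back-of-envelope comparison, with made-up but plausible numbers; the speaking rate, bitrate, glyph size, and compression ratio below are all assumptions for illustration, not measurements:

```python
# Bytes to carry ~100 words as compressed speech vs. a rendered text image.
words = 100

# Speech: ~150 words/min spoken, MP3 at 32 kbps (mono, speech-quality)
speech_seconds = words / (150 / 60)
mp3_bytes = speech_seconds * 32_000 / 8

# Rendered text: ~6 chars/word at ~12x20 px per glyph, 1 bit/px, ~10:1 lossless compression
chars = words * 6
raw_pixel_bits = chars * 12 * 20
image_bytes = raw_pixel_bits / 8 / 10

print(f"speech: ~{mp3_bytes / 1024:.1f} KiB for {speech_seconds:.0f}s of audio")
print(f"image:  ~{image_bytes / 1024:.1f} KiB of rendered text")
```

Under these assumptions the rendered image comes out far smaller, but the answer really does swing with the chosen sample rate and rendering resolution.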
But in the DeepSeek-OCR paper, compressing more text into the same number of visual input tokens leads to progressively worse output precision, so it's not a free lunch but a speed-quality tradeoff. More fine-grained KV-cache compression methods might deliver better speedups without degrading the output as much.