Karpathy on DeepSeek-OCR paper: Are pixels better inputs to LLMs than text?

1. yunwal ◴[21 Oct 25 20:11 UTC] No.45661042[source]▶

> The more interesting part for me (esp as a computer vision at heart who is temporarily masquerading as a natural language person) is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible, at the input.

> Maybe it makes more sense that all inputs to LLMs should only ever be images.

So, what, every time I want to ask an LLM a question I paint a picture? I mean at that point why not just say "all input to LLMs should be embeddings"?

replies(4): >>45661392 #>>45675872 #>>45676027 #>>45678135 #

2. smegma2 ◴[21 Oct 25 20:39 UTC] No.45661392[source]▶

>>45661042 (TP) #

No? He’s talking about rendered text

replies(1): >>45675927 #

3. fspeech ◴[22 Oct 25 22:18 UTC] No.45675872[source]▶

>>45661042 (TP) #

If you can read your input on your screen, your computer apparently already knows how to convert your text to images.

4. rhdunn ◴[22 Oct 25 22:23 UTC] No.45675927[source]▶

>>45661392 #

From the post, he's referring to text input as well:

> Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in:

Italicized emphasis mine.

So he's suggesting (or at least wondering whether) the vision model should be the only input to the LLM, with that model reading the text. There would then be a rasterization step on the text input to generate the image.

Thus, you don't need to draw a picture; you just generate a raster of the text and feed that to the vision model.
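
To make the rasterization step concrete, here is a minimal sketch of turning a string into a pixel grid. The 3x5 bitmap font and the `rasterize` helper are hypothetical stand-ins invented for illustration; a real pipeline would use an actual font renderer, and nothing here reflects how DeepSeek-OCR or any particular vision model prepares its input.

```python
# Toy 3x5 bitmap font (hypothetical; covers only the glyphs used below).
FONT = {
    "H": ["101", "101", "111", "101", "101"],
    "I": ["111", "010", "010", "010", "111"],
    " ": ["000", "000", "000", "000", "000"],
}
GLYPH_HEIGHT = 5


def rasterize(text):
    """Return a 2D list of 0/1 pixels: one glyph per character, 1-pixel gap between glyphs."""
    rows = [[] for _ in range(GLYPH_HEIGHT)]
    for index, char in enumerate(text):
        glyph = FONT[char]  # raises KeyError for characters outside the toy font
        for y in range(GLYPH_HEIGHT):
            if index > 0:
                rows[y].append(0)  # blank column separating adjacent glyphs
            rows[y].extend(int(bit) for bit in glyph[y])
    return rows


pixels = rasterize("HI")
for row in pixels:
    # Print the raster as ASCII art so the "image" is visible in a terminal.
    print("".join("#" if p else "." for p in row))
```

The resulting grid of pixels, rather than the original characters, is what would be handed to the vision encoder in the setup described above.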

5. CuriouslyC ◴[22 Oct 25 22:34 UTC] No.45676027[source]▶

>>45661042 (TP) #

All inputs being embeddings can work if you have embeddings like Matryoshka ones. The hard part is adaptively selecting the embedding size for a given datum.
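
For readers unfamiliar with Matryoshka-style embeddings, the idea is that a prefix of the full vector is itself meant to be a usable, lower-dimensional embedding. A minimal sketch of the truncation mechanics follows, assuming the vector came from a model trained that way; the `truncate_embedding` helper and the sample values are hypothetical. It does not address the hard part mentioned above, namely choosing the size adaptively per datum.

```python
import math


def truncate_embedding(vector, dim):
    """Keep the first `dim` components and L2-normalize the result."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    prefix = vector[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return prefix  # nothing to normalize for an all-zero prefix
    return [x / norm for x in prefix]


# Hypothetical 4-dimensional embedding, shortened to 2 dimensions.
full = [3.0, 4.0, 12.0, 0.5]
short = truncate_embedding(full, 2)
print(short)
```

Renormalizing after truncation keeps cosine similarity meaningful at the smaller size; whether the shorter vector is good enough for a given input is exactly the selection problem the comment points to.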

6. awesome_dude ◴[23 Oct 25 04:19 UTC] No.45678135[source]▶

>>45661042 (TP) #

I mean, text is, after all, highly stylised images

It's trivial for text to be pasted in and converted to pixels (that's what my computer, and every other computer on the planet, does when showing me text).