Karpathy on DeepSeek-OCR paper: Are pixels better inputs to LLMs than text?

(twitter.com)

233 points JnBrymn | 3 comments | 21 Oct 25 17:43 UTC

https://xcancel.com/karpathy/status/1980397031542989305

sabareesh ◴[22 Oct 25 22:18 UTC] No.45675879[source]▶

It might be that our current tokenization is inefficient compared to how well the image pipeline does. Language already does a lot of compression, but there might be an even better way to represent it in latent space.

replies(3): >>45675953 #>>45676049 #>>45677115 #

ACCount37 ◴[22 Oct 25 22:26 UTC] No.45675953[source]▶

>>45675879 #

People in the industry know that tokenizers suck and there's room to do better. But actually doing it better? At scale? Now that's hard.

replies(1): >>45676189 #

typpilol ◴[22 Oct 25 22:55 UTC] No.45676189[source]▶

>>45675953 #

It will require like 20x the compute

replies(3): >>45676906 #>>45676935 #>>45676964 #

1. Mehvix ◴[23 Oct 25 00:38 UTC] No.45676906[source]▶

>>45676189 #

Why do you suppose this is a compute limited problem?

replies(1): >>45677057 #

2. ACCount37 ◴[23 Oct 25 01:04 UTC] No.45677057[source]▶

>>45676906 (TP) #

It's kind of a shortcut answer by now. Especially for anything that touches pretraining.

"Why aren't we doing X?", where X is a thing that sounds sensible, seems like it would help, and does indeed help, and there's even a paper here proving that it helps.

The answer is: check the paper, it says there on page 12 in a throwaway line that they used 3 times as much compute for the new method as for the controls. And the gain was +4%.

A lot of promising things are resource hogs, and there are too many better things to burn the GPU-hours on.

replies(1): >>45679116 #

3. typpilol ◴[23 Oct 25 07:16 UTC] No.45679116[source]▶

>>45677057 #

Thanks.

Also, saying it needs 20x compute is exactly that. It's something we could do eventually, but not now.
