Karpathy on DeepSeek-OCR paper: Are pixels better inputs to LLMs than text?

(twitter.com)

237 points JnBrymn | 2 comments | 21 Oct 25 17:43 UTC

https://xcancel.com/karpathy/status/1980397031542989305

sabareesh ◴[22 Oct 25 22:18 UTC] No.45675879[source]▶

It might be that our current tokenization is inefficient compared to how well the image pipeline does. Language already does a lot of compression, but there might be an even better way to represent it in latent space.

replies(3): >>45675953 #>>45676049 #>>45677115 #

ACCount37 ◴[22 Oct 25 22:26 UTC] No.45675953[source]▶

>>45675879 #

People in the industry know that tokenizers suck and there's room to do better. But actually doing it better? At scale? Now that's hard.

replies(1): >>45676189 #

typpilol ◴[22 Oct 25 22:55 UTC] No.45676189[source]▶

>>45675953 #

It will require like 20x the compute

replies(3): >>45676906 #>>45676935 #>>45676964 #

1. kenjackson ◴[23 Oct 25 00:48 UTC] No.45676964[source]▶

>>45676189 #

Why so much compute? Can you tie it to the problem?

replies(1): >>45679126 #

2. typpilol ◴[23 Oct 25 07:18 UTC] No.45679126[source]▶

>>45676964 (TP) #

Tokenizers are the reason LLMs are even possible to run at a decent speed on our best hardware.

Removing the tokenizer would cut the effective context to 1/4 and 4x the compute and memory, assuming an average token length of 4 characters.

Also, you would probably need to 4x the parameters, since the model has to learn relationships between individual characters as well as between words, sentences, etc.

There have been a few studies on small models, and even those only show a tiny percentage gain over tokenized models.

So essentially you would need 4x compute, 1/4 the context, and 4x the parameters to squeeze 2-4% more performance out of it.

And that fails when you use more than 1/4 of the context. So realistically you need to support the same context, so your compute goes up another 4x to 16x.
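The multipliers above can be worked through as a rough sketch in Python. The function name and dictionary keys are made up for illustration, and the numbers are only the back-of-envelope estimates from this comment (including the 4x parameter guess), not measurements.

```python
from fractions import Fraction


def char_level_multipliers(avg_token_len=4):
    """Rough multipliers for a character-level model relative to a
    tokenized one, following the estimates in the comment.

    avg_token_len is the assumed average number of characters per token.
    """
    k = avg_token_len
    return {
        # The same window of positions holds 1/k as much text.
        "effective_context": Fraction(1, k),
        # The same text takes k times as many positions to process.
        "compute_same_text": k,
        # The comment's guess for extra capacity, not a derived quantity.
        "parameters": k,
        # Widening the window k-fold to cover the same text: another factor of k.
        "compute_same_context": k * k,
    }


if __name__ == "__main__":
    for name, factor in char_level_multipliers().items():
        print(f"{name}: {factor}x")
```

With the assumed 4 characters per token, the last entry comes out to 16, which is the figure quoted above. A different average token length changes every multiplier accordingly.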

That's why
