233 points JnBrymn | 14 comments
1. sabareesh ◴[] No.45675879[source]
It might be that our current tokenization is inefficient compared to how well the image pipeline compresses. Language already does a lot of compression, but there might be an even better way to represent it in latent space.
replies(3): >>45675953 #>>45676049 #>>45677115 #
2. ACCount37 ◴[] No.45675953[source]
People in the industry know that tokenizers suck and there's room to do better. But actually doing it better? At scale? Now that's hard.
replies(1): >>45676189 #
3. CuriouslyC ◴[] No.45676049[source]
Image models use "larger" tokens. You can get this effect with text tokens if you use a larger token dictionary and generate common n-gram tokens, but the current LLM architecture isn't friendly to large output distributions.
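
As a rough sketch of the n-gram-dictionary idea (illustrative only; the corpus, thresholds, and greedy longest-match segmentation here are my own assumptions, not any particular tokenizer):

    from collections import Counter

    def build_ngram_vocab(corpus, base_vocab, max_n=4, min_count=100, budget=50_000):
        # Count character n-grams across the corpus and keep the frequent ones
        # as extra dictionary entries on top of a base (char/word-piece) vocab.
        counts = Counter()
        for text in corpus:
            for n in range(2, max_n + 1):
                for i in range(len(text) - n + 1):
                    counts[text[i:i + n]] += 1
        extra = [g for g, c in counts.most_common() if c >= min_count]
        return list(base_vocab) + extra[: budget - len(base_vocab)]

    def tokenize(text, vocab, max_n=4):
        # Greedy longest-match segmentation with the enlarged dictionary;
        # single characters always fall through, so nothing is untokenizable.
        vocab_set, out, i = set(vocab), [], 0
        while i < len(text):
            for n in range(max_n, 0, -1):
                piece = text[i:i + n]
                if piece in vocab_set or n == 1:
                    out.append(piece)
                    i += n
                    break
        return out

The price is the one named above: the output softmax now has to run over the enlarged dictionary at every generation step.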
replies(2): >>45677876 #>>45677936 #
4. typpilol ◴[] No.45676189[source]
It would require something like 20x the compute.
replies(3): >>45676906 #>>45676935 #>>45676964 #
5. Mehvix ◴[] No.45676906{3}[source]
Why do you suppose this is a compute-limited problem?
replies(1): >>45677057 #
6. ACCount37 ◴[] No.45676935{3}[source]
A lot of cool things are shot down by "it requires more compute, and by a lot, and we're already compute starved on any day of the week that ends in y, so, not worth it".

If we had a million times the compute? We might have brute forced our way to AGI by now.

replies(1): >>45676998 #
7. kenjackson ◴[] No.45676964{3}[source]
Why so much compute? Can you tie it to the problem?
replies(1): >>45679126 #
8. Jensson ◴[] No.45676998{4}[source]
But we don't have a million times the compute; we have the compute we have, so it's fair to argue that we want to prioritize other things.
9. ACCount37 ◴[] No.45677057{4}[source]
It's kind of a shortcut answer by now. Especially for anything that touches pretraining.

"Why aren't we doing X?", where X is a thing that sounds sensible, seems like it would help, and does indeed help, and there's even a paper here proving that it helps.

The answer is: check the paper; it says on page 12, in a throwaway line, that they used 3 times as much compute for the new method as for the controls. And the gain was +4%.

A lot of promising things are resource hogs, and there are too many better things to burn the GPU-hours on.

replies(1): >>45679116 #
10. ◴[] No.45677115[source]
11. mark_l_watson ◴[] No.45677876[source]
Interesting idea! Haven’t heard that before.
12. yorwba ◴[] No.45677936[source]
You don't have to use the same token dictionary for input and output. There are things like simultaneously predicting multiple tokens ahead, as an auxiliary loss and for speculative decoding, where the output is larger than the input; similarly, you could have a model where the input tokens combine multiple output tokens. You would still need to do a forward pass per output token during autoregressive generation, but prefill would require fewer passes and the KV cache would be smaller too, so it could still produce a decent speedup.
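
A minimal sketch of that asymmetric input/output setup (PyTorch-style; the module, shapes, and k are invented for illustration, not anyone's actual architecture):

    import torch.nn as nn

    class GroupedIO(nn.Module):
        # Input side embeds each group of k fine-grained tokens as one position;
        # output head predicts k tokens per position (multi-token prediction).
        def __init__(self, vocab_size, d_model, k=4):
            super().__init__()
            self.k = k
            self.embed = nn.Embedding(vocab_size, d_model)
            self.merge = nn.Linear(k * d_model, d_model)
            self.heads = nn.Linear(d_model, k * vocab_size)

        def embed_groups(self, ids):             # ids: (batch, seq), seq divisible by k
            b, s = ids.shape
            e = self.embed(ids).view(b, s // self.k, self.k * self.embed.embedding_dim)
            return self.merge(e)                 # (batch, seq/k, d_model): k-times fewer positions

        def predict(self, h):                    # h: trunk output, (batch, seq/k, d_model)
            b, t, _ = h.shape
            return self.heads(h).view(b, t, self.k, -1)   # logits for k tokens per position

The trunk then attends over seq/k positions, which is where the prefill and KV-cache savings would come from.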

But in the DeepSeek-OCR paper, compressing more text into the same number of visual input tokens leads to progressively worse output precision, so it's not a free lunch but a speed-quality tradeoff, and more fine-grained KV-cache compression methods might deliver better speedups without degrading the output as much.

13. typpilol ◴[] No.45679116{5}[source]
Thanks.

Also, saying it needs 20x the compute is exactly that: it's something we could do eventually, just not now.

14. typpilol ◴[] No.45679126{4}[source]
Tokenizers are the reason LLMs are even possible to run at a decent speed on our best hardware.

Removing the tokenizer would cut the effective context to about 1/4 and roughly 4x the compute and memory, assuming an average token length of ~4 characters.

Also, you would probably need to ~4x the parameters, since the model now has to learn relationships between individual characters as well as between words and sentences.

There have been a few studies on small tokenizer-free models, and even those only show a tiny percentage gain over tokenized models.

So essentially you would need 4x compute, 1/4 the context, and 4x the parameters to squeeze 2-4% more performance out of it.

And that falls apart when you need more than 1/4 of the context. Realistically you have to support the same context, and since attention is quadratic in sequence length, your compute goes up another 4x, to roughly 16x.

That's why.
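
Back-of-the-envelope version of those numbers (assuming ~4 characters per token and a quadratic attention cost, both rough assumptions):

    avg_chars_per_token = 4       # rough average for English text
    context_tokens = 8192         # whatever the tokenized model supports

    # Same sequence-length budget, character-level: 1/4 the effective context,
    # and ~4x the forward passes to generate the same amount of text.
    char_context = context_tokens // avg_chars_per_token       # 2048
    extra_passes = avg_chars_per_token                          # 4x

    # Keeping the SAME effective context means 4x longer sequences, and the
    # quadratic attention term then costs ~4^2 = 16x as much.
    attention_blowup = avg_chars_per_token ** 2                 # 16x

    print(char_context, extra_passes, attention_blowup)         # 2048 4 16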