DeepSeek OCR (github.com)

990 points by pierre | 3 comments
rsp1984 ◴[] No.45644640[source]
Can someone ELI5 to me (someone who doesn't have the time to keep up with all the latest research) what this is and why it's a big deal?

It's very hard to guess from the GitHub repo and the paper. For example, there is OCR in the title, but the abstract and readme.md talk about context compression for LLMs, which I find confusing. Would somebody care to explain the link and provide some high-level context?

replies(1): >>45649955 #
1. intalentive ◴[] No.45649955[source]
Suppose you have an image with 1000 words in it, and suppose for simplicity that every word is 1 token. Then the image is “worth” 1000 tokens.

But under the hood, the image will have to be transformed into features / embeddings before it can be decoded into text. Suppose that the image gets processed into 100 “image tokens”, which are subsequently decoded into 1000 “text tokens”.

Now forget that we are even talking about images or OCR. If you look at just the decoding process, you find that 1000 tokens' worth of text was recovered from a representation 10x smaller.

The implication for LLMs is that we don’t need 1000 tokens and 1000 token embeddings to produce the 1001st token, if we can figure out how to compress them into a 10x smaller latent representation first.
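Shape-wise it looks something like this (a toy sketch with made-up numbers, and torch.randn standing in for the real encoder, not DeepSeek's actual code):

    import torch

    B, n_text, d = 1, 1000, 1024   # 1000 text tokens, model dim 1024
    k = 10                         # compression factor

    # a rendered page image goes into a vision encoder...
    page = torch.randn(B, 3, 1024, 1024)

    # ...which emits n_text // k latent "image tokens" (placeholder tensor here)
    image_tokens = torch.randn(B, n_text // k, d)   # shape (1, 100, 1024)

    # the decoder conditions on these 100 latents and still has to emit
    # all 1000 text tokens, i.e. a ~10x smaller context for the same content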

replies(1): >>45653785 #
2. rsp1984 ◴[] No.45653785[source]
Excellent, thanks. So basically this is saying: "our pixels-to-token encoding is so efficient (the information density in a set of "image tokens" is much higher than in a set of text tokens), why even bother representing text as text?"

Correct?

replies(1): >>45657811 #
3. intalentive ◴[] No.45657811[source]
Basically. Some people are even saying, hey, if you encode text as an image then you don’t need tokenizers any more, and you get more expressivity from the graphic styling.
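Purely illustrative sketch (PIL is my choice here, nothing from the repo): render the string to pixels and the vocabulary question disappears, because the model only ever sees patches.

    from PIL import Image, ImageDraw

    text = "\n".join(["The quick brown fox jumps over the lazy dog."] * 40)
    img = Image.new("RGB", (1024, 1024), "white")
    ImageDraw.Draw(img).multiline_text((10, 10), text, fill="black")

    patch = 16                        # ViT-style 16x16 patches
    n_patches = (1024 // patch) ** 2  # 4096 raw patches, before any compression
    # bold, italics, layout all arrive as pixels -- no BPE tokenizer involved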

Another takeaway is that you don’t need to pass a tensor of shape (batch_size, sequence_length, d_model) through your transformer. Not every token needs its own dedicated latent embedding. You can presumably get away with dividing sequence_length by a constant.
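Toy version of that, with average pooling standing in for whatever learned compressor is actually used:

    import torch
    import torch.nn.functional as F

    B, n, d, k = 1, 1000, 1024, 10
    ctx = torch.randn(B, n, d)   # the usual (batch_size, sequence_length, d_model)

    # squeeze the sequence axis by a constant factor k
    pooled = F.avg_pool1d(ctx.transpose(1, 2), kernel_size=k).transpose(1, 2)
    print(pooled.shape)          # torch.Size([1, 100, 1024])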

This isn’t super groundbreaking, but it does reinforce the validity of a middle ground between recurrent models, where context is compressed into a single “memory token”, and transformers, where context is uncompressed: 1 < n/k < n.
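(With the numbers from above: n = 1000 context tokens and k = 10 leaves n/k = 100 latents per step, which is more state than an RNN's single memory vector but far fewer than full attention over all 1000.)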