DeepSeek OCR (github.com)

990 points by pierre | 3 comments
rsp1984 ◴[] No.45644640[source]
Can someone ELI5 to me (someone who doesn't have the time to keep up with all the latest research) what this is and why it's a big deal?

It's very hard to guess from the GitHub repo and the paper. For example, there is OCR in the title, but the abstract and readme.md talk about context compression for LLMs, which I find confusing. Would somebody care to explain the link and provide some high-level context?

replies(1): >>45649955 #
1. intalentive ◴[] No.45649955[source]
Suppose you have an image with 1000 words in it, and suppose for simplicity that every word is 1 token. Then the image is “worth” 1000 tokens.

But under the hood, the image will have to be transformed into features / embeddings before it can be decoded into text. Suppose that the image gets processed into 100 “image tokens”, which are subsequently decoded into 1000 “text tokens”.

Now forget that we are even talking about images or OCR. If you look at just the decoding process, you find that 1000 tokens' worth of text was recovered from a representation 10x smaller.

The implication for LLMs is that we don’t need 1000 tokens and 1000 token embeddings to produce the 1001st token, if we can figure out how to compress them into a 10x smaller latent representation first.
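Shape-wise it looks something like this (a toy sketch with made-up numbers, and torch.randn standing in for the real encoder, not DeepSeek's actual code):

    import torch

    B, n_text, d = 1, 1000, 1024   # 1000 text tokens, model dim 1024
    k = 10                         # compression factor

    # a rendered page image goes into a vision encoder...
    page = torch.randn(B, 3, 1024, 1024)

    # ...which emits n_text // k latent "image tokens" (placeholder tensor here)
    image_tokens = torch.randn(B, n_text // k, d)   # shape (1, 100, 1024)

    # the decoder conditions on these 100 latents and still has to emit
    # all 1000 text tokens, i.e. a ~10x smaller context for the same content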

replies(1): >>45653785 #
2. rsp1984 ◴[] No.45653785[source]
Excellent, thanks. So basically this is saying: "our pixels-to-token encoding is so efficient (the information density in a set of "image tokens" is much higher than in a set of text tokens), why even bother representing text as text?"

Correct?

replies(1): >>45657811 #
3. intalentive ◴[] No.45657811[source]
Basically. Some people are even saying, hey, if you encode text as an image then you don’t need tokenizers any more, and you get more expressivity from the graphic styling.
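Purely illustrative sketch (PIL is my choice here, nothing from the repo): render the string to pixels and the vocabulary question disappears, because the model only ever sees patches.

    from PIL import Image, ImageDraw

    text = "\n".join(["The quick brown fox jumps over the lazy dog."] * 40)
    img = Image.new("RGB", (1024, 1024), "white")
    ImageDraw.Draw(img).multiline_text((10, 10), text, fill="black")

    patch = 16                        # ViT-style 16x16 patches
    n_patches = (1024 // patch) ** 2  # 4096 raw patches, before any compression
    # bold, italics, layout all arrive as pixels -- no BPE tokenizer involved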

Another takeaway is that you don’t need to pass a tensor of shape (batch_size, sequence_length, d_model) through your transformer. Not every token needs its own dedicated latent embedding. You can presumably get away with dividing sequence_length by a constant.
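Toy version of that, with average pooling standing in for whatever learned compressor is actually used:

    import torch
    import torch.nn.functional as F

    B, n, d, k = 1, 1000, 1024, 10
    ctx = torch.randn(B, n, d)   # the usual (batch_size, sequence_length, d_model)

    # squeeze the sequence axis by a constant factor k
    pooled = F.avg_pool1d(ctx.transpose(1, 2), kernel_size=k).transpose(1, 2)
    print(pooled.shape)          # torch.Size([1, 100, 1024])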

This isn’t super groundbreaking, but it does reinforce the validity of a middle ground between recurrent models, where context is compressed into a single “memory token”, and transformers, where context is uncompressed: 1 < n/k < n.
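(With the numbers from above: n = 1000 context tokens and k = 10 leaves n/k = 100 latents per step, which is more state than an RNN's single memory vector but far fewer than full attention over all 1000.)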