
DeepSeek OCR (github.com)
990 points by pierre | 3 comments
krackers ◴[] No.45640720[source]
The paper is more interesting than just another VLM for OCR; they get into compression and related ideas. E.g. there is this quote:

>Our work represents an initial exploration into the boundaries of vision-text compression, investigating how many vision tokens are required to decode text tokens. The preliminary results are encouraging: DeepSeek-OCR achieves near-lossless OCR compression at approximately 10× ratios, while 20× compression still retains 60% accuracy.

(I guess you could say a picture token is worth 10 textual tokens...)

Could someone explain to a noob what the information-theoretic intuition is here? Why does this work? Is it that text tokens are still too "granular"/repetitive and don't come close to ideal entropy coding? Or is switching to vision tokens escaping the limitation of working "one word-ish at a time", allowing you to get closer to entropy (similar to the way arithmetic coding does compared to Huffman codes)?
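To make the Huffman vs. arithmetic coding part of the analogy concrete, here's a toy calculation (nothing from the paper; the skewed source is made up): a symbol-at-a-time Huffman code can never spend less than 1 bit per symbol, while an arithmetic coder, which codes the whole stream at once, can approach the source entropy.

    import math

    # Skewed binary source: one symbol appears 95% of the time (arbitrary choice).
    p = 0.95
    entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    print(f"source entropy     : {entropy:.3f} bits/symbol")  # ~0.286
    print(f"per-symbol Huffman : 1.000 bits/symbol")          # can't go below 1 bit/symbol
    print(f"gap to the limit   : {1 / entropy:.1f}x")         # ~3.5x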

And then they start talking about handling long-context by literally(?) downscaling images, forming a correspondence between information loss in the textual domain and the image domain.

replies(7): >>45640731 #>>45641225 #>>45642325 #>>45642598 #>>45643765 #>>45645167 #>>45651976 #
looobay ◴[] No.45640731[source]
LLMs are compute heavy, with compute scaling quadratically in the number of tokens. They are trying to compress text tokens into vision tokens with their VLM.

Maybe they would render the text to an image before tokenizing to reduce the compute cost.
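Back-of-envelope on the quadratic part (the 10x is the compression ratio quoted in the paper; the document length is just an example, not anything measured):

    n_text = 100_000          # text tokens in a long document (made-up example)
    n_vision = n_text // 10   # ~10x vision-token compression quoted in the paper

    # Self-attention cost grows with the square of the sequence length,
    # so a 10x shorter sequence means ~100x less attention compute.
    print((n_text ** 2) / (n_vision ** 2))  # 100.0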

replies(1): >>45640755 #
krackers ◴[] No.45640755[source]
But naively wouldn't you expect the representation of a piece of text in terms of vision tokens to be roughly the same number of bits as (or more than) its representation as text tokens? You're changing the representation, sure, but that by itself doesn't give you any compute advantage unless there is some sparsity/compressibility you can take advantage of in the domain you transform to, right?

So I guess my question is: where is the juice being squeezed from? Why does the vision-token representation end up being more efficient than text tokens?
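To put rough numbers on why I find this confusing (all back-of-envelope, nothing from the paper): the raw pixels of a patch are far bigger than a token id, so any saving has to come from the encoder squeezing each patch into a single embedding vector, which is the same per-token cost a text token pays inside the transformer.

    import math

    patch_bits = 16 * 16 * 3 * 8        # raw RGB bits in one 16x16 patch = 6144
    token_id_bits = math.log2(100_000)  # ~17 bits for a ~100k-vocab token id (rough)

    # A patch carries ~370x more raw bits than a token id, yet inside the model
    # each token costs one d-dimensional vector either way -- so the question
    # is really about token *count*, not raw bits.
    print(patch_bits / token_id_bits)   # ~370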

replies(6): >>45640784 #>>45640804 #>>45640859 #>>45641233 #>>45641253 #>>45645668 #
1. looobay ◴[] No.45640784[source]
Vision tokens are a good compression medium because one vision token is a single vector of N elements, whereas the same content takes M vectors of N elements as textual tokens; one vision token represents many pixels (and possibly multiple words). That is why it is a good compression medium for compute.

It will never be as precise as textual tokens, but it can be really good, as they show in the paper.

replies(1): >>45640929 #
2. krackers ◴[] No.45640929[source]
>with one vision token you have one vector of N elements, but with textual tokens you have M vectors of N elements

Each vision token represents a 16x16 patch, but to fully cover a word you might need multiple vision tokens. So assuming the embedding size of vision tokens and text tokens is the same `d` (which I think has to be the case for multimodal models), wouldn't the fair comparison be `x * d` elements for a sentence in terms of vision tokens and `y * d` for the same sentence in terms of text tokens? I don't see how you could tell a priori that x << y (especially by a factor of 10, as quoted in the paper).

That said, if I do try this experimentally by shrinking this very comment down to the smallest font size I can still read and then counting how many 16x16 patches it covers, more text fits in each "vision token" than I expected. So I can maybe buy that x is at least not greater than y. But it can't be as simple as "each vision token can cover more text", since that only enables better compression if the encoder can actually uncover some sort of redundancy within each token. (And presumably the kind of redundancy it uncovers isn't something that "classical" compression techniques can exploit, otherwise it seems like it would have been tried by now?)
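Rough numbers for that experiment (the glyph dimensions below are guesses for the smallest font I can still read, not measurements):

    patch = 16                 # ViT-style patch edge, in pixels
    glyph_w, line_h = 5, 10    # assumed average glyph width / line height in px

    chars_per_patch = (patch / glyph_w) * (patch / line_h)  # ~5 characters
    chars_per_text_token = 4                                # common BPE rule of thumb

    # Same ballpark as a text token, so covering more text per token can't by
    # itself explain a 10x ratio -- the squeeze has to come from the encoder.
    print(chars_per_patch, chars_per_text_token)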

replies(1): >>45641026 #
3. looobay ◴[] No.45641026[source]
You should read page 6 of the paper (and page 5 for the architecture breakdown); they show that they compress the vision tokens with convolutions to keep strong semantic understanding while keeping the number of tokens small.

But I think it's still experimental.
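For intuition, here is a minimal sketch of that general idea: downsample a grid of patch tokens with strided convolutions before they reach the language model. The layer shapes and the 16x factor are illustrative, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class TokenCompressor(nn.Module):
        def __init__(self, dim: int = 1024):
            super().__init__()
            # Two stride-2 convs: each halves height and width, so the token
            # count (H*W) drops 4x per layer, 16x in total.
            self.net = nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
                nn.GELU(),
                nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
            )

        def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
            # tokens: (batch, h*w, dim) patch embeddings laid out as an h x w grid
            b, n, d = tokens.shape
            grid = tokens.transpose(1, 2).reshape(b, d, h, w)
            grid = self.net(grid)                      # (b, d, h/4, w/4)
            return grid.flatten(2).transpose(1, 2)     # (b, h*w/16, d)

    # A 64x64 grid of patch tokens (4096) compresses to 256 tokens.
    x = torch.randn(1, 64 * 64, 1024)
    print(TokenCompressor()(x, 64, 64).shape)  # torch.Size([1, 256, 1024])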