
DeepSeek OCR

(github.com)
990 points pierre | 4 comments
krackers ◴[] No.45640720[source]
The paper is more interesting than just another VLM for OCR; they start talking about compression and related ideas. E.g. there is this quote:

>Our work represents an initial exploration into the boundaries of vision-text compression, investigating how many vision tokens are required to decode text tokens. The preliminary results are encouraging: DeepSeek-OCR achieves near-lossless OCR compression at approximately 10× ratios, while 20× compression still retains 60% accuracy.

(I guess you could say a picture token is worth 10 textual tokens...)

Could someone explain to a noob what the information-theoretic intuition is here? Why does this work? Is it that text tokens are still too "granular"/repetitive and don't come close to the ideal entropy coding? Or is switching to vision tokens escaping the limitation of working "one word-ish at a time", allowing you to get closer to the entropy (similar to the way that arithmetic coding does compared to Huffman codes)?
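To make the arithmetic-vs-Huffman comparison concrete, here is a minimal sketch. The two-symbol source and its probabilities are made up for illustration and have nothing to do with the paper's actual tokenizers; it only shows the gap between a code that spends a whole number of bits per symbol and the entropy bound.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits per symbol for a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical, heavily skewed two-symbol source (an assumption for illustration).
probs = [0.9, 0.1]
h = entropy_bits(probs)

# A Huffman code gives each symbol a whole number of bits, so with only two
# symbols it cannot go below 1 bit per symbol, however skewed the source is.
huffman_bits_per_symbol = 1.0

# Arithmetic coding encodes a whole sequence jointly and can approach h.
print(f"entropy: {h:.3f} bits/symbol, Huffman: {huffman_bits_per_symbol} bits/symbol")
```

Here the entropy is roughly 0.47 bits per symbol, so the symbol-at-a-time code wastes about half of every bit it emits. Whether text tokens versus vision tokens behave analogously is exactly the open question above.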

And then they start talking about handling long-context by literally(?) downscaling images, forming a correspondence between information loss in the textual domain and the image domain.

replies(7): >>45640731 #>>45641225 #>>45642325 #>>45642598 #>>45643765 #>>45645167 #>>45651976 #
looobay ◴[] No.45640731[source]
LLMs are compute-heavy: attention cost scales quadratically with the number of tokens. They are trying to compress text tokens into visual tokens with their VLM.

Maybe they would render text to an image before tokenizing to reduce the compute cost.
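As a back-of-the-envelope sketch of why fewer tokens matters: the function name and the document length below are made up, only the 10× ratio comes from the quote upthread, and the estimate ignores the cost of the vision encoder and every part of the model that scales linearly.

```python
def attention_pair_count(num_tokens: int) -> int:
    """Query-key pairs scored by full self-attention: n * n."""
    return num_tokens * num_tokens

text_tokens = 10_000                 # hypothetical document length
vision_tokens = text_tokens // 10    # the ~10x ratio quoted from the paper

# Ratio of attention work before and after compression.
reduction = attention_pair_count(text_tokens) / attention_pair_count(vision_tokens)
print(reduction)  # 100.0
```

So a 10× cut in token count would mean about 100× fewer attention pairs, under those assumptions.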

replies(1): >>45640755 #
krackers ◴[] No.45640755[source]
But naively, wouldn't you expect the representation of a piece of text in terms of vision tokens to take roughly the same number of bits (or more) as the representation in text tokens? You're changing representation, sure, but that by itself doesn't give you any compute advantage unless there is some sparsity/compressibility you can exploit in the domain you transform to, right?

So I guess my question is: where is the juice being squeezed from? Why does the vision token representation end up being more efficient than text tokens?

replies(6): >>45640784 #>>45640804 #>>45640859 #>>45641233 #>>45641253 #>>45645668 #
1. f33d5173 ◴[] No.45640859[source]
Vision is how humans see text. So text must have built-in adaptations to protect against visual noise. For example, two words that look similar must never appear in similar contexts, or else they would be conflated. Hence we can safely reduce such words to the same token. Or something like that.
replies(2): >>45642499 #>>45642910 #
2. ffsm8 ◴[] No.45642499[source]
Is that really true?

Lots of words have multiple meanings and can mean different things even when used in the same sentence/context, depending on how the person reading interprets them.

Heck, I'd argue that most (not all) day-job conflicts come down to such differences in interpretation/miscommunications.

3. fxtentacle ◴[] No.45642910[source]
That also works purely on text, and it's the trick I used in my German speech recognition engine ( https://arxiv.org/abs/2206.12693 ).

"I'm studying at Oxford Univ" has basically no loss in meaning even though "University" was truncated to less than half its characters.

replies(1): >>45645348 #
4. UltraSane ◴[] No.45645348[source]
This is like how many CLIs accept the shortest unique version of commands.
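A rough sketch of that kind of prefix matching. The `resolve` helper and the command list are hypothetical and not taken from any particular CLI.

```python
def resolve(prefix, commands):
    """Return the single command that `prefix` identifies, or raise ValueError."""
    if prefix in commands:
        return prefix  # an exact match always wins
    matches = [c for c in commands if c.startswith(prefix)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"unknown command: {prefix!r}")
    raise ValueError(f"ambiguous command {prefix!r}: {sorted(matches)}")

COMMANDS = ["status", "stash", "commit"]  # made-up command set

print(resolve("c", COMMANDS))     # commit
print(resolve("stat", COMMANDS))  # status
```

With this command set, `resolve("sta", COMMANDS)` raises, because both "status" and "stash" start with it. The truncation only works while the remaining characters still pick out one candidate.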