DeepSeek OCR

(github.com)

990 points pierre | 1 comments | 20 Oct 25 06:26 UTC | HN request time: 0.275s | source

Show context

krackers ◴[20 Oct 25 06:57 UTC] No.45640720[source]▶

The paper is more interesting than just another VLM for OCR, they start talking about compression and stuff. E.g. there is this quote

>Our work represents an initial exploration into the boundaries of vision-text compression, investigating how many vision tokens are required to decode text tokens. The preliminary results are encouraging: DeepSeek-OCR achieves near-lossless OCR compression at approximately 10× ratios, while 20× compression still retains 60% accuracy.

(I guess you could say a picture token is worth 10 textual tokens...)

Could someone explain to a noob what the information-theoretic intuition is here? Why does this work, is it that text tokens are still too "granular"/repetitive and don't come close to the ideal entropy coding? Or is switching to vision tokens escaping the limitation of working "one word-ish at a time", allowing you to get closer to entropy (similar to the way that arithmetic encoding does compared to huffman codes)?

And then they start talking about handling long-context by literally(?) downscaling images, forming a correspondence between information loss in the textual domain and the image domain.

replies(7): >>45640731 #>>45641225 #>>45642325 #>>45642598 #>>45643765 #>>45645167 #>>45651976 #

miki123211 ◴[20 Oct 25 11:13 UTC] No.45642598[source]▶

>>45640720 #

Text tokens are quantized and represent subword units, vision tokens only exist in the embedding space.

The way text tokenization works in LLMs is that you have a "lookup table" of (small) token ids to (large) vector embeddings. To pass text to the LLM, you split it at token boundaries, convert strings to token ids, and then construct the "context", a matrix where each row is a vector taken from that lookup table.

Transmitting text token sequences can be relatively efficient, you just transmit the token IDs themselves[1]. They're small integers (~100k possible token ids is typical for large models). Transmitting the actual embeddings matrix would be far less efficient, as embeddings often consist of thousands of floating point numbers.

Images are encoded differently. After some basic preprocessing, image data is passed straight to a neural- network-based image encoder. That encoder encodes the image into vectors, which are then appended to the context. There are no token ids, there's no lookup table, we go straight from image data to token embeddings.

This means transmitting image tokens cannot be done as efficiently, as you'd have to transmit the embeddings themselves. Even though an image is encoded in fewer tokens, the most efficient representation of those tokens takes more bytes.

You can think of a text token as an integer between 0 and n, which we know how to map to a vector. This means you have `n` possible choices of tokens. In contrast, an image token is an array of m floating point numbers (the vector itself), each of which can take on many possible values. This means the "token space" of vision tokens is actually much larger.

There's also the issue of patterns. Text tokens correspond directly to a contiguous span of UTF-8 bytes, and most tokenizers won't create tokens that span word boundaries. This means they can't encode global patterns efficiently. You can't have a "Hamlet's monologue" or "the text that follows is in Spanish" token.

replies(7): >>45642855 #>>45644597 #>>45644656 #>>45645879 #>>45646092 #>>45646666 #>>45676071 #

rco8786 ◴[20 Oct 25 11:50 UTC] No.45642855[source]▶

>>45642598 #

Great explanation, thanks. I was surprised to hear that models still only work with ~100k tokens, but after giving it some thought it makes sense. There's only so many words/subword units that get used in any given language. The entropy comes from all the billions of different ways those subwords can be ordered.

replies(3): >>45643568 #>>45644884 #>>45688010 #

jerf ◴[20 Oct 25 15:14 UTC] No.45644884[source]▶

>>45642855 #

Textual language is really, really amazing if you sit down and think about what it does versus the resources it consumes to do it.

It's a common pasttime for programmers to claim that our textual programming languages are just terrible and need to be replaced somehow with something visual, but I think this very often comes from a place of not understanding just how amazing textual languages are. Not they couldn't possibly be improved by something in at least some domains, and there are after all some successful niches for visual languages, but I think if you set out to wholesale replace textual languages without an understanding of and appreciation for the impressive nature of the competition they offer you're setting yourself up to fail.

replies(1): >>45651557 #

1. mbando ◴[21 Oct 25 01:43 UTC] No.45651557[source]▶

>>45644884 #

This also touches on the contrast between how human beings and LLM's trade compression for nuance. Human beings have enormous resources devoted to long-tailed distribution of information, for example in lexical items. Word distributions follow Zipf's Law, so like in the million word FROWN corpus, roughly half the words only occur one time. Like when's the last time you use the word chrysanthemum, or corpulent? But did you have any difficulty recognizing them? So while human beings have limited scale compared to machines, we do have an enormous capacity for nuanced, communication and conception.

Whereas LLM's make the opposite trade-off. There are information centric theory limitations on the amount of information LM's can store (roughly 3.6 bits per parameter) so they aggressively compress information and trade away nuance (https://arxiv.org/abs/2505.17117).

↑