←back to thread

DeepSeek OCR

(github.com)
990 points pierre | 1 comments | | HN request time: 0s | source
Show context
krackers ◴[] No.45640720[source]
The paper is more interesting than just another VLM for OCR, they start talking about compression and stuff. E.g. there is this quote

>Our work represents an initial exploration into the boundaries of vision-text compression, investigating how many vision tokens are required to decode text tokens. The preliminary results are encouraging: DeepSeek-OCR achieves near-lossless OCR compression at approximately 10× ratios, while 20× compression still retains 60% accuracy.

(I guess you could say a picture token is worth 10 textual tokens...)

Could someone explain to a noob what the information-theoretic intuition is here? Why does this work, is it that text tokens are still too "granular"/repetitive and don't come close to the ideal entropy coding? Or is switching to vision tokens escaping the limitation of working "one word-ish at a time", allowing you to get closer to entropy (similar to the way that arithmetic encoding does compared to huffman codes)?

And then they start talking about handling long-context by literally(?) downscaling images, forming a correspondence between information loss in the textual domain and the image domain.

replies(7): >>45640731 #>>45641225 #>>45642325 #>>45642598 #>>45643765 #>>45645167 #>>45651976 #
miki123211 ◴[] No.45642598[source]
Text tokens are quantized and represent subword units, vision tokens only exist in the embedding space.

The way text tokenization works in LLMs is that you have a "lookup table" of (small) token ids to (large) vector embeddings. To pass text to the LLM, you split it at token boundaries, convert strings to token ids, and then construct the "context", a matrix where each row is a vector taken from that lookup table.

Transmitting text token sequences can be relatively efficient, you just transmit the token IDs themselves[1]. They're small integers (~100k possible token ids is typical for large models). Transmitting the actual embeddings matrix would be far less efficient, as embeddings often consist of thousands of floating point numbers.

Images are encoded differently. After some basic preprocessing, image data is passed straight to a neural- network-based image encoder. That encoder encodes the image into vectors, which are then appended to the context. There are no token ids, there's no lookup table, we go straight from image data to token embeddings.

This means transmitting image tokens cannot be done as efficiently, as you'd have to transmit the embeddings themselves. Even though an image is encoded in fewer tokens, the most efficient representation of those tokens takes more bytes.

You can think of a text token as an integer between 0 and n, which we know how to map to a vector. This means you have `n` possible choices of tokens. In contrast, an image token is an array of m floating point numbers (the vector itself), each of which can take on many possible values. This means the "token space" of vision tokens is actually much larger.

There's also the issue of patterns. Text tokens correspond directly to a contiguous span of UTF-8 bytes, and most tokenizers won't create tokens that span word boundaries. This means they can't encode global patterns efficiently. You can't have a "Hamlet's monologue" or "the text that follows is in Spanish" token.

replies(7): >>45642855 #>>45644597 #>>45644656 #>>45645879 #>>45646092 #>>45646666 #>>45676071 #
ttul ◴[] No.45644656[source]
This is a great summary. If you think about it a bit, text is an expanded representation of concepts meant for display on a two-dimensional surface that can then be read back by human eyes; our brains convert the two-dimensional information into concepts again.

So to me it’s not a surprise that you can transform the two-dimensional representation of the same information into concepts again without losing much.

The paper talks about using this approach to generate large amounts of LLM training data rapidly. That’s intriguing. It suggests that one of the best ways of training models on a wide variety of input data with very long context is to provide it with an image representation instead of text tokens.

replies(1): >>45645242 #
1. miki123211 ◴[] No.45645242{3}[source]
Text is actually one-dimensional, writing is two-dimensional.

To a pure LLM, characters 15 and 16 at line 1 are considered adjacent, but there's no relationship between character 15 of line 1 and character 15 of line 2.

A vision model (which considers text as squiggles, not UTF8 codepoints), such a relationship does exist.