    DeepSeek OCR

    (github.com)
    990 points pierre | 14 comments
    krackers ◴[] No.45640720[source]
    The paper is more interesting than just another VLM for OCR, they start talking about compression and stuff. E.g. there is this quote

    >Our work represents an initial exploration into the boundaries of vision-text compression, investigating how many vision tokens are required to decode text tokens. The preliminary results are encouraging: DeepSeek-OCR achieves near-lossless OCR compression at approximately 10× ratios, while 20× compression still retains 60% accuracy.

    (I guess you could say a picture token is worth 10 textual tokens...)

    Could someone explain to a noob what the information-theoretic intuition is here? Why does this work, is it that text tokens are still too "granular"/repetitive and don't come close to the ideal entropy coding? Or is switching to vision tokens escaping the limitation of working "one word-ish at a time", allowing you to get closer to entropy (similar to the way that arithmetic encoding does compared to huffman codes)?
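
    One way to make the Huffman-versus-arithmetic-coding comparison concrete is a rough sketch with a toy two-symbol source. The probabilities below are made up for illustration and have nothing to do with the paper. A Huffman code must spend at least one whole bit per symbol, while the Shannon entropy, which arithmetic coding can approach, is lower.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy source: two symbols with a skewed distribution (illustrative numbers).
probs = [0.9, 0.1]

# With only two symbols, any prefix code (Huffman included) gives each
# symbol a one-bit codeword, so the average cost is one bit per symbol.
huffman_bits_per_symbol = sum(p * 1 for p in probs)

entropy = entropy_bits(probs)
print(f"Huffman: {huffman_bits_per_symbol:.3f} bits/symbol")
print(f"Entropy: {entropy:.3f} bits/symbol")
```

    The gap between the two numbers is the cost of coding one symbol at a time with whole-bit codewords, which is the limitation the question is pointing at.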

    And then they start talking about handling long-context by literally(?) downscaling images, forming a correspondence between information loss in the textual domain and the image domain.

    replies(7): >>45640731 #>>45641225 #>>45642325 #>>45642598 #>>45643765 #>>45645167 #>>45651976 #
    1. miki123211 ◴[] No.45642598[source]
    Text tokens are quantized and represent subword units, whereas vision tokens only exist in the embedding space.

    The way text tokenization works in LLMs is that you have a "lookup table" of (small) token ids to (large) vector embeddings. To pass text to the LLM, you split it at token boundaries, convert strings to token ids, and then construct the "context", a matrix where each row is a vector taken from that lookup table.
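
    A minimal sketch of that lookup, assuming a made-up three-word vocabulary and whitespace splitting in place of a real subword tokenizer (the names `vocab`, `embedding_table`, and `build_context` are hypothetical):

```python
import random

# Hypothetical toy vocabulary; real tokenizers have ~100k subword entries.
vocab = {"the": 0, "cat": 1, "sat": 2}
embedding_dim = 4  # real models use thousands of dimensions

random.seed(0)
# Lookup table: one row (vector) per token id.
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)]
    for _ in range(len(vocab))
]

def build_context(text):
    """Split text into tokens, map them to ids, then gather embedding rows."""
    token_ids = [vocab[word] for word in text.split()]
    context = [embedding_table[i] for i in token_ids]
    return token_ids, context

token_ids, context = build_context("the cat sat the")
print(token_ids)
print(len(context), "rows x", len(context[0]), "columns")
```

    Note that both occurrences of "the" map to the same row of the table, which is what makes the token id a sufficient description of the embedding.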

    Transmitting text token sequences can be relatively efficient, you just transmit the token IDs themselves[1]. They're small integers (~100k possible token ids is typical for large models). Transmitting the actual embeddings matrix would be far less efficient, as embeddings often consist of thousands of floating point numbers.

    Images are encoded differently. After some basic preprocessing, image data is passed straight to a neural-network-based image encoder. That encoder encodes the image into vectors, which are then appended to the context. There are no token ids, there's no lookup table, we go straight from image data to token embeddings.

    This means transmitting image tokens cannot be done as efficiently, as you'd have to transmit the embeddings themselves. Even though an image is encoded in fewer tokens, the most efficient representation of those tokens takes more bytes.

    You can think of a text token as an integer between 0 and n, which we know how to map to a vector. This means you have `n` possible choices of tokens. In contrast, an image token is an array of m floating point numbers (the vector itself), each of which can take on many possible values. This means the "token space" of vision tokens is actually much larger.
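
    To put rough numbers on that, here is a back-of-the-envelope sketch. The vocabulary size comes from the ~100k figure above; the embedding width and the 16-bit float format are assumptions picked for illustration, not values from the paper.

```python
import math

vocab_size = 100_000   # "~100k possible token ids"
embedding_dim = 4096   # assumed; "thousands of floating point numbers"
bits_per_float = 16    # assumed half-precision storage

# A text token is just an index into the vocabulary.
bits_per_text_token = math.ceil(math.log2(vocab_size))

# A vision token has no id, so the whole vector has to be sent.
bits_per_vision_token = embedding_dim * bits_per_float

print(bits_per_text_token, "bits per text token id")
print(bits_per_vision_token, "bits per raw vision token")
print(10 * bits_per_text_token, "bits for the ten text tokens it might replace")
```

    Under these assumptions one vision token standing in for ten text tokens still costs far more bits on the wire, even though it takes up a tenth of the context positions.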

    There's also the issue of patterns. Text tokens correspond directly to a contiguous span of UTF-8 bytes, and most tokenizers won't create tokens that span word boundaries. This means they can't encode global patterns efficiently. You can't have a "Hamlet's monologue" or "the text that follows is in Spanish" token.

    replies(7): >>45642855 #>>45644597 #>>45644656 #>>45645879 #>>45646092 #>>45646666 #>>45676071 #
    2. rco8786 ◴[] No.45642855[source]
    Great explanation, thanks. I was surprised to hear that models still only work with ~100k tokens, but after giving it some thought it makes sense. There's only so many words/subword units that get used in any given language. The entropy comes from all the billions of different ways those subwords can be ordered.
    replies(3): >>45643568 #>>45644884 #>>45688010 #
    3. freeqaz ◴[] No.45643568[source]
    There is also a tradeoff between different vocabulary sizes (how many entries exist in the token -> embedding lookup table) that inform the current shape of tokenizers and LLMs. (Below is my semi-armchair stance, but you can read more in depth here[0][1].)

    If you tokenized at the character level ('a' -> embedding) then your vocabulary size would be small, but you'd have more tokens required to represent most content. (And attention cost scales non-linearly with context length, iirc, like n^2.) This would also be a bit more 'fuzzy' in terms of teaching the LLM to understand what a specific token should 'mean'. The letter 'a' appears in a _lot_ of different words, and it's more ambiguous for the LLM.

    On the flip side: What if you had one entry in the tokenizer's vocabulary for each word that existed? Well, it'd be far more than the ~100k entries used by popular LLMs, and that has some computational tradeoffs like when you calculate the probability of each 'next' token via softmax, you'd have to run that for each token, as well as increasing the size of certain layers within the LLM (more memory + compute required for each token, basically).
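
    The two extremes can be compared on a toy sentence. This is only a sketch: the sentence is made up, and real tokenizers sit somewhere in between with subword units.

```python
text = "the cat sat on the mat and the cat ran"

# Character-level: tiny vocabulary, long sequence.
char_tokens = list(text)
char_vocab = set(char_tokens)

# Word-level: short sequence, but the vocabulary grows with every new word.
word_tokens = text.split()
word_vocab = set(word_tokens)

print("chars:", len(char_vocab), "vocab entries,", len(char_tokens), "tokens")
print("words:", len(word_vocab), "vocab entries,", len(word_tokens), "tokens")
```

    On ten words the word vocabulary is still smaller than the character one, but it keeps growing with every new word in the corpus, while the character vocabulary stays near the size of the alphabet. Sequence length moves the other way.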

    Additionally, you run into a new problem: 'Rare Tokens'. Basically, if you have infinite tokens, you'll run into specific tokens that only appear a handful of times in the training data and the model is never able to fully imbue the tokens with enough meaning for them to _help_ the model during inference. (A specific example being somebody's username on the internet.)

    Fun fact: These rare tokens, often called 'Glitch Tokens'[2], have been used for all sorts of shenanigans[3] as humans learn to break these models. (This is my interest in this as somebody who works in AI security)

    As LLMs have improved, models have pushed towards the largest vocabulary they can get away with without hurting performance. This is about where my knowledge on the subject ends, but there have been many analyses done to try to compute the optimal vocabulary size. (See the links below)

    One area that I have been spending a lot of time thinking about is what Tokenization looks like if we start trying to represent 'higher order' concepts without using human vocabulary for them. One example being: Tokenizing on LLVM bytecode (to represent code more 'densely' than UTF-8) or directly against the final layers of state in a small LLM (trying to use a small LLM to 'grok' the meaning and hoist it into a more dense, almost compressed latent space that the large LLM can understand).

    It would be cool if Claude Code, when it's talking to the big, non-local model, was able to make an MCP call to a model running on your laptop to say 'hey, go through all of the code and give me the general vibe of each file, then append those tokens to the conversation'. It'd be a lot fewer tokens than just directly uploading all of the code, and it _feels_ like it would be better than uploading chunks of code based on regex like it does today...

    This immediately makes the model's inner state (even more) opaque to outside analysis though. e.g., like why using gRPC as the protocol for your JavaScript front-end sucks: Humans can't debug it anymore without other tooling. JSON is verbose as hell, but it's simple and I can debug my REST API with just network inspector. I don't need access to the underlying Protobuf files to understand what each byte means in my gRPC messages. That's a nice property to have when reviewing my ChatGPT logs too :P

    Exciting times!

    0: https://www.rohan-paul.com/p/tutorial-balancing-vocabulary-s...

    1: https://arxiv.org/html/2407.13623v1

    2: https://en.wikipedia.org/wiki/Glitch_token

    3: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldm...

    replies(1): >>45644350 #
    4. rco8786 ◴[] No.45644350{3}[source]
    Again, super interesting thanks!

    > One area that I have been spending a lot of time thinking about is what Tokenization looks like if we start trying to represent 'higher order' concepts without using human vocabulary for them. One example being: Tokenizing on LLVM bytecode (to represent code more 'densely' than UTF-8)

    I've had similar ideas in the past. High level languages that humans write are designed for humans. What does an "LLM native" programming language look like? And, to your point about protobufs vs JSON, how does a human debug it when the LLM gets stuck?

    > It would be cool if Claude Code, when it's talking to the big, non-local model, was able to make an MCP call to a model running on your laptop to say 'hey, go through all of the code and give me the general vibe of each file, then append those tokens to the conversation'. It'd be a lot fewer tokens than just directly uploading all of the code, and it _feels_ like it would be better than uploading chunks of code based on regex like it does today...

    That's basically the strategy for Claude's new "Skills" feature, just in a more dynamic/AI driven way. Claude will do semantic search through YAML frontmatter to determine what skill might be useful in a given context, then load that entire skill file into context to execute it. Your idea here is similar, use a small local model to summarize each file (basically dynamically generate that YAML front matter), feed those into the larger model's context, and then it can choose which file(s) it cares about based on that.

    5. jph00 ◴[] No.45644597[source]
    Actually there are VAEs which use a codebook approach to creating discrete tokens instead of float vectors. There has been some success in that direction in diffusion models for instance.
    6. ttul ◴[] No.45644656[source]
    This is a great summary. If you think about it a bit, text is an expanded representation of concepts meant for display on a two-dimensional surface that can then be read back by human eyes; our brains convert the two-dimensional information into concepts again.

    So to me it’s not a surprise that you can transform the two-dimensional representation of the same information into concepts again without losing much.

    The paper talks about using this approach to generate large amounts of LLM training data rapidly. That’s intriguing. It suggests that one of the best ways of training models on a wide variety of input data with very long context is to provide it with an image representation instead of text tokens.

    replies(1): >>45645242 #
    7. jerf ◴[] No.45644884[source]
    Textual language is really, really amazing if you sit down and think about what it does versus the resources it consumes to do it.

    It's a common pastime for programmers to claim that our textual programming languages are just terrible and need to be replaced somehow with something visual, but I think this very often comes from a place of not understanding just how amazing textual languages are. Not that they couldn't possibly be improved by something in at least some domains, and there are after all some successful niches for visual languages, but I think if you set out to wholesale replace textual languages without an understanding of and appreciation for the impressive nature of the competition they offer you're setting yourself up to fail.

    replies(1): >>45651557 #
    8. miki123211 ◴[] No.45645242[source]
    Text is actually one-dimensional, writing is two-dimensional.

    To a pure LLM, characters 15 and 16 at line 1 are considered adjacent, but there's no relationship between character 15 of line 1 and character 15 of line 2.

    To a vision model (which considers text as squiggles, not UTF-8 code points), such a relationship does exist.
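
    A small sketch of the difference, working in characters rather than tokens for simplicity (the two lines of text and the helper `flat_index` are made up for illustration):

```python
lines = ["first line of text", "second line here"]
text = "\n".join(lines)  # what a text model sees: one flat sequence

def flat_index(line_no, col):
    """Offset of (line, column) in the flattened string; both are 0-based."""
    return sum(len(line) + 1 for line in lines[:line_no]) + col

a = flat_index(0, 14)  # character 15 of line 1
b = flat_index(0, 15)  # character 16 of line 1
c = flat_index(1, 14)  # character 15 of line 2

print("same line, next column:", b - a, "position apart")
print("next line, same column:", c - a, "positions apart")
```

    In the flat sequence the vertical neighbour is a whole line length away, and that distance changes with the length of the line. On the rendered page it is always exactly one row down.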

    9. isaacfung ◴[] No.45645879[source]
    Some models use vector quantized variational autoencoders to discretize images into sequences of discrete symbols from a fixed codebook.
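
    The quantization step at the heart of that approach can be sketched as a nearest-neighbour lookup. The tiny 2-D codebook and the patch embeddings below are hypothetical; in a real VQ-VAE both the encoder and the codebook are learned.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))

# Hypothetical 2-D codebook with four entries.
codebook = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]

# Pretend these are continuous encoder outputs for three image patches.
patch_embeddings = [(0.1, 0.9), (0.8, 0.2), (0.9, 0.95)]

# Each continuous vector is replaced by the id of its nearest codebook entry.
token_ids = [nearest_code(v, codebook) for v in patch_embeddings]
print(token_ids)
```

    After this step the image is a sequence of small integers, just like text token ids, at the cost of whatever detail was lost by snapping to the codebook.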

    https://grok.com/share/bGVnYWN5LWNvcHk%3D_572b4955-6265-4210...

    10. lubesGordi ◴[] No.45646092[source]
    So in terms of OCR, does the neural network 'map' the words into an embedding directly, or is it getting a bunch of words like "Hamlet's monologue" and mapping that to an embedding? Basically what I'm asking is if the neural network image encoder is essentially doing OCR 'internally' when it is coming up with the embedding (if that makes any sense).
    11. storus ◴[] No.45646666[source]
    That's not really true, the latest autoregressive image models create a codebook of patches that are then encoded as tokens and image is assembled out of them.
    12. mbando ◴[] No.45651557{3}[source]
    This also touches on the contrast between how human beings and LLMs trade compression for nuance. Human beings have enormous resources devoted to long-tailed distribution of information, for example in lexical items. Word distributions follow Zipf's Law, so in the million-word FROWN corpus, roughly half the words only occur one time. Like when's the last time you used the word chrysanthemum, or corpulent? But did you have any difficulty recognizing them? So while human beings have limited scale compared to machines, we do have an enormous capacity for nuanced communication and conception.
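
    Counting words that occur exactly once is easy to sketch. The sentence below is a made-up stand-in; checking the "roughly half" figure would need the actual corpus.

```python
from collections import Counter

# Tiny made-up sample; a real check would use a corpus such as FROWN.
sample = "the cat saw the dog and the dog saw a corpulent chrysanthemum"
counts = Counter(sample.split())

# Words that occur exactly once in the sample.
once_only = [word for word, count in counts.items() if count == 1]
print(len(once_only), "of", len(counts), "distinct words occur once")
```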

    Whereas LLMs make the opposite trade-off. There are information-theoretic limits on the amount of information LMs can store (roughly 3.6 bits per parameter) so they aggressively compress information and trade away nuance (https://arxiv.org/abs/2505.17117).
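
    As a rough sense of scale, taking the 3.6 bits per parameter figure at face value and assuming a hypothetical 7-billion-parameter model:

```python
bits_per_parameter = 3.6        # figure quoted above
num_parameters = 7_000_000_000  # hypothetical 7B-parameter model

capacity_bits = bits_per_parameter * num_parameters
capacity_gigabytes = capacity_bits / 8 / 1e9
print(f"{capacity_gigabytes:.2f} GB")
```

    That is a few gigabytes of storable information, far less than the training text such a model typically sees, which is why heavy compression is unavoidable.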

    13. krackers ◴[] No.45676071[source]
    Thank you, this makes sense! As [1] puts it pithily

    >Image-patch tokens make better use of the high-dimensional embedding space than text tokens do.

    That seems to imply it's not necessarily something unique about images, just a byproduct of having better conversion from "raw input -> embeddings" [2]. Although there is a certain elegance of handling both images and text with the same method.

    [1] https://twitter.com/c0mbinat0r/status/1980698103234891892 [2] https://twitter.com/Kangwook_Lee/status/1980709454522744902

    14. davidguetta ◴[] No.45688010[source]
    There's almost an infinity of chess games possible from barely 32 pieces and simple moves.

    Already at move 3 you have billions of positions possible