←back to thread

DeepSeek OCR

(github.com)
990 points pierre | 1 comments | | HN request time: 0.304s | source
Show context
krackers ◴[] No.45640720[source]
The paper is more interesting than just another VLM for OCR, they start talking about compression and stuff. E.g. there is this quote

>Our work represents an initial exploration into the boundaries of vision-text compression, investigating how many vision tokens are required to decode text tokens. The preliminary results are encouraging: DeepSeek-OCR achieves near-lossless OCR compression at approximately 10× ratios, while 20× compression still retains 60% accuracy.

(I guess you could say a picture token is worth 10 textual tokens...)

Could someone explain to a noob what the information-theoretic intuition is here? Why does this work, is it that text tokens are still too "granular"/repetitive and don't come close to the ideal entropy coding? Or is switching to vision tokens escaping the limitation of working "one word-ish at a time", allowing you to get closer to entropy (similar to the way that arithmetic encoding does compared to huffman codes)?

And then they start talking about handling long-context by literally(?) downscaling images, forming a correspondence between information loss in the textual domain and the image domain.

replies(7): >>45640731 #>>45641225 #>>45642325 #>>45642598 #>>45643765 #>>45645167 #>>45651976 #
1. HarHarVeryFunny ◴[] No.45643765[source]
I don't know if there is any common practice among multi-modal input "LLM"s as to how they encode image inputs - convert them into "vision tokens", but it's basically going to come down to splitting the image into a grid of regions and encoding those.

I'm not sure there's any information theoretic intuition to be had with DeepSeek's experiments - it seems to be more about what's the lowest resolution image resolution/grid you can get away with and still capture enough image detail to be able to accurately perform OCR on it.

It'd be cool if Karpathy would extend his NanoChat to be multi-modal to spread the knowledge of how this is typically done.