
200 points simonw | 3 comments
1. emporas No.45665955
The point of DeepSeek-OCR is not one-way image recognition and textual description of the pixels, but text compression into a generated image, and text restoration from that generated image. This video explains it well [1].

> From the paper: Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10×), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20×, the OCR accuracy still remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs.
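To make the ratio concrete (my numbers, not the paper's), a 10× ratio just means the rendered image costs a tenth of the tokens the raw text would:

    # Illustrative arithmetic only; the token counts are invented.
    text_tokens = 1000    # tokens the raw text would occupy
    vision_tokens = 100   # tokens the rendered image occupies
    print(text_tokens / vision_tokens)  # 10x -> ~97% decoding precision
    # at 20x (e.g. 2000 text tokens into 100 vision tokens), ~60%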

Its main purpose is to be a compression algorithm from text to image: throw away the text because it costs too many tokens, keep the image in the context window instead, generate some more text, and when text accumulates again, compress the new text into an image, and so on.
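Roughly this loop, as a sketch; render_to_image() and the token budget are hypothetical stand-ins, not DeepSeek-OCR's actual API:

    # Rough sketch of the rolling text-to-image compression loop above.
    # render_to_image() stands in for a DeepSeek-OCR-style renderer; it is
    # stubbed here so the snippet runs on its own.
    TEXT_BUDGET = 4096  # assumed cap on raw-text tokens kept in context

    def render_to_image(text: str) -> bytes:
        """Hypothetical: rasterize text into an image for the vision encoder."""
        return text.encode("utf-8")  # placeholder for real rendering

    def accumulate(images: list[bytes], text: str, new_text: str):
        """Append new output; when text overflows, swap it for an image."""
        text += new_text
        if len(text.split()) > TEXT_BUDGET:       # crude token estimate
            images.append(render_to_image(text))  # keep the image in context...
            text = ""                             # ...and discard the text
        return images, text

    # The context window then holds old text as images plus recent raw text.
    images, text = [], ""
    for chunk in ["chapter one ...", "chapter two ..."]:
        images, text = accumulate(images, text, chunk)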

The argument is that pictures store a lot more information than words; "A picture is worth a thousand words", after all. Chinese characters are pictograms, so the idea doesn't seem that strange, but I don't buy it.

I am doing some experiments on removing text as input for LLMs and replacing it with its summary, and I have already reduced the context window by 7 times. I am still figuring out the best way to achieve that, but 10 times is not far off. My experiments involve novel writing rather than general text, but simply replacing text with its summary works very well.
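The shape of the experiment, as a sketch; summarize() would be an LLM call in practice (stubbed here so it runs), and the 7× figure is just what I have been seeing:

    # Older chapters are swapped for summaries, recent ones stay verbatim.
    # summarize() would be an LLM call; the stub just truncates so this runs.

    def summarize(text: str, ratio: float = 7.0) -> str:
        """Hypothetical: return a summary ~ratio times shorter than text."""
        words = text.split()
        return " ".join(words[: max(1, int(len(words) / ratio))])

    def compact_context(chapters: list[str], keep_last: int = 1) -> str:
        """Replace all but the last keep_last chapters with their summaries."""
        old, recent = chapters[:-keep_last], chapters[-keep_last:]
        return "\n\n".join([summarize(c) for c in old] + recent)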

If an image is worth so many words, why not use it for programming after all? There we go, visual programming again!

[1] https://www.youtube.com/watch?v=YEZHU4LSUfU

replies(1): >>45666226 #
2. CaptainOfCoit No.45666226
> If an image is worth so many words, why not use it for programming after all? There we go, visual programming again!

I mean, some really dense languages basically do that already, like APL, which uses symbols we (non-mathematicians) rarely even see.

replies(1): >>45666767 #
3. emporas No.45666767
Combinators are math though. There is a section in the paper that covers graphs and charts, transforming them to text and then back to an image again. They claim 97% precision.

> within a 10× compression ratio, the model’s decoding precision can reach approximately 97%, which is a very promising result. In the future, it may be possible to achieve nearly 10× lossless contexts compression through text-to-image approaches.

Graphs and charts should be represented as math, i.e. text; that's what they are anyway. Even when they are displayed as images, it is much more economical to represent them as math.

The function f(x) = x can be rendered as an image of 10×10 pixels, 100×100 pixels, or arbitrarily many pixels.

A math function is worth infinite pictures.
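A toy illustration of the point: the textual definition stays a few bytes while the raster can grow without bound (plain Python, no plotting library assumed):

    # One short textual definition yields images at any resolution.
    f = lambda x: x  # the entire "source" of the function: ~15 bytes

    def rasterize(fn, n: int) -> list[list[int]]:
        """Draw y = fn(x), x in [0, 1], on an n x n grid of 0/1 pixels."""
        grid = [[0] * n for _ in range(n)]
        for col in range(n):
            x = col / (n - 1)
            row = round(fn(x) * (n - 1))
            grid[n - 1 - row][col] = 1  # flip so y increases upward
        return grid

    for n in (10, 100, 1000):
        rasterize(f, n)
        print(f"{n}x{n}: {n * n} pixels from the same ~15-byte definition")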