> Maybe it makes more sense that all inputs to LLMs should only ever be images.
So, what, every time I want to ask an LLM a question I paint a picture? I mean at that point why not just say "all input to LLMs should be embeddings"?
Getting DeepSeek-OCR working on an Nvidia Spark via brute force with Claude Code - https://news.ycombinator.com/item?id=45646559 - Oct 2025 (43 comments)
DeepSeek OCR - https://news.ycombinator.com/item?id=45640594 - Oct 2025 (238 comments)
> Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in:
Italicized emphasis mine.
So he's suggesting (or at least wondering whether) the vision encoder should be the only input path to the LLM, with the model reading all text that way. There would then be a rasterization step on the text input to generate the image.
Thus you don't need to draw a picture; you'd just rasterize the text and feed the resulting image to the vision encoder.
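To make that concrete, here's a minimal sketch of that rasterization step, assuming Pillow is available; the font, canvas size, and layout are arbitrary illustrative choices, not anything taken from DeepSeek-OCR:

```python
from PIL import Image, ImageDraw, ImageFont

def rasterize_text(text: str, width: int = 1024, line_height: int = 20) -> Image.Image:
    """Render plain text onto a white canvas so a vision encoder can consume it."""
    lines = text.splitlines() or [""]
    height = line_height * (len(lines) + 1)
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # stand-in; a real pipeline would fix font, size, and DPI
    for i, line in enumerate(lines):
        draw.text((10, 10 + i * line_height), line, fill="black", font=font)
    return img

# e.g. rasterize_text("What is the capital of France?").save("prompt.png")
# and hand prompt.png to the vision encoder instead of tokenized text
```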
Sequential reading of text is very inefficient.
If we had a million times the compute? We might have brute forced our way to AGI by now.
You can also see this paper from the GLM team, where they explicitly test this assumption with some pretty good results.
"Why aren't we doing X?", where X is a thing that sounds sensible, seems like it would help, and does indeed help, and there's even a paper here proving that it helps.
The answer is: check the paper; it says on page 12, in a throwaway line, that they used three times the compute for the new method as for the controls. And the gain was +4%.
A lot of promising things are resource hogs, and there are too many better things to burn the GPU-hours on.
Is that crazy? I'm not buying that it is.
- https://addons.mozilla.org/en-US/firefox/addon/toxcancel/
- https://chromewebstore.google.com/detail/xcancelcom-redirect...
One thing I like about text tokens, though, is that the model learns some understanding of the text input method (particularly the QWERTY keyboard).
"Hello" and "Hwllo" are closer in semantic space than you'd think because "w" and "e" are next to each other.
This is much easier to see in hand-coded spelling models, where you can get better results by including a "keyboard distance" metric along with a string distance metric.
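A minimal sketch of that idea (hypothetical code, not from any particular spell checker): the substitution cost in a standard edit-distance recursion shrinks when the two keys are physically close on a QWERTY layout.

```python
from functools import lru_cache

# Row/column positions on a simplified QWERTY layout (illustrative only)
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {c: (r, col) for r, row in enumerate(QWERTY_ROWS) for col, c in enumerate(row)}

def key_distance(a: str, b: str) -> float:
    """Euclidean distance between two keys; unknown characters get a fixed penalty."""
    if a == b:
        return 0.0
    if a not in KEY_POS or b not in KEY_POS:
        return 2.0
    (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
    return ((r1 - r2) ** 2 + (c1 - c2) ** 2) ** 0.5

def keyboard_edit_distance(s: str, t: str) -> float:
    """Levenshtein distance where substitution cost is scaled by key proximity."""
    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> float:
        if i == 0:
            return float(j)
        if j == 0:
            return float(i)
        sub = min(1.0, key_distance(s[i - 1], t[j - 1]) / 2.0)  # adjacent keys cost < 1
        return min(d(i - 1, j) + 1.0,      # deletion
                   d(i, j - 1) + 1.0,      # insertion
                   d(i - 1, j - 1) + sub)  # substitution
    return d(len(s), len(t))

# "hwllo" (adjacent-key typo) scores closer to "hello" than "hxllo" does:
print(keyboard_edit_distance("hello", "hwllo"), keyboard_edit_distance("hello", "hxllo"))
```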
Tokenizing text is such a hack even though it works pretty well. The state-of-the-art comes out of the gate with an approximation for quantifying language that's wrong on so many levels.
It's difficult to wrap my head around pixels being a more powerful representation of information, but someone's gotta come up with something other than tokenizers.
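For a concrete sense of that approximation, here's a small sketch using the cl100k_base BPE vocabulary via tiktoken (assuming the library is installed; the example words are arbitrary). Small surface changes, like the adjacent-key typo mentioned upthread, can produce completely different token sequences:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a common BPE vocabulary

for word in ["Hello", "Hwllo", " unhappiness", "antidisestablishmentarianism"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r:>32} -> {len(ids)} token(s): {pieces}")
```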
The relevant technical term is "saccade"
> ADHD: Studies have shown a consistent reduction in ability to suppress unwanted saccades, suggesting an impaired functioning of areas like the dorsolateral prefrontal cortex.
> Autism: An elevated number of antisaccade errors has been consistently reported, which may be due to disturbances in frontal cortical areas.
Also see https://en.wikipedia.org/wiki/Eye_movement_in_reading
But I think in this case you can still generate typos in the rendered images and it'd be learnable, so it's not a hard issue for the OP's approach.
Rather, I feel like I absorb written meaning in units closer to paragraphs than to words or sentences. I’d describe my rapid up-and-down, back-and-forth eye motions as something closer to going back to soak up more, if that makes sense. To reinterpret it in the context of what came after it. The analogy that comes to mind is to a Progressive JPEG getting crisper as more loads.
That eyewiki entry was really cool. Among the unexpectedly interesting bits:
> The initiation of a saccade takes about 200 milliseconds[4]. Saccades are said to be ballistic because the movements are predetermined at initiation, and the saccade generating system cannot respond to subsequent changes in the position of the target after saccade initiation[4].
It seems to me that would be a tick in the “pro” column for this idea of using pixels (or contours, a la JPEG) as the models’ fundamental stimulus to train against (as opposed to textual tokens). Isn’t there a comparison to be drawn between the “threads” you describe here, and the multi-headed attention mechanisms (or whatever it is) that the LLM models use to weigh associations at various distances between tokens?
Actually, thinking more about that: I consume "text" as images and also as sounds. Instead of the render-and-OCR approach this suggests, I kinda wonder whether doing TTS and encoding something like an MP3 sample of the vocalization of the words would be fewer bytes than the rendered-pixels version. Probably depends on the resolution / sample rate.
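A rough back-of-envelope comparison, with made-up but plausible numbers; the speaking rate, bitrate, glyph size, and compression ratio below are all assumptions for illustration, not measurements:

```python
# Bytes to carry ~100 words as compressed speech vs. a rendered text image.
words = 100

# Speech: ~150 words/min spoken, MP3 at 32 kbps (mono, speech-quality)
speech_seconds = words / (150 / 60)
mp3_bytes = speech_seconds * 32_000 / 8

# Rendered text: ~6 chars/word at ~12x20 px per glyph, 1 bit/px, ~10:1 lossless compression
chars = words * 6
raw_pixel_bits = chars * 12 * 20
image_bytes = raw_pixel_bits / 8 / 10

print(f"speech: ~{mp3_bytes / 1024:.1f} KiB for {speech_seconds:.0f}s of audio")
print(f"image:  ~{image_bytes / 1024:.1f} KiB of rendered text")
```

Under these assumptions the rendered image comes out far smaller, but the answer really does swing with the chosen sample rate and rendering resolution.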
But in the DeepSeek-OCR paper, compressing more text into the same number of visual input tokens leads to progressively worse output precision, so it's not a free lunch but a speed-quality tradeoff. More fine-grained KV-cache compression methods might deliver better speedups without degrading the output as much.