Karpathy on DeepSeek-OCR paper: Are pixels better inputs to LLMs than text?

1. hbarka ◴[22 Oct 25 22:32 UTC] No.45676016[source]▶

Chinese writing is logographic. Could this be giving Chinese developers a better intuition for pixels as input rather than text?

replies(3): >>45676915 #>>45678830 #>>45679059 #

2. anabis ◴[23 Oct 25 00:40 UTC] No.45676915[source]▶

>>45676016 (TP) #

Yeah, mapping Chinese characters to a linear UTF-8 space throws a lot of information away. Each language brings its own ideas to text processing. For example, the inventor of SentencePiece is Japanese, and Japanese doesn't have explicit word delimiters.

3. hobofan ◴[23 Oct 25 06:32 UTC] No.45678830[source]▶

>>45676016 (TP) #

Yeah, that sounds quite interesting. I'm wondering whether the gap in performance (= quality) between text-only input and vision OCR is bigger in Chinese than in English.

There is indeed a lot of semantic information contained in the signs that should help an LLM. E.g. there is a clear visual connection between 木 (wood/tree) and 林 (forest), while an LLM that purely has to draw a connection between "tree" and "forest" would have a much harder time seeing that connection independent of whether it's fed that as text or vision tokens.
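To make the 木/林 point concrete, here is a minimal Python sketch of what a text-only pipeline gets before tokenization: the code point and the UTF-8 bytes. The helper name `utf8_hex` is made up for illustration, and this only shows the raw encoding, not what any particular tokenizer does with it.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of a single character as spaced hex (hypothetical helper)."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))

# 林 is visually two 木 side by side, but nothing in the encoding says so.
for ch in ("木", "林"):
    print(ch, f"U+{ord(ch):04X}", utf8_hex(ch))

# 木 U+6728 E6 9C A8
# 林 U+6797 E6 9E 97
```

The two byte sequences share a leading byte only because both code points sit in the same Unicode block; the "two trees" structure of 林 is not recoverable from the bytes, whereas it is plainly there in the rendered pixels.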

4. est ◴[23 Oct 25 07:08 UTC] No.45679059[source]▶

>>45676016 (TP) #

Chinese text == Method of loci

Many Chinese students have a good enough memory to recall a particular paragraph and understand its meaning, yet have no idea how those words are pronounced.