
233 points JnBrymn | 3 comments
1. nl ◴[] No.45677385[source]
Karpathy's points are correct (of course).

One thing I like about text tokens though is that it learns some understanding of the text input method (particularly the QWERTY keyboard).

"Hello" and "Hwllo" are closer in semantic space than you'd think because "w" and "e" are next to each other.

This is much easier to see in hand-coded spelling models, where you can get better results by combining a "keyboard distance" metric with a string-distance metric.
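To illustrate the idea, here is a minimal sketch (not from any particular spelling library) of a Levenshtein distance whose substitution cost shrinks when the two keys are physically close on a QWERTY layout. The row offsets and the cost scaling are illustrative assumptions:

```python
# Approximate QWERTY layout; each key gets an (row, col) coordinate,
# with lower rows shifted right as on a physical keyboard.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {ch: (r, c + 0.5 * r)
           for r, row in enumerate(QWERTY_ROWS)
           for c, ch in enumerate(row)}

def key_distance(a, b):
    """Euclidean distance between two keys; large default for unknown chars."""
    if a == b:
        return 0.0
    pa, pb = KEY_POS.get(a), KEY_POS.get(b)
    if pa is None or pb is None:
        return 2.0
    return ((pa[0] - pb[0]) ** 2 + (pa[1] - pb[1]) ** 2) ** 0.5

def keyboard_levenshtein(s, t):
    """Edit distance where substituting an adjacent key costs less than 1."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Substitution cost scaled by key proximity, capped at 1.
            sub = min(1.0, key_distance(s[i - 1], t[j - 1]) / 2.0)
            d[i][j] = min(d[i - 1][j] + 1.0,     # deletion
                          d[i][j - 1] + 1.0,     # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[m][n]
```

Under this metric, "hwllo" (w adjacent to e) scores closer to "hello" than "hpllo" (p far from e) does, which is exactly the signal a plain string distance throws away.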

replies(2): >>45677581 #>>45678933 #
2. swyx ◴[] No.45677581[source]
i'm particularly sympathetic to typo learning, which i think gets lost in the synthetic-data discussion (mine here: https://www.youtube.com/watch?v=yXPPcBlcF8U )

but i think in this case you can still generate typos in rendered images and it'd be learnable, so it's not a hard blocker for the OP's approach
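Generating keyboard-plausible typos for synthetic training data is straightforward; here is a hypothetical sketch (names and neighbor heuristics are my own, not from the talk) that swaps one letter for a QWERTY neighbor before the word would be rendered to an image:

```python
import random

QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

# Build a rough adjacency map: same-row neighbors plus the keys
# roughly above/below in adjacent rows.
NEIGHBORS = {}
for r, row in enumerate(QWERTY_ROWS):
    for c, ch in enumerate(row):
        adj = []
        if c > 0:
            adj.append(row[c - 1])
        if c < len(row) - 1:
            adj.append(row[c + 1])
        for rr in (r - 1, r + 1):
            if 0 <= rr < len(QWERTY_ROWS):
                other = QWERTY_ROWS[rr]
                for cc in (c - 1, c):
                    if 0 <= cc < len(other):
                        adj.append(other[cc])
        NEIGHBORS[ch] = adj

def inject_typo(word, rng=random):
    """Replace one random letter with a QWERTY neighbor (e.g. e -> w)."""
    idxs = [i for i, ch in enumerate(word) if ch.lower() in NEIGHBORS]
    if not idxs:
        return word
    i = rng.choice(idxs)
    repl = rng.choice(NEIGHBORS[word[i].lower()])
    return word[:i] + repl + word[i + 1:]
```

A pipeline could then render both the clean and corrupted strings to images, so the model sees the same keyboard-adjacency structure through pixels that text tokens expose directly.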

3. harperlee ◴[] No.45678933[source]
But assuming that pixel input gets us to an AI capable of reading, it would presumably also detect HWLLO as semantically close to HELLO (similarly to H3LL0 or badly handwritten text, although in those latter examples there is some graphical structure to help). At the end of the day, we humans are capable of identifying that. It might require more training effort, but the result would be more general.