It might be that our current tokenization is inefficient compared to how well the image pipeline does. Language already does a lot of compression, but there might be an even better way to represent it in latent space.
Image models use "larger" tokens, where each token carries more content. You can get a similar effect with text tokens by using a larger token dictionary that turns common n-grams into single tokens, but the current LLM architecture isn't friendly to large output distributions, since the softmax over the vocabulary grows with it.
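A minimal sketch of the trade-off, with a made-up corpus, frequency cutoff, and model width just for illustration: promoting frequent bigrams to single tokens shrinks sequences, but every added dictionary entry adds a row to the output projection.

```python
# Illustrative sketch only: merge frequent word bigrams into "larger" tokens
# and show how the output layer grows with vocabulary size.
from collections import Counter

corpus = "the cat sat on the mat . the cat ran to the mat".split()

# Base vocabulary: individual words.
vocab = set(corpus)

# Count adjacent word pairs and promote frequent ones to n-gram tokens.
bigrams = Counter(zip(corpus, corpus[1:]))
for (a, b), count in bigrams.items():
    if count >= 2:                # arbitrary frequency cutoff
        vocab.add(f"{a} {b}")     # e.g. "the cat" becomes one token

print(f"vocab size with n-gram tokens: {len(vocab)}")

# The cost: each token in the dictionary is a row in the output projection,
# so the per-step softmax scales linearly with vocabulary size.
hidden_dim = 4096                 # hypothetical model width
print(f"output-projection parameters: {len(vocab) * hidden_dim:,}")
```

Real tokenizers (BPE and friends) already do a version of this merging, just stopping at vocabularies small enough that the output softmax stays manageable.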