(twitter.com)

233 points JnBrymn | 1 comment | 21 Oct 25 17:43 UTC

https://xcancel.com/karpathy/status/1980397031542989305

yunwal [21 Oct 25 20:11 UTC] No.45661042

> The more interesting part for me (esp as a computer vision at heart who is temporarily masquerading as a natural language person) is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible, at the input.

> Maybe it makes more sense that all inputs to LLMs should only ever be images.

So, what, every time I want to ask an LLM a question I paint a picture? I mean at that point why not just say "all input to LLMs should be embeddings"?

1. fspeech [22 Oct 25 22:18 UTC] No.45675872

>>45661042 #

If you can read your input on your screen, your computer apparently knows how to convert your text to images.

Karpathy on DeepSeek-OCR paper: Are pixels better inputs to LLMs than text?