(twitter.com)

237 points JnBrymn | 1 comments | 21 Oct 25 17:43 UTC

https://xcancel.com/karpathy/status/1980397031542989305

yunwal ◴[21 Oct 25 20:11 UTC] No.45661042[source]▶

> The more interesting part for me (esp as a computer vision at heart who is temporarily masquerading as a natural language person) is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible, at the input.

> Maybe it makes more sense that all inputs to LLMs should only ever be images.

So, what, every time I want to ask an LLM a question I paint a picture? I mean at that point why not just say "all input to LLMs should be embeddings"?

replies(4): >>45661392 #>>45675872 #>>45676027 #>>45678135 #

1. CuriouslyC ◴[22 Oct 25 22:34 UTC] No.45676027[source]▶

>>45661042 #

All inputs being embeddings can work if you have an embedding scheme like Matryoshka, where any prefix of the vector is itself a usable embedding. The hard part is adaptively selecting the embedding size for a given datum.
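
To make the size-selection problem concrete, here is a rough sketch of one way it could look. It assumes the embedding was trained Matryoshka-style, so a prefix of the vector can be used on its own after renormalizing. The `pick_size` rule (take the smallest prefix that holds a given share of the squared norm) and the names `truncate`, `pick_size`, `sizes`, and `keep` are all made up for illustration; this is not a method from the paper or from any library.

```python
import math

def truncate(vec, k):
    """Keep the first k dimensions and L2-renormalize the prefix."""
    prefix = vec[:k]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return list(prefix)  # nothing to normalize
    return [x / norm for x in prefix]

def pick_size(vec, sizes=(2, 4, 8), keep=0.9):
    """Hypothetical heuristic: smallest candidate size whose prefix
    holds at least `keep` of the vector's squared norm."""
    total = sum(x * x for x in vec)
    for k in sizes:
        if sum(x * x for x in vec[:k]) >= keep * total:
            return k
    return len(vec)  # no candidate was enough, keep everything

# A datum whose information sits in the first two dimensions
# gets a short embedding; a "flat" one needs the full vector.
concentrated = [3.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
flat = [1.0] * 8
small = truncate(concentrated, pick_size(concentrated))
full = truncate(flat, pick_size(flat))
```

A real system would probably pick the size from a downstream signal (retrieval quality, model loss) rather than the norm, which is exactly the part that is hard.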

Karpathy on DeepSeek-OCR paper: Are pixels better inputs to LLMs than text?