237 points JnBrymn | 1 comments
yunwal No.45661042
> The more interesting part for me (esp as a computer vision at heart who is temporarily masquerading as a natural language person) is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible, at the input.

> Maybe it makes more sense that all inputs to LLMs should only ever be images.

So, what, every time I want to ask an LLM a question I paint a picture? I mean at that point why not just say "all input to LLMs should be embeddings"?

replies(4): >>45661392 #>>45675872 #>>45676027 #>>45678135 #
1. CuriouslyC No.45676027
All inputs being embeddings can work if you have embeddings like Matryoshka; the hard part is adaptively selecting the embedding size for a given datum.
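
To make the size-selection problem concrete, here is a rough sketch, not anyone's actual method. It assumes the embedding was trained Matryoshka-style, so any leading prefix is usable on its own after renormalizing. The `pick_size` heuristic (keep the smallest prefix that holds most of the vector's squared norm), the candidate sizes, and the 0.9 threshold are all made up for illustration; a real system would more likely learn this choice.

```python
import math

def truncate(embedding, k):
    """Keep the first k dimensions and L2-renormalize the prefix."""
    prefix = embedding[:k]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return list(prefix)  # nothing to normalize
    return [x / norm for x in prefix]

def pick_size(embedding, sizes=(2, 4, 8), energy=0.9):
    """Hypothetical heuristic: return the smallest candidate size whose
    prefix holds at least `energy` of the full vector's squared norm."""
    total = sum(x * x for x in embedding)
    for k in sorted(sizes):
        kept = sum(x * x for x in embedding[:k])
        if total == 0.0 or kept >= energy * total:
            return k
    return max(sizes)

# A datum whose information sits in the first dimensions gets a short
# embedding; one that is spread evenly keeps the full size.
concentrated = [3.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
spread = [1.0] * 8
short = truncate(concentrated, pick_size(concentrated))
full = truncate(spread, pick_size(spread))
```

The energy threshold is only a stand-in: it shows where a per-datum decision would plug in, not how to make that decision well.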