
688 points crescit_eundo | 1 comment
azeirah ◴[] No.42141993[source]
Maybe I'm really stupid... but perhaps if we want really intelligent models we need to stop tokenizing altogether? We're literally limiting what a model can see and how it perceives the world by constraining the structure of the information streams that reach the model from the very beginning.

I know working with raw bits or bytes is slower, but it should be relatively cheap and easy to at least falsify the hypothesis that many of these huge issues come down to tokenization. But... yeah.

I'm surprised I don't see more research into radically different tokenization.
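
A rough sketch of why byte-level modeling gets expensive (the ~4 bytes per BPE token ratio below is an assumed ballpark figure, not a measurement): the sequence gets several times longer than with subword tokens, and self-attention cost grows roughly quadratically with sequence length.

    text = "Tokenization decides what the model can actually see."

    n_bytes = len(text.encode("utf-8"))   # byte-level sequence length
    n_subwords = max(1, n_bytes // 4)     # rough BPE estimate (~4 bytes/token, assumed)

    # Self-attention cost scales roughly O(n^2), so the relative blowup is about:
    blowup = (n_bytes / n_subwords) ** 2
    print(f"bytes: {n_bytes}, est. subword tokens: {n_subwords}, "
          f"attention cost factor: ~{blowup:.0f}x")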

replies(14): >>42142033 #>>42142384 #>>42143197 #>>42143338 #>>42143381 #>>42144059 #>>42144207 #>>42144582 #>>42144600 #>>42145725 #>>42146419 #>>42146444 #>>42149355 #>>42151016 #
1. blixt ◴[] No.42145725[source]
I think it's unfortunately infeasible to train on bytes, but yeah, it also seems very wrong to use a handwritten and ultimately human-designed notion of tokens (if you take a look at the tokenizers out there, you'll find fun things like regular expressions that change what gets tokenized based on anecdotal evidence).
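
To make that concrete, here is the kind of handwritten rule I mean: the GPT-2-style pre-tokenization regex, which hard-codes English contractions and letter/digit/punctuation splits before any learned merging happens. It needs the third-party regex module for the \p{...} Unicode classes.

    import regex  # third-party module; the stdlib re doesn't support \p{...}

    # GPT-2-style pre-tokenization pattern (as published in the GPT-2 encoder code).
    GPT2_SPLIT_PATTERN = (
        r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+|"""
        r""" ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
    )

    print(regex.findall(GPT2_SPLIT_PATTERN, "Maybe we shouldn't tokenize at all?"))
    # ['Maybe', ' we', ' shouldn', "'t", ' tokenize', ' at', ' all', '?']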

I keep thinking that if we can turn images into tokens, and we can turn audio into tokens, then surely we can create a set of tokens where the tokens are the model's own chosen representation for semantic (multimodal) meaning, and then decode those tokens back to text[1]. Obviously a big downside would be that the model can no longer 1:1 quote all text it's seen since the encoded tokens would need to be decoded back to text (which would be lossy).
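 
Just to sketch what "the model's own tokens" could look like: a toy VQ-VAE-style quantizer, where continuous (multimodal) embeddings get snapped to the nearest entry in a learned codebook and the resulting ids are the discrete tokens. The sizes here are made up and a real system would learn the codebook jointly with an encoder and decoder; note the reconstruction is lossy, which is exactly the 1:1-quoting downside mentioned above.

    import numpy as np

    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(1024, 256))  # 1024 hypothetical "semantic tokens", 256-dim

    def encode(embeddings: np.ndarray) -> np.ndarray:
        """Quantize each embedding to the id of its nearest codebook entry."""
        dists = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        return dists.argmin(axis=1)  # discrete token ids

    def decode(token_ids: np.ndarray) -> np.ndarray:
        """Map token ids back to their codebook vectors (a lossy reconstruction)."""
        return codebook[token_ids]

    x = rng.normal(size=(5, 256))   # stand-in for encoder outputs
    ids = encode(x)
    x_hat = decode(ids)
    print(ids, float(((x - x_hat) ** 2).mean()))  # token ids + reconstruction error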

[1] From what I could gather, this is exactly what OpenAI did with images in their gpt-4o report, check out "Explorations of capabilities": https://openai.com/index/hello-gpt-4o/