
688 points crescit_eundo | 3 comments
azeirah No.42141993
Maybe I'm really stupid... but perhaps if we want really intelligent models we need to stop tokenizing altogether? We're literally limiting what a model can see and how it perceives the world by constraining the structure of the information streams that come into the model from the very beginning.

I know working with raw bits or bytes is slower, but it should be relatively cheap and easy to at least falsify the hypothesis that many huge issues are due to tokenization problems, but... yeah.

Surprised I don't see more research into radically different tokenization.
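
A minimal sketch of the contrast being described, assuming the tiktoken package is available; it compares the subword view a tokenized model gets with the raw UTF-8 byte view a tokenizer-free model would get:

    import tiktoken

    text = "Tokenization limits what the model perceives."

    enc = tiktoken.get_encoding("cl100k_base")        # a common BPE vocabulary
    token_ids = enc.encode(text)
    token_strs = [enc.decode([t]) for t in token_ids]

    byte_ids = list(text.encode("utf-8"))             # the "no tokenizer" view: one ID per byte

    print(len(token_ids), token_strs)                 # a handful of subword chunks
    print(len(byte_ids))                              # a longer, but lossless, byte sequence

The byte sequence is lossless but several times longer than the token sequence, which is the speed cost mentioned above.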

replies(14): >>42142033 #>>42142384 #>>42143197 #>>42143338 #>>42143381 #>>42144059 #>>42144207 #>>42144582 #>>42144600 #>>42145725 #>>42146419 #>>42146444 #>>42149355 #>>42151016 #
numpad0 No.42144207
hot take: LLM tokens are kanji for AI, and just like kanji it works okay sometimes but fails miserably at the task of accurately representing English
replies(2): >>42148388 #>>42150181 #
1. int_19h No.42150181
You could absolutely write a tokenizer that would consistently tokenize all distinct English words as distinct tokens, with a 1:1 mapping.

But AFAIK there's no evidence that this actually improves anything, and if you spend that much of the dictionary on one language, it comes at the cost of making the encoding for everything else much less efficient.
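
A toy sketch of the 1:1 word-to-token scheme described above (the class and names are made up for illustration); it shows both the exact mapping and the out-of-vocabulary fallback that eats into the dictionary budget:

    class WordTokenizer:
        """Hypothetical word-level tokenizer: every known word is exactly one token."""

        def __init__(self, corpus_words):
            self.unk_id = 0
            self.vocab = {w: i + 1 for i, w in enumerate(sorted(set(corpus_words)))}
            self.inverse = {i: w for w, i in self.vocab.items()}

        def encode(self, text):
            # 1:1 word-to-token mapping; anything unseen collapses to a single <unk> ID
            return [self.vocab.get(w, self.unk_id) for w in text.lower().split()]

        def decode(self, ids):
            return " ".join(self.inverse.get(i, "<unk>") for i in ids)

    tok = WordTokenizer(["the", "cat", "sat", "on", "mat"])
    print(tok.encode("the cat sat on the mat"))      # each word maps to exactly one ID
    print(tok.encode("the cat sat on the tatami"))   # unseen word -> <unk>

Covering English this way would take hundreds of thousands of vocabulary slots, which is the dictionary cost the comment points to.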

replies(1): >>42155255 #
2. numpad0 No.42155255
I mean, it just feels to me that current LLMs must architecturally favor a fixed-length "ideome", like a phoneme but for meaning, having been conceived under the influence of research on CJK languages.

And being architecturally based on such idea-like elements, I just casually thought there could be limits to how far it can be pushed toward perfecting English, and that some radical change - not simply dropping tokenization but something more fundamental - has to take place at some point.

replies(1): >>42159182 #
3. int_19h No.42159182
I don't think it's hard for an LLM to treat a sequence of two tokens as a semantically meaningful unit, though. It has to handle much more complicated dependencies to parse the higher-level syntactic structures of the language.
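
For concreteness, a small check (again assuming tiktoken) of the kind of multi-token word being discussed; rarer English words routinely span two or more tokens that the model has to recombine into one unit:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for word in ["cat", "tokenization", "antidisestablishmentarianism"]:
        ids = enc.encode(word)
        print(word, "->", [enc.decode([i]) for i in ids])
    # Common words tend to be a single token; rarer ones split into several,
    # yet the model still treats each word as one semantic unit.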