
688 points by crescit_eundo | 2 comments
azeirah ◴[] No.42141993[source]
Maybe I'm really stupid... but perhaps if we want really intelligent models we need to stop tokenizing at all? We're literally limiting what a model can see and how it perceives the world by limiting the structure of the information streams that come into the model from the very beginning.

I know working with raw bits or bytes is slower, but it should be relatively cheap and easy to at least try to falsify the hypothesis that many of these huge issues are really tokenization problems... but yeah.

Surprised I don't see more research into radically different tokenization.
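
To make the contrast concrete, here's a minimal sketch (assuming the `tiktoken` package for a GPT-style BPE vocabulary; the encoding name is just an example) of what a subword tokenizer feeds the model versus what a byte-level, tokenizer-free model would see:

```python
# Sketch: compare what a BPE tokenizer sees vs. what a byte-level model would see.
# Assumes the `tiktoken` package (pip install tiktoken); encoding name is an example.
import tiktoken

text = "strawberry"

enc = tiktoken.get_encoding("cl100k_base")   # a GPT-4-era BPE vocabulary
bpe_ids = enc.encode(text)
bpe_pieces = [enc.decode([i]) for i in bpe_ids]

byte_ids = list(text.encode("utf-8"))        # what a tokenizer-free model would consume

print("BPE tokens:", bpe_pieces)   # e.g. ['str', 'aw', 'berry'] -- letters hidden inside chunks
print("raw bytes :", byte_ids)     # one integer per character; every letter is visible
```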

replies(14): >>42142033 #>>42142384 #>>42143197 #>>42143338 #>>42143381 #>>42144059 #>>42144207 #>>42144582 #>>42144600 #>>42145725 #>>42146419 #>>42146444 #>>42149355 #>>42151016 #
aithrowawaycomm ◴[] No.42142384[source]
FWIW I think most of the "tokenization problems" are in fact reasoning problems being falsely blamed on a minor technical thing when the issue is much more profound.

E.g. I still see people claiming that LLMs are bad at basic counting because of tokenization, but the same LLM counts perfectly well if you use chain-of-thought prompting. So it can't be explained by tokenization! The problem is reasoning: the LLM needs a human to tell it that a counting problem can be solved accurately by going step by step. Without this assistance the LLM is likely to simply guess.
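
This is easy to check directly. A rough sketch of the experiment, assuming the `openai` Python client and a placeholder model name (the prompts and model are illustrative, and replies will vary):

```python
# Sketch: ask the same counting question with and without an explicit step-by-step
# instruction, then compare against ground truth computed in ordinary code.
# Assumes the `openai` package and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()
word, letter = "strawberry", "r"
ground_truth = word.count(letter)  # trivially 3 when computed directly

prompts = {
    "direct": f"How many times does the letter '{letter}' appear in '{word}'? Answer with a single number.",
    "step_by_step": (
        f"Go through '{word}' one character at a time, note each '{letter}' you see, "
        "keep a running count, then state the final count."
    ),
}

for name, prompt in prompts.items():
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; swap in whatever model you're testing
        messages=[{"role": "user", "content": prompt}],
    )
    print(name, "->", reply.choices[0].message.content, f"(ground truth: {ground_truth})")
```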

replies(6): >>42142733 #>>42142807 #>>42143239 #>>42143800 #>>42144596 #>>42146428 #
ipsum2 ◴[] No.42142733[source]
The more obvious alternative is that CoT is making up for the deficiencies in tokenization, which I believe is the case.
replies(1): >>42142913 #
aithrowawaycomm ◴[] No.42142913{3}[source]
I think the more obvious explanation has to do with computational complexity: counting is an O(n) problem, but transformer LLMs can’t solve O(n) problems unless you use CoT prompting: https://arxiv.org/abs/2310.07923
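
To make the intuition concrete: counting needs n sequential updates to a running tally, and chain-of-thought lets the model write that tally out token by token instead of having to compute it in a single fixed-depth forward pass. A toy sketch of the kind of trace a CoT answer externalizes:

```python
# Toy illustration of the complexity argument: counting is n sequential tally updates.
# Each printed line corresponds to intermediate state a CoT answer writes out as tokens.
def count_with_trace(seq, target):
    tally = 0
    for i, item in enumerate(seq, start=1):
        if item == target:
            tally += 1
        print(f"step {i}: saw {item!r}, tally = {tally}")
    return tally

print("final count:", count_with_trace(list("strawberry"), "r"))
```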
replies(2): >>42143402 #>>42150368 #
ipsum2 ◴[] No.42143402{4}[source]
What you're saying is an explanation of what I said, but I agree with you ;)
replies(1): >>42148535 #
1. aithrowawaycomm ◴[] No.42148535{5}[source]
No, it's a rebuttal of what you said: CoT is not making up for a deficiency in tokenization, it's making up for a deficiency in transformers themselves. These complexity results have nothing to do with tokenization, or even LLMs; they are about the complexity class of problems that can be solved by transformers.
replies(1): >>42150513 #
2. ipsum2 ◴[] No.42150513[source]
There's a really obvious way to test whether the strawberry issue is tokenization: replace each letter with a number, then ask ChatGPT to count the number of 3s.

Count the number of 3s, only output a single number: 6 5 3 2 8 7 1 3 3 9.

ChatGPT: 3.
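
For reference, a small sketch of how such a test can be generated and scored. The `make_counting_prompt` helper and the letter-to-digit mapping are made up for illustration; the target letter is pinned to 3, mirroring 'r' -> 3 in the example above:

```python
# Sketch: build the digit version of the counting test and its expected answer.
# Space-separated digits avoid the multi-character tokens a word like "strawberry"
# gets split into, which is what removes tokenization as a variable.
import random

def make_counting_prompt(word, target_letter, target_digit=3):
    # hypothetical helper: give each distinct letter its own digit, with the target
    # letter fixed to `target_digit`; works for words with at most 10 distinct letters
    spare = [d for d in range(10) if d != target_digit]
    random.shuffle(spare)
    mapping = {target_letter: target_digit}
    for ch in word:
        if ch not in mapping:
            mapping[ch] = spare.pop()
    digits = [mapping[ch] for ch in word]
    prompt = (f"Count the number of {target_digit}s, only output a single number: "
              + " ".join(str(d) for d in digits))
    return prompt, digits.count(target_digit)

prompt, expected = make_counting_prompt("strawberry", "r")
print(prompt)                  # e.g. "Count the number of 3s, only output a single number: 6 5 3 ..."
print("expected:", expected)   # 3, matching the three 'r's in "strawberry"
```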