andy99:
> inability to detect the number of r's in :strawberry: meme

Can someone (who knows about LLMs) explain why the r's-in-strawberry thing is related to tokenization? I have no reason to believe an LLM would be better at counting letters if each letter were one token. It's not like they "see" any of it. Are they better at counting tokens than letters for some reason? Or is this just one of those things someone misinformed said to sound smart to even less-informed people, and it got picked up?

ijk:
Well, which is easier:

Count the number of Rs in this sequence: [496, 675, 15717]

Count the number of 18s in this sequence: 19 20 18 1 23 2 5 18 18 25
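
To make that concrete, here's a rough sketch with the tiktoken library (treat it as illustrative; the exact IDs and splits depend on the encoding):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4 encoding
    word = "strawberry"
    ids = enc.encode(word)

    # Counting letters is trivial when the characters are visible...
    print(word.count("r"))  # 3

    # ...but the model only receives opaque integer IDs (something like
    # [496, 675, 15717]); nothing about the letter "r" is visible in them.
    print(ids)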

ASalazarMX:
For an LLM? No idea.

Human: Which of these formulas is easier?

1. x = SQRT(4)

2. x = SQRT(123567889.987654321)

Computer: They're both the same.

ijk:
You can view the tokenization for yourself: https://huggingface.co/spaces/Xenova/the-tokenizer-playgroun...

[496, 675, 15717] is the GPT-4 representation of the tokens. To determine which letters a token represents, the model needs to learn the relationship between "str" and [496]. It can learn that mapping (since it can spell the word out as "S-T-R" or "1. S, 2. T, 3. R" or whatever), but it adds an extra step.
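
In code terms, the lookup the model has to carry out implicitly looks something like this (a sketch, again assuming tiktoken and the cl100k_base encoding):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("strawberry")

    # The "extra step": map each opaque ID back to the characters it stands
    # for, then count letters in the recovered pieces. The model has no
    # decode() call; it has to have absorbed this mapping from training data.
    pieces = [enc.decode([i]) for i in ids]
    print(pieces, sum(p.count("r") for p in pieces))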

The question is whether the extra step adds enough extra processing to degrade performance. Does the more compact representation buy enough extra context to make the tokenized version more effective for more problems?

It seems like the longer context length makes the trade-off worth it, since spelling problems are a relatively minor subset. On the other hand, for numbers it does appear that math is significantly worse when the model doesn't have access to individual digits (early Llama math results, for example); once the digit tokenization was changed, math performance improved.
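
For the number case, the difference is easy to see (again a sketch with tiktoken; the exact chunking varies by tokenizer):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    number = "123567889"

    # A BPE tokenizer typically merges digits into multi-digit chunks, so the
    # model never sees the individual digits needed for column-wise arithmetic.
    print([enc.decode([i]) for i in enc.encode(number)])  # e.g. ['123', '567', '889']

    # Digit-level tokenization (the change mentioned above) hands the model
    # one digit per token instead.
    print(list(number))  # ['1', '2', '3', '5', '6', '7', '8', '8', '9']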