andy99:
> inability to detect the number of r's in :strawberry: meme

Can someone (who knows about LLMs) explain why the r's-in-strawberry thing is related to tokenization? I have no reason to believe an LLM would be better at counting letters if each letter were one token. It's not like they "see" any of it. Are they better at counting tokens than letters for some reason? Or is this just one of those things someone misinformed said to sound smart to even less-informed people, and it got picked up?

ijk:
Well, which is easier:

Count the number of Rs in this sequence: [496, 675, 15717]

Count the number of 18s in this sequence: 19 20 18 1 23 2 5 18 18 25
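
To make that concrete, here's a rough sketch with the tiktoken library (treat it as illustrative; the exact IDs and splits depend on the encoding):

    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4 encoding
    word = "strawberry"
    ids = enc.encode(word)

    # Counting letters is trivial when the characters are visible...
    print(word.count("r"))  # 3

    # ...but the model only receives opaque integer IDs (something like
    # [496, 675, 15717]); nothing about the letter "r" is visible in them.
    print(ids)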

ASalazarMX:
For an LLM? No idea.

Human: Which of these formulas is easier?

1. x = SQRT(4)

2. x = SQRT(123567889.987654321)

Computer: They're both the same.

ijk:
You can view the tokenization for yourself: https://huggingface.co/spaces/Xenova/the-tokenizer-playgroun...

[496, 675, 15717] is the GPT-4 representation of the tokens. To determine which letters a token represents, the model needs to learn the relationship between "str" and [496]. It can learn that mapping (since it can spell the word out as "S-T-R" or "1. S, 2. T, 3. R" or whatever), but it adds an extra step.
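
In code terms, the lookup the model has to carry out implicitly looks something like this (a sketch, again assuming tiktoken and the cl100k_base encoding):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("strawberry")

    # The "extra step": map each opaque ID back to the characters it stands
    # for, then count letters in the recovered pieces. The model has no
    # decode() call; it has to have absorbed this mapping from training data.
    pieces = [enc.decode([i]) for i in ids]
    print(pieces, sum(p.count("r") for p in pieces))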

The question is whether the extra step adds enough extra processing to degrade performance. Does the more compact representation buy enough extra context to make the tokenized version more effective for more problems?

It seems like the longer context length makes the trade-off worth it, since spelling problems are a relatively minor subset. On the other hand, for numbers it does appear that math is significantly worse when the model doesn't have access to individual digits (early Llama math results, for example); once the digit tokenization was changed, math performance improved.
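
For the number case, the difference is easy to see (again a sketch with tiktoken; the exact chunking varies by tokenizer):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    number = "123567889"

    # A BPE tokenizer typically merges digits into multi-digit chunks, so the
    # model never sees the individual digits needed for column-wise arithmetic.
    print([enc.decode([i]) for i in enc.encode(number)])  # e.g. ['123', '567', '889']

    # Digit-level tokenization (the change mentioned above) hands the model
    # one digit per token instead.
    print(list(number))  # ['1', '2', '3', '5', '6', '7', '8', '8', '9']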