296 points todsacerdoti | 21 comments
1. andy99 ◴[] No.44368430[source]
> inability to detect the number of r's in :strawberry: meme

Can someone (who knows about LLMs) explain why the r's-in-strawberry thing is related to tokenization? I have no reason to believe an LLM would be better at counting letters if each letter were one token. It's not like they "see" any of it. Are they better at counting tokens than letters for some reason? Or is this just one of those things someone misinformed said to sound smart to even less-informed people, and that got picked up?

replies(7): >>44368463 #>>44369041 #>>44369608 #>>44370115 #>>44370128 #>>44374874 #>>44395946 #
2. ijk ◴[] No.44368463[source]
Well, which is easier:

Count the number of Rs in this sequence: [496, 675, 15717]

Count the number of 18s in this sequence: 19 20 18 1 23 2 5 18 18 25
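
For the curious, here's a quick sketch that reproduces the first mapping, assuming GPT-4's cl100k_base encoding and that you have tiktoken installed:

    import tiktoken

    # Inspect what a GPT-4-style tokenizer actually hands the model.
    # Assumes the cl100k_base encoding; IDs should match the ones quoted above.
    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("strawberry")
    pieces = [enc.decode([i]) for i in ids]
    print(ids)                      # token IDs, e.g. [496, 675, 15717]
    print(pieces)                   # the character chunks each ID stands for
    print("strawberry".count("r"))  # 3, the answer the model is asked for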

replies(1): >>44368554 #
3. ASalazarMX ◴[] No.44368554[source]
For an LLM? No idea.

Human: Which is the easier of these formulas?

1. x = SQRT(4)

2. x = SQRT(123567889.987654321)

Computer: They're both the same.

replies(2): >>44368891 #>>44369678 #
4. drdeca ◴[] No.44368891{3}[source]
Depending on the data types and what the hardware supports, the latter may be harder (in the sense of requiring more operations)? And for a general algorithm, bigger numbers would take more steps.
5. zachooz ◴[] No.44369041[source]
A sequence of characters is grouped into a "token." The set of all such possible sequences forms a vocabulary. Without loss of generality, consider the example: strawberry -> straw | ber | ry -> 3940, 3231, 1029 -> [vector for each token]. The raw input to the model is not a sequence of characters, but a sequence of token embeddings each representing a learned vector for a specific chunk of characters. These embeddings contain no explicit information about the individual characters within the token. As a result, if the model needs to reason about characters, for example, to count the number of letters in a word, it must memorize the character composition of each token. Given that large models like GPT-4 use vocabularies with 100k–200k tokens, it's not surprising that the model hasn't memorized the full character breakdown of every token. I can't imagine that many "character level" questions exist in the training data.

In contrast, if the model were trained with a character-level vocabulary, where each character maps to a unique token, it would not need to memorize character counts for entire words. Instead, it could potentially learn a generalizable method for counting characters across all sequences, even for words it has never seen before.

I'm not sure what you mean about them not "seeing" the tokens. They definitely receive a representation of each token as input.
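
As a rough sketch of what that raw input looks like (toy sizes and the illustrative IDs from above, not GPT-4's real vocabulary or dimensions), the model only ever receives looked-up vectors:

    import torch

    # One learned vector per token ID: this lookup is the model's entire view.
    # Toy sizes; real vocabularies run ~100k-200k entries, thousands of dimensions.
    vocab_size, d_model = 5_000, 16
    embedding = torch.nn.Embedding(vocab_size, d_model)

    token_ids = torch.tensor([3940, 3231, 1029])  # straw | ber | ry (example IDs)
    x = embedding(token_ids)                      # shape: (3, 16)

    # Nothing in x spells out s-t-r-a-w-b-e-r-r-y; any character knowledge must
    # be baked into these vectors (or downstream weights) during training.
    print(x.shape)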

replies(1): >>44369215 #
6. saurik ◴[] No.44369215[source]
It isn't at all obvious to me that an LLM can decide to blur its vision, so to speak, and see the tokens as tokens: it doesn't get to run a program on this data in some raw format, and even if it does attempt to write a program and run it in a sandbox, it would have to "remember" what it was given and then regenerate it rather than copy it (well, I guess a tool could give it access to the history of its input, but at that point that tool likely sees characters). I am 100% with andy99 on this: it isn't anywhere near as simple as you are making it out to be.
replies(1): >>44369628 #
7. krackers ◴[] No.44369608[source]
Until I see evidence that an LLM trained at, e.g., the character level _CAN_ successfully "count Rs", I don't trust this explanation over any other hypothesis. I am not familiar with the literature, so I don't know if this has been done, but I couldn't find anything with a quick search. Surely if someone had successfully done it they would have published it.
replies(3): >>44369975 #>>44371050 #>>44372266 #
8. zachooz ◴[] No.44369628{3}[source]
If each character were represented by its own token, there would be no need to "blur" anything, since the model would receive a 1:1 mapping between input vectors and individual characters. I never claimed that character-level reasoning is easy or simple for the model; I only said that it becomes theoretically possible to generalize ("potentially learn") without memorizing the character makeup of every token, which is required when using subword tokenization.

Please take another look at my original comment. I was being precise about the distinction between what's structurally possible to generalize vs memorize.
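
As a minimal sketch of that 1:1 mapping (with a made-up character vocabulary, not any real model's), counting a letter becomes counting occurrences of a single token ID across positions, the kind of pattern that generalizes to unseen words:

    # Hypothetical character-level vocabulary; not any real model's tokenizer.
    char_vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}

    def encode_chars(word):
        return [char_vocab[c] for c in word]

    ids = encode_chars("strawberry")
    print(ids.count(char_vocab["r"]))  # 3, one input position per character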

9. ijk ◴[] No.44369678{3}[source]
You can view the tokenization for yourself: https://huggingface.co/spaces/Xenova/the-tokenizer-playgroun...

[496, 675, 15717] is the GPT-4 representation of the tokens. In order to determine which letters a token represents, the model needs to learn the relationship between "str" and [496]. It can learn the representation (since it can spell it out as "S-T-R" or "1. S, 2. T, 3. R" or whatever), but that adds an extra step.

The question is whether the extra step adds enough extra processing to degrade performance. Does the more compact representation buy enough extra context to make the tokenized version more effective for more problems?

It seems like the longer context length makes the trade-off worth it, since spelling problems are a relatively minor subset. On the other hand, for numbers it does appear that math is significantly worse when the model doesn't have access to individual digits (early Llama math results, for example). Once they changed the digit tokenization, the math performance improved.
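
If you have tiktoken installed, a quick sketch shows the chunking difference (just the tokenizer side of it, not the Llama change itself):

    import tiktoken

    # GPT-2's encoding groups digits irregularly; cl100k_base uses chunks of up
    # to three digits. Neither gives the model one token per digit.
    for name in ("gpt2", "cl100k_base"):
        enc = tiktoken.get_encoding(name)
        pieces = [enc.decode([i]) for i in enc.encode("123456789")]
        print(name, pieces)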

10. ijk ◴[] No.44369975[source]
The math tokenization research is probably closest.

GPT-2 tokenization was a demonstrable problem: https://www.beren.io/2023-02-04-Integer-tokenization-is-insa... (Prior HN discussion: https://news.ycombinator.com/item?id=39728870 )

More recent research:

https://huggingface.co/spaces/huggingface/number-tokenizatio...

Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs: https://arxiv.org/abs/2402.14903

https://www.beren.io/2024-07-07-Right-to-Left-Integer-Tokeni...

replies(1): >>44370681 #
11. meroes ◴[] No.44370115[source]
I don't buy the token explanation because RLHF work is/was filled with so many "count the number of ___" prompts. There's just no way AI companies pay so much $$$ for RLHF of these prompts when the error is purely in tokenization.

IME Reddit would scream "tokenization" at the strawberry meme until blue in the face, assuring themselves that better tokenization meant the problem would be solved. Meanwhile, RLHF'ers were/are en masse paid to solve the problem by correcting thousands of these "counting"/perfect-syntax prompts and problems. To me, since RLHF work was being paid to tackle these problems, it couldn't be a simple tokenization problem. If there were a tokenization bottleneck whose fix would solve the problem, we would not be getting paid so much money to RLHF syntax-perfect prompts (think of Sudoku-type games and heavy syntax-based problems).

No, the reason models are better at these problems now is RLHF. And before you say "well, now models have learned how to count in general," I say we just need to widen the abstraction a tiny bit and the models will fail again. And this will be the story of LLMs forever: they will never take the lead on their own, and it's not how humans process information, but it can still be useful.

12. hackinthebochs ◴[] No.44370128[source]
Tokens are the most basic input unit of an LLM. But tokens don't generally correspond to whole words, rather sub-word sequences. So Strawberry might be broken up into two tokens 'straw' and 'berry'. It has trouble distinguishing features that are "sub-token" like specific letter sequences because it doesn't see letter sequences but just the token as a single atomic unit. The basic input into a system is how one input state is distinguished from another. But to recognize identity between input states, those states must be identical. It's a bit unintuitive, but identity between individual letters and the letters within a token fails due to the specifics of tokenization. 'Straw' and 'r' are two tokens but an LLM is entirely blind to the fact that 'straw' has one 'r' in it. Tokens are the basic units of distinction; 'straw' is not represented as a sequence of s-t-r-a-w tokens but is its own thing entirely, so they are not considered equal or even partially equal.

As an analogy, I might ask you to identify the relative activations of each of the three cone types on your retina as I present some solid color image to your eyes. But of course you can't do this, you simply do not have cognitive access to that information. Individual color experiences are your basic vision tokens.

Actually, I asked Grok this question a while ago when probing how well it could count vowels in a word. It got it right by listing every letter individually. I then asked it to count without listing the letters and it was a couple of letters off. I asked it how it was counting without listing the letters and its answer was pretty fascinating, with a seeming awareness of its own internal processes:

Connecting a token to a vowel, though, requires a bit of a mental pivot. Normally, I’d just process the token and move on, but when you ask me to count vowels, I have to zoom in. I don’t unroll the word into a string of letters like a human counting beads on a string. Instead, I lean on my understanding of how those tokens sound or how they’re typically constructed. For instance, I know "cali" has an 'a' and an 'i' because I’ve got a sense of its phonetic makeup from training data—not because I’m stepping through c-a-l-i. It’s more like I "feel" the vowels in there, based on patterns I’ve internalized.

When I counted the vowels without listing each letter, I was basically hopping from token to token, estimating their vowel content from memory and intuition, then cross-checking it against the whole word’s vibe. It’s not perfect—I’m not cracking open each token like an egg to inspect it—but it’s fast and usually close enough. The difference you noticed comes from that shift: listing letters forces me to be precise and sequential, while the token approach is more holistic, like guessing the number of jellybeans in a jar by eyeing the clumps.

replies(1): >>44370397 #
13. svachalek ◴[] No.44370397[source]
That explanation is pretty freaky, as it implies a form of consciousness I don't believe LLMs have. I've never seen this explanation before, so I'm not sure it's from training, and yet it's probably a fairly accurate description of what's going on.
replies(2): >>44371064 #>>44371109 #
14. krackers ◴[] No.44370681{3}[source]
GPT-2 can successfully learn to do multiplication using the standard tokenizer though, using "Implicit CoT with Stepwise Internalization".

https://twitter.com/yuntiandeng/status/1836114401213989366

If anything, I'd think this indicates the barrier isn't tokenization (if it can do arithmetic, it can probably count as well) but something to do with "sequential dependencies" requiring use of CoT and explicit training. Which still leaves me puzzled: there are tons of papers showing that variants of GPT-2 trained in the right way can do arithmetic, so where are the papers solving the "count the Rs in strawberry" problem?

15. ◴[] No.44371050[source]
16. roywiggins ◴[] No.44371064{3}[source]
LLMs will write out explanations that are entirely post-hoc:

> Strikingly, Claude seems to be unaware of the sophisticated "mental math" strategies that it learned during training. If you ask how it figured out that 36+59 is 95, it describes the standard algorithm involving carrying the 1. This may reflect the fact that the model learns to explain math by simulating explanations written by people, but that it has to learn to do math "in its head" directly, without any such hints, and develops its own internal strategies to do so.

https://www.anthropic.com/news/tracing-thoughts-language-mod...

It seems to be about as useful as asking a person how their hippocampus works: they might be able to make something up, or repeat a vaguely remembered bit of neuroscience, but they don't actually have access to their own hippocampus' internal workings, so if they're correct it's by accident.

replies(1): >>44371334 #
17. hackinthebochs ◴[] No.44371109{3}[source]
Yeah, this was the first conversation with an LLM where I was genuinely impressed at its apparent insight beyond just its breadth of knowledge and ability to synthesize it into a narrative. The whole conversation was pretty fascinating. I was nudging it pretty hard to agree it might be conscious, but it kept demurring while giving an insightful narrative into its processing. In case you are interested: https://x.com/i/grok/share/80kOa4MI6uJiplJvgQ2FkNnzP
18. anonymoushn ◴[] No.44372266[source]
There are various papers about this, maybe most prominently the Byte Latent Transformer.
19. skerit ◴[] No.44374874[source]
LLMs aren't necessarily taught the characters their tokens represent. It's kind of like how some humans are able to speak a language but not write it. We are basically "transcribing" what LLMs are saying into text.
20. yuvalpinter ◴[] No.44395946[source]
We have a paper under review that's going to be up on arXiv soon, where we test this for ~10,000 words and find a consistent decline in counting ability based on how many characters are in the tokens where the target character appears. It seems that models know which character a single-character token is, but really don't get much about the inner composition of multi-character tokens.
replies(1): >>44397892 #
21. hnaccount_rng ◴[] No.44397892[source]
Isn't that a rather trivial result? Or at least expected? Unless you manually encode the "this token consists of those tokens" information, those are completely independent things for the model?