
296 points by todsacerdoti | 5 comments
andy99 ◴[] No.44368430[source]
> inability to detect the number of r's in :strawberry: meme

Can someone (who knows about LLMs) explain why the r's-in-strawberry thing is related to tokenization? I have no reason to believe an LLM would be better at counting letters if each letter were its own token. It's not like they "see" any of it. Are they better at counting tokens than letters for some reason? Or is this just one of those things a misinformed person said to sound smart to even-less-informed people, and it got picked up?
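
For what it's worth, you can look at what a model actually receives as input with any BPE tokenizer. A minimal sketch in Python, assuming the tiktoken package and its cl100k_base encoding (the exact splits depend on the tokenizer, so treat the output as illustrative):

    # Show that a BPE tokenizer hands the model token IDs, not characters,
    # so the three r's never appear as separate input symbols.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for word in ["strawberry", " strawberry", "Strawberry"]:
        ids = enc.encode(word)
        pieces = [enc.decode([i]) for i in ids]
        print(f"{word!r:>14} -> {ids} -> {pieces}")
    # Typically the word comes out as a few multi-character chunks
    # (something like "str" + "awberry"), none of which is "r" by itself.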

replies(7): >>44368463 #>>44369041 #>>44369608 #>>44370115 #>>44370128 #>>44374874 #>>44395946 #
1. krackers ◴[] No.44369608[source]
Until I see evidence that an LLM trained at, e.g., the character level _CAN_ successfully "count Rs", I don't trust this explanation over any other hypothesis. I'm not familiar with the literature, so I don't know whether this has been done, but I couldn't find anything with a quick search. Surely if someone had successfully done it they would have published it.
replies(3): >>44369975 #>>44371050 #>>44372266 #
2. ijk ◴[] No.44369975[source]
The math tokenization research is probably closest.

GPT-2 tokenization was a demonstrable problem: https://www.beren.io/2023-02-04-Integer-tokenization-is-insa... (Prior HN discussion: https://news.ycombinator.com/item?id=39728870 )

More recent research:

https://huggingface.co/spaces/huggingface/number-tokenizatio...

Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs: https://arxiv.org/abs/2402.14903

https://www.beren.io/2024-07-07-Right-to-Left-Integer-Tokeni...
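
As a rough illustration of what those links are describing, here is a minimal sketch, assuming tiktoken and its "gpt2" encoding (GPT-2's BPE vocabulary), showing how irregularly integers get chunked:

    # Neighbouring integers can get completely different chunkings under
    # GPT-2's BPE vocabulary, which is the "integer tokenization is insane"
    # observation in the first link.
    import tiktoken

    enc = tiktoken.get_encoding("gpt2")
    for n in ["1", "12", "123", "1234", "12345", "2023", "2024"]:
        pieces = [enc.decode([i]) for i in enc.encode(n)]
        print(f"{n:>6} -> {pieces}")
    # Some integers come out as a single token, others split at arbitrary
    # digit boundaries, so the model never sees a consistent digit-by-digit
    # layout to do arithmetic over.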

replies(1): >>44370681 #
3. krackers ◴[] No.44370681[source]
GPT-2 can successfully learn to do multiplication with the standard tokenizer, though, using "Implicit CoT with Stepwise Internalization".

https://twitter.com/yuntiandeng/status/1836114401213989366

If anything, I'd think this indicates the barrier isn't tokenization (if it can do arithmetic, it can probably count as well) but something to do with "sequential dependencies" requiring the use of CoT and explicit training. Which still leaves me puzzled: there are tons of papers showing that variants of GPT-2 trained the right way can do arithmetic, so where are the papers solving the "count the Rs in strawberry" problem?
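
To make the "sequential dependency" framing concrete, here's a toy sketch (my reading of the argument, not anything from those papers): counting letters is a running tally over the string, and explicit CoT spells out each intermediate tally as text the model can condition on, which stepwise internalization would then try to fold back into the hidden states.

    # Counting as a chain of intermediate states: each printed line is the
    # kind of step an explicit chain of thought would write out.
    def count_with_steps(word: str, target: str) -> int:
        count = 0
        for i, ch in enumerate(word):
            if ch.lower() == target.lower():
                count += 1
            print(f"step {i}: saw {ch!r}, running count = {count}")
        return count

    print(count_with_steps("strawberry", "r"))  # -> 3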

4. ◴[] No.44371050[source]
5. anonymoushn ◴[] No.44372266[source]
There are various papers about this, maybe most prominently the Byte Latent Transformer.