I'd like to see a math/logic bench appear for tokenization schemes that captures this. BPB/perplexity is fine, but it's not everything.
https://arxiv.org/abs/2402.14903
You tokenize digits right-to-left in groups of 3, so 1234567 becomes 1 234 567 rather than the default 123 456 7. And if you ensure all 1-3 digit groups are in the vocab, it does much better.
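To make the grouping concrete, here's a minimal Python sketch (my own illustration, not code from the paper; group_digits_r2l is a made-up helper name):

```python
# Split a digit string into groups of three from the right, so
# "1234567" -> ["1", "234", "567"] instead of the left-to-right
# default ["123", "456", "7"].
def group_digits_r2l(digits: str, size: int = 3) -> list[str]:
    head = len(digits) % size          # length of the leading short group, if any
    groups = [digits[:head]] if head else []
    groups += [digits[i:i + size] for i in range(head, len(digits), size)]
    return groups

print(group_digits_r2l("1234567"))  # ['1', '234', '567']
print(group_digits_r2l("123456"))   # ['123', '456']
```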
Both https://arxiv.org/abs/2503.13423 and https://arxiv.org/abs/2504.00178 (co-author) independently noted that you can do this just by modifying the pre-tokenization regex, without having to explicitly add commas.
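As a rough sketch of how the regex trick works (the exact pattern in those papers may differ; this is just an illustration with Python's re): a lookahead forces each 1-3 digit match to leave a multiple of three digits to its right, which gives right-to-left grouping without inserting commas.

```python
import re

# Each match is 1-3 digits, constrained so that the digits remaining
# after it come in complete groups of three.
R2L_DIGITS = re.compile(r"\d{1,3}(?=(?:\d{3})*(?!\d))")

print(R2L_DIGITS.findall("1234567"))    # ['1', '234', '567']
print(R2L_DIGITS.findall("987654321"))  # ['987', '654', '321']
```

If I remember right, in a GPT-4-style split pattern the analogous change would be to its \p{N}{1,3} piece.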
Inthesamewaythatweusepunctuation. Or even that we usually order words a certain way, oranges and apples, Ted and Bill, roundabouts and swings.
The bitter lesson is that general methods and a system that learns trump trying to manually embed/program human knowledge into the system, so clever architecture is OK and expected.
so basically reverse notation has the advantage of keeping the magnitude of the digits relative to each other constant (or at least anchored to the beginning of the number)
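A rough illustration of that point (my own toy example, not from the papers): with default left-to-right chunking, the place value a chunk carries shifts as the number gets longer, whereas with right-to-left chunking the trailing chunks always line up with the same place values.

```python
# Compare left-to-right vs right-to-left chunking for numbers of
# different lengths; only the leading chunk varies in the R2L case.
def chunk_l2r(d, size=3):
    return [d[i:i + size] for i in range(0, len(d), size)]

def chunk_r2l(d, size=3):
    head = len(d) % size
    return ([d[:head]] if head else []) + [d[i:i + size] for i in range(head, len(d), size)]

for n in ["1234567", "12345678", "123456789"]:
    print(n, chunk_l2r(n), chunk_r2l(n))
# 1234567   ['123', '456', '7']    ['1', '234', '567']
# 12345678  ['123', '456', '78']   ['12', '345', '678']
# 123456789 ['123', '456', '789']  ['123', '456', '789']
```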
doesn't attention help with this? (or, it does help, but not much? or it falls out of autoregressive methods?)