The bitter lesson is coming for tokenization

(lucalp.dev)

296 points todsacerdoti | 1 comments | 24 Jun 25 14:14 UTC | HN request time: 0.39s | source

Show context

smeeth ◴[24 Jun 25 17:15 UTC] No.44368465[source]▶

The main limitation of tokenization is actually logical operations, including arithmetic. IIRC most of the poor performance of LLMs for math problems can be attributed to some very strange things that happen when you do math with tokens.

I'd like to see a math/logic bench appear for tokenization schemes that captures this. BPB/perplexity is fine, but its not everything.

replies(6): >>44368862 #>>44369438 #>>44371781 #>>44373480 #>>44374125 #>>44375446 #

cschmidt ◴[24 Jun 25 18:45 UTC] No.44369438[source]▶

>>44368465 #

This paper has a good solution:

https://arxiv.org/abs/2402.14903

You right to left tokenize in groups of 3, so 1234567 becomes 1 234 567 rather than the default 123 456 7. And if you ensure all 1-3 digits groups are in the vocab, it does much better.

Both https://arxiv.org/abs/2503.13423 and https://arxiv.org/abs/2504.00178 (co-author) both independently noted that you can do this with just by modifying the pre-tokenization regex, without having to explicitly add commas.

replies(3): >>44372335 #>>44374721 #>>44374882 #

nielsole ◴[25 Jun 25 07:58 UTC] No.44374721[source]▶

>>44369438 #

Isn't that the opposite of the bitter lesson - adding more cleverness to the architecture?

replies(3): >>44376102 #>>44379736 #>>44381068 #

cschmidt ◴[25 Jun 25 11:37 UTC] No.44376102[source]▶

>>44374721 #

I suppose it is. There is a lot to tokenization - pre-tokenization, how to handle digits, the tokenization training approach - that is about adding cleverness. In the long run, the bitter lesson would be to just get rid of it all and learn from more data. Many people would love to do it. But I think for the case of BLT, digits will still be an issue. There is no way an autoregressive entropy model will be able to split numbers sensibly, since it has no idea how many digits are coming. It seems like it will struggle more with arithmetic. Perhaps you could reverse all the digits in a number, then it has a chance. So 12334 becomes 43321, and it gets to start from the ones digit. This has been suggested as an approach for LLM's.

replies(2): >>44377308 #>>44379009 #

1. infogulch ◴[25 Jun 25 13:47 UTC] No.44377308[source]▶

>>44376102 #

Little endian wins in the end.

↑