296 points todsacerdoti | 32 comments

smeeth ◴[] No.44368465[source]
The main limitation of tokenization is actually logical operations, including arithmetic. IIRC most of the poor performance of LLMs for math problems can be attributed to some very strange things that happen when you do math with tokens.

I'd like to see a math/logic bench appear for tokenization schemes that captures this. BPB/perplexity is fine, but it's not everything.

replies(6): >>44368862 #>>44369438 #>>44371781 #>>44373480 #>>44374125 #>>44375446 #
1. cschmidt ◴[] No.44369438[source]
This paper has a good solution:

https://arxiv.org/abs/2402.14903

You tokenize digits right-to-left in groups of 3, so 1234567 becomes 1 234 567 rather than the default 123 456 7. And if you ensure all 1-3 digit groups are in the vocab, it does much better.

Both https://arxiv.org/abs/2503.13423 and https://arxiv.org/abs/2504.00178 (co-author) independently noted that you can do this just by modifying the pre-tokenization regex, without having to explicitly add commas.
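
To make the idea concrete, here's a minimal Python sketch of that right-to-left grouping (the regex and function name are just illustrative, not taken from either paper; a real tokenizer would apply something like this as its pre-tokenization split rule):

    import re

    # Split each run of digits into groups of 3 anchored at the *end* of the run,
    # so 1234567 -> 1 234 567 instead of the left-to-right default 123 456 7.
    # The lookahead requires that the digits remaining after a group form whole
    # groups of 3, which forces any short group to land at the front.
    DIGIT_GROUPS = re.compile(r"\d{1,3}(?=(?:\d{3})*(?!\d))")

    def split_digits_right_to_left(text: str) -> list[str]:
        return DIGIT_GROUPS.findall(text)

    print(split_digits_right_to_left("1234567"))  # ['1', '234', '567']
    print(split_digits_right_to_left("987654"))   # ['987', '654']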

replies(3): >>44372335 #>>44374721 #>>44374882 #
2. jvanderbot ◴[] No.44372335[source]
Ok great! This is precisely how I chunk numbers for comparison. And not to diminish a solid result or the usefulness of it or the baseline tech: it's clear that if we keep having to create situation-specific inputs or processes, we're not at AGI with this baseline tech
replies(1): >>44373437 #
3. chmod775 ◴[] No.44373437[source]
> [..] we're not at AGI with this baseline tech

DAG architectures fundamentally cannot be AGI and you cannot even use them as a building block for a hypothetical AGI if they're immutable at runtime.

Any time I hear the goal being "AGI" in the context of these LLMs, I feel like I'm listening to a bunch of 18th-century aristocrats trying to get to the moon by growing trees.

Try to create useful approximations using what you have or look for new approaches, but don't waste time on the impossible. There are no iterative improvements here that will get you to AGI.

replies(4): >>44373686 #>>44375069 #>>44376414 #>>44385536 #
4. kristjansson ◴[] No.44373686{3}[source]
> "So... what does the thinking?"

> "You're not understanding, are you? The brain does the thinking. The meat."

> "Thinking meat! You're asking me to believe in thinking meat!"

https://www.mit.edu/people/dpolicar/writing/prose/text/think...

5. nielsole ◴[] No.44374721[source]
Isn't that the opposite of the bitter lesson - adding more cleverness to the architecture?
replies(3): >>44376102 #>>44379736 #>>44381068 #
6. Y_Y ◴[] No.44374882[source]
What do the vector space embeddings for digit strings even look like? Can you do arithmetic on them? If that's even desirable, it seems like you could just skip "embedding" altogether and intern all the numbers along one dimension.
7. mgraczyk ◴[] No.44375069{3}[source]
This is meant to be some kind of Chinese room argument? Surely a 1e18 context window model running at 1e6 tokens per second could be AGI.
replies(3): >>44375232 #>>44375489 #>>44376558 #
8. lukan ◴[] No.44375232{4}[source]
"Surely a 1e18 context window model running at 1e6 tokens per second could be AGI."

And why?

replies(1): >>44379407 #
9. chmod775 ◴[] No.44375489{4}[source]
Personally I'm hoping for advancements that will eventually allow us to build vehicles capable of reaching the moon, but do keep me posted on those tree growing endeavors.
replies(1): >>44376291 #
10. cschmidt ◴[] No.44376102[source]
I suppose it is. There is a lot to tokenization - pre-tokenization, how to handle digits, the tokenization training approach - that is about adding cleverness. In the long run, the bitter lesson would be to just get rid of it all and learn from more data. Many people would love to do it. But I think for the case of BLT, digits will still be an issue. There is no way an autoregressive entropy model will be able to split numbers sensibly, since it has no idea how many digits are coming. It seems like it will struggle more with arithmetic. Perhaps you could reverse all the digits in a number; then it has a chance. So 12334 becomes 43321, and it gets to start from the ones digit. This has been suggested as an approach for LLMs.
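
A rough sketch of that reversal as a preprocessing step (hypothetical helper, just to illustrate the idea; a full pipeline would also need to un-reverse digit runs in the model's output):

    import re

    def reverse_digit_runs(text: str) -> str:
        # Reverse every run of digits so the ones digit comes first, letting an
        # autoregressive model emit results least-significant-digit first,
        # the same direction in which carries propagate.
        return re.sub(r"\d+", lambda m: m.group(0)[::-1], text)

    print(reverse_digit_runs("12345 + 678 = 13023"))  # "54321 + 876 = 32031"
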
replies(2): >>44377308 #>>44379009 #
11. mgraczyk ◴[] No.44376291{5}[source]
Tree growing?

And I don't follow, we've had vehicles capable of reaching the moon for over 55 years

replies(2): >>44376901 #>>44378970 #
12. AllegedAlec ◴[] No.44376414{3}[source]
Thank you. It's maddening how people keep making this fundamental mistake.
13. rar00 ◴[] No.44376558{4}[source]
This argument works better for state space models. A transformer would still step through the context one token at a time, not maintain an internal 1e18 state.
replies(1): >>44379401 #
14. VonGallifrey ◴[] No.44376901{6}[source]
Excuse me for the bad joke, but it seems like your context window was too small.

The Tree growing comment was a reference to another comment earlier in the comment chain.

replies(1): >>44379390 #
15. infogulch ◴[] No.44377308{3}[source]
Little endian wins in the end.
16. anonymoushn ◴[] No.44378970{6}[source]
It's about the immutability of the network at runtime. But I really don't think this is a big deal. General-purpose computers are immutable after they are manufactured, but can exhibit a variety of useful behaviors when supplied with different data. Human intelligence also doesn't rely on designing and manufacturing revised layouts for the nervous system (within a single human's lifetime, for use by that single human) to adapt to different settings. Is the level of mutability used by humans substantially more expressive than the limits of in-context learning? What about the limits of more unusual in-context learning techniques that are register-like, or that perform steps of gradient descent during inference? I don't know of a good argument that all of these techniques used in ML are fundamentally not expressive enough.
replies(1): >>44379734 #
17. pas ◴[] No.44379009{3}[source]
... why does reversing all the digits help? could you please explain it? many thanks!
replies(1): >>44386567 #
18. mgraczyk ◴[] No.44379390{7}[source]
It's not a tree though
19. mgraczyk ◴[] No.44379401{5}[source]
That doesn't matter. Are you familiar with any theoretical results in which the computation is somehow limited in ways that practically matter when the context length is very long? I am not
20. mgraczyk ◴[] No.44379407{5}[source]
Because that's quite a bit more information processing than any human brain
replies(1): >>44379674 #
21. lukan ◴[] No.44379674{6}[source]
I don't think it is quantity that matters. Otherwise supercomputers would be smart by definition.
replies(1): >>44379719 #
22. mgraczyk ◴[] No.44379719{7}[source]
Well no, that's not what anyone is saying.

The claim was that it isn't possible in principle for "DAGs" or "immutable architectures" to be intelligent. That statement is confusing some theoretical results that aren't applicable to how LLMs work (output context is mutation).

I'm not claiming that compute makes them intelligent. I'm pointing out that it is certainly possible, and at that level of compute it should be plausible. Feel free to share any theoretical results that you think demonstrate the impossibility of "DAG" intelligence and are applicable.

replies(1): >>44385233 #
23. mgraczyk ◴[] No.44379734{7}[source]
LLMs, considered as a function of input and output, are not immutable at runtime. They create tokens that change the function when it is called again. That breaks most theoretical arguments
replies(1): >>44380342 #
24. fennecbutt ◴[] No.44379736[source]
I guess it's just working with the brain model (so to speak) rather than against it.

Inthesamewaythatweusepunctuation. Or even that we usually order words a certain way, oranges and apples, Ted and Bill, roundabouts and swings.

25. anonymoushn ◴[] No.44380342{8}[source]
Sure. Another view is that an LLM is an immutable function from document-prefixes to next-token distributions.
replies(1): >>44380391 #
26. mgraczyk ◴[] No.44380391{9}[source]
But that view is wrong: the model outputs multiple tokens.

The right alternative view is that it's an immutable function from prefixes to a distribution over all possible sequences of tokens less than (context_len - prefix_len).

There are no mutable functions that cannot be viewed as immutable in a similar way. Human brains are an immutable function from input sense-data to the combination (brain adaptation, output actions). Here "brain adaptation" is doing a lot of work, but so would be "1e18 output tokens". There is much more information contained within the latter.

27. RaftPeople ◴[] No.44381068[source]
> Isn't that the opposite of the bitter lesson - adding more cleverness to the architecture?

The bitter lesson is that general methods and a system that learns trump trying to manually embed/program human knowledge into the system, so clever architecture is ok and expected.

28. lukan ◴[] No.44385233{8}[source]
I am not saying it is impossible. I am saying it might be possible, but far from plausible with the current approach of LLMs, in my experience with them.
29. munksbeer ◴[] No.44385536{3}[source]
It doesn't feel particularly interesting to keep dismissing "these LLMs" as incapable of reaching AGI.

It feels more interesting to note that this time, it is different. I've been watching the field since the 90s when I first dabbled in crude neural nets. I am informed there was hype before, but in my time I've never seen progress like we've made in the last five years. If you showed it to people from the 90s, it would be mind blowing. And it keeps improving incrementally, and I do not think that is going to stop. The state of AI today is the worst it will ever be (trivially obvious but still capable of shocking me).

What I'm trying to say is that the shocking success of LLMs has become a powerful engine of progress, creating a positive feedback loop that is dramatically increasing investment, attracting top talent, and sharpening the focus of research into the next frontiers of artificial intelligence.

replies(1): >>44387575 #
30. cschmidt ◴[] No.44386567{4}[source]
Arithmetic works right to left through the digits, while we write numbers left to right. So if you see the digits 123... in an autoregressive manner, you really don't know anything about their place values, since the number could be 12345 or 1234567. If you flip 12345 to 54321, then as you see 543... you know the place value of each digit: the 5 you encounter first is in the ones place, the 4 is in the tens place, etc. It gives the LLM a better chance of learning arithmetic.
replies(1): >>44401903 #
31. dTal ◴[] No.44387575{4}[source]
>If you showed it to people from the 90s, it would be mind blowing

90's? It's mind blowing to me now.

My daily driver laptop is (internally) a Thinkpad T480, a very middle of the road business class laptop from 2018.

It now talks to me. Usually knowledgeably, in a variety of common languages, using software I can download and run for free. It understands human relationships and motivations. It can offer reasonable advice and write simple programs from a description. It notices my tone and tries to adapt its manner.

All of this was inconceivable when I bought the laptop - I would have called it very unrealistic sci-fi. I am trying not to forget that.

32. pas ◴[] No.44401903{5}[source]
ah, okay, thanks!

so basically reversed notation has the advantage of keeping the magnitude of the digits relative to each other constant (or at least anchored to the beginning of the number)

doesn't attention help with this? (or, it does help, but not much? or it falls out of autoregressive methods?)