The bitter lesson is coming for tokenization

(lucalp.dev)

Show context

smeeth ◴[24 Jun 25 17:15 UTC] No.44368465[source]▶

The main limitation of tokenization is actually logical operations, including arithmetic. IIRC most of the poor performance of LLMs for math problems can be attributed to some very strange things that happen when you do math with tokens.

I'd like to see a math/logic bench appear for tokenization schemes that captures this. BPB/perplexity is fine, but its not everything.

replies(6): >>44368862 #>>44369438 #>>44371781 #>>44373480 #>>44374125 #>>44375446 #

cschmidt ◴[24 Jun 25 18:45 UTC] No.44369438[source]▶

>>44368465 #

This paper has a good solution:

https://arxiv.org/abs/2402.14903

You right to left tokenize in groups of 3, so 1234567 becomes 1 234 567 rather than the default 123 456 7. And if you ensure all 1-3 digits groups are in the vocab, it does much better.

Both https://arxiv.org/abs/2503.13423 and https://arxiv.org/abs/2504.00178 (co-author) both independently noted that you can do this with just by modifying the pre-tokenization regex, without having to explicitly add commas.

replies(3): >>44372335 #>>44374721 #>>44374882 #

jvanderbot ◴[24 Jun 25 23:58 UTC] No.44372335[source]▶

>>44369438 #

Ok great! This is precisely how I chunk numbers for comparison. And not to diminish a solid result or the usefulness of it or the baseline tech: its clear that it we keep having to create situation - specific inputs or processes, we're not at AGI with this baseline tech

replies(1): >>44373437 #

chmod775 ◴[25 Jun 25 03:37 UTC] No.44373437[source]▶

>>44372335 #

> [..] we're not at AGI with this baseline tech

DAG architectures fundamentally cannot be AGI and you cannot even use them as a building block for a hypothetical AGI if they're immutable at runtime.

Any time I hear the goal being "AGI" in the context of these LLMs, I feel like listening to a bunch of 18th-century aristocrats trying to get to the moon by growing trees.

Try to create useful approximations using what you have or look for new approaches, but don't waste time on the impossible. There's no iterative improvements here that will get you to AGI.

replies(4): >>44373686 #>>44375069 #>>44376414 #>>44385536 #

mgraczyk ◴[25 Jun 25 08:57 UTC] No.44375069{4}[source]▶

>>44373437 #

This is meant to be some kind of Chinese room argument? Surely a 1e18 context window model running at 1e6 tokens per second could be AGI.

replies(3): >>44375232 #>>44375489 #>>44376558 #

chmod775 ◴[25 Jun 25 10:04 UTC] No.44375489{5}[source]▶

>>44375069 #

Personally I'm hoping for advancements that will eventually allow us to build vehicles capable of reaching the moon, but do keep me posted on those tree growing endeavors.

replies(1): >>44376291 #

1. mgraczyk ◴[25 Jun 25 12:02 UTC] No.44376291{6}[source]▶

>>44375489 #

Tree growing?

And I don't follow, we've had vehicles capable of reaching the moon for over 55 years

replies(2): >>44376901 #>>44378970 #

2. VonGallifrey ◴[25 Jun 25 13:08 UTC] No.44376901[source]▶

>>44376291 (TP) #

Excuse me for the bad joke, but it seems like your context window was too small.

The Tree growing comment was a reference to another comment earlier in the comment chain.

replies(1): >>44379390 #

3. anonymoushn ◴[25 Jun 25 16:15 UTC] No.44378970[source]▶

>>44376291 (TP) #

It's about the immutability of the network at runtime. But I really don't think this is a big deal. General-purpose computers are immutable after they are manufactured, but can exhibit a variety of useful behaviors when supplied with different data. Human intelligence also doesn't rely on designing and manufacturing revised layouts for the nervous system (within a single human's lifetime, for use by that single human) to adapt to different settings. Is the level of mutability used by humans substantially more expressive than the limits of in-context learning? what about the limits of more unusual in-context learning techniques that are register-like, or that perform steps of gradient descent during inference? I don't know of a good argument that all of these techniques used in ML are fundamentally not expressive enough.

replies(1): >>44379734 #

4. mgraczyk ◴[25 Jun 25 16:51 UTC] No.44379390[source]▶

>>44376901 #

It's not a tree though

5. mgraczyk ◴[25 Jun 25 17:21 UTC] No.44379734[source]▶

>>44378970 #

LLMs, considered as a function of input and output, are not immutable at runtime. They create tokens that change the function when it is called again. That breaks most theoretical arguments

replies(1): >>44380342 #

6. anonymoushn ◴[25 Jun 25 18:20 UTC] No.44380342{3}[source]▶

>>44379734 #

Sure. Another view is that an LLM is an immutable function from document-prefixes to next-token distributions.

replies(1): >>44380391 #

7. mgraczyk ◴[25 Jun 25 18:24 UTC] No.44380391{4}[source]▶

>>44380342 #

But that view is wrong, the model outputs multiple tokens.

The right alternative view is that it's an immutable function from prefixes to a distribution over all possible sequences of tokens less than (context_len - prefix_len).

There are no mutable functions that cannot be viewed as immutable in a similar way. Human brains are an immutable function from input sense-data to the combination (brain adaptation, output actions). Here "brain adaptation" doing a lot of work, but so would be "1e18 output tokens". There is much more information contained within the latter

↑