The bitter lesson is coming for tokenization

(lucalp.dev)

296 points todsacerdoti | 3 comments | 24 Jun 25 14:14 UTC

smeeth ◴[24 Jun 25 17:15 UTC] No.44368465[source]▶

The main limitation of tokenization is actually logical operations, including arithmetic. IIRC most of the poor performance of LLMs for math problems can be attributed to some very strange things that happen when you do math with tokens.

I'd like to see a math/logic bench appear for tokenization schemes that captures this. BPB/perplexity is fine, but it's not everything.
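
To make the "strange things" concrete, here is a minimal Python sketch of how a subword vocabulary can split numbers into chunks that ignore place value. The vocabulary `TOY_VOCAB` and the greedy longest-match rule in `greedy_tokenize` are assumptions made up for illustration; real BPE tokenizers learn their merges from data and will split differently.

```python
# Hypothetical vocabulary: single digits plus a few multi-digit chunks.
# Real vocabularies are learned from data and differ per model.
TOY_VOCAB = {str(d) for d in range(10)} | {"12", "123", "45", "00"}

def greedy_tokenize(text, vocab):
    """Split text by repeatedly taking the longest vocab entry at the cursor."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocab entry matches at position {i}")
    return tokens

for number in ("1234", "12345", "2345"):
    print(number, greedy_tokenize(number, TOY_VOCAB))
```

In this toy setup, numbers that differ by one digit get chunk boundaries in different places, so a digit's position within a token tells the model little about whether it is in the ones, tens, or hundreds place.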

replies(6): >>44368862 #>>44369438 #>>44371781 #>>44373480 #>>44374125 #>>44375446 #

calibas ◴[24 Jun 25 17:52 UTC] No.44368862[source]▶

>>44368465 #

It's a non-deterministic language model, shouldn't we expect mediocre performance in math? It seems like the wrong tool for the job...

replies(4): >>44368958 #>>44368999 #>>44369121 #>>44372463 #

rictic ◴[24 Jun 25 18:19 UTC] No.44369121[source]▶

>>44368862 #

Models are deterministic: they're a mathematical function from sequences of tokens to probability distributions over the next token.

Then a system samples from that distribution, typically with randomness, and there are some optimizations in running them that introduce randomness, but it's important to understand that the models themselves are not random.
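
A rough Python sketch of that split between the model and the sampler, under stated assumptions: `toy_model`, its hard-coded logits, and the three-token `VOCAB` are hypothetical stand-ins for a real forward pass, not anything from an actual LLM.

```python
import math
import random

VOCAB = ["1", "2", "3"]  # hypothetical three-token vocabulary

def softmax(logits):
    """Convert logits to probabilities (max subtracted for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_model(context):
    """Pure function: the same context always yields the same distribution."""
    # Made-up logits; a real model would compute these from its weights.
    logits = [0.5, 4.0, 0.1] if context.endswith("1+1=") else [1.0, 1.0, 1.0]
    return softmax(logits)

def greedy(probs):
    """Deterministic decoding: always pick the most likely token."""
    return VOCAB[max(range(len(probs)), key=probs.__getitem__)]

def sample(probs, rng):
    """Stochastic decoding: draw a token in proportion to its probability."""
    return rng.choices(VOCAB, weights=probs, k=1)[0]

probs = toy_model("1+1=")
print(probs == toy_model("1+1="))      # the model itself is repeatable
print(greedy(probs))                   # no randomness involved
print(sample(probs, random.Random()))  # randomness enters only here
```

The only source of variation in this sketch is the `rng` passed to `sample`; swapping it for `greedy` makes the whole pipeline repeatable.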

replies(2): >>44369860 #>>44370679 #

geysersam ◴[24 Jun 25 20:29 UTC] No.44370679[source]▶

>>44369121 #

The LLMs are deterministic but they only return a probability distribution over following tokens. The tokens the user sees in the response are selected by some typically stochastic sampling procedure.

replies(1): >>44371710 #

1. danielmarkbruce ◴[24 Jun 25 22:28 UTC] No.44371710[source]▶

>>44370679 #

Assuming decent data, it won't be stochastic sampling for many math operations/input combinations. When people suggest LLMs with tokenization could learn math, they aren't suggesting a small undertrained model trained on crappy data.

replies(1): >>44372243 #

2. anonymoushn ◴[24 Jun 25 23:41 UTC] No.44372243[source]▶

>>44371710 (TP) #

I mean, this depends on your sampler. With temp=1 and sampling from the raw output distribution, setting aside numerics issues, these models assign nonzero probability to every token at each position.
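
As a small illustration of why every token gets nonzero probability, here is a Python sketch of softmax with a temperature parameter. The function name `softmax_with_temperature` and the logit values are assumptions chosen for the example, not taken from any real model or library.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities, dividing by the temperature first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits only.
logits = [8.0, 2.0, -5.0, -20.0]
probs = softmax_with_temperature(logits, temperature=1.0)
print(probs)
print(all(p > 0.0 for p in probs))
```

Mathematically the exponential is never zero, so no token is ever fully ruled out. In floating point, a very low temperature can push the smallest terms to underflow to exactly 0.0, which is the kind of numerics issue being set aside here.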

replies(1): >>44380103 #

3. danielmarkbruce ◴[25 Jun 25 17:58 UTC] No.44380103[source]▶

>>44372243 #

A large model well trained on good data will have logits so negative for something like "1+1=" -> 3 that they won't come up in practice unless you sample in a way to deliberately misuse the model.
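
To put rough numbers on "so negative", this Python sketch converts a logit gap between a correct and a wrong token into the probability of sampling the wrong one at temperature 1. The helper `wrong_token_probability` and the gap values are hypothetical, chosen for illustration rather than measured from a model.

```python
import math

def wrong_token_probability(logit_gap):
    """Probability of the lower-scored token when only two tokens compete.

    logit_gap = logit(correct) - logit(wrong). Other tokens are ignored,
    which makes this an upper bound once the rest of the vocabulary is added.
    """
    return 1.0 / (1.0 + math.exp(logit_gap))

for gap in (0.0, 5.0, 15.0, 30.0):  # hypothetical gaps
    p = wrong_token_probability(gap)
    print(f"gap={gap:5.1f}  p_wrong={p:.3e}")
```

The probability falls off exponentially with the gap, so a large enough gap makes the wrong token nonzero in principle but vanishingly rare under ordinary sampling.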
