(lucalp.dev)

296 points todsacerdoti | 1 comments | 24 Jun 25 14:14 UTC


smeeth ◴[24 Jun 25 17:15 UTC] No.44368465[source]▶

The main limitation of tokenization is actually logical operations, including arithmetic. IIRC most of the poor performance of LLMs for math problems can be attributed to some very strange things that happen when you do math with tokens.

I'd like to see a math/logic benchmark appear for tokenization schemes that captures this. BPB/perplexity is fine, but it's not everything.

replies(6): >>44368862 #>>44369438 #>>44371781 #>>44373480 #>>44374125 #>>44375446 #

search_facility ◴[24 Jun 25 22:37 UTC] No.44371781[source]▶

>>44368465 #

Regarding “math with tokens”: there was a paper on a tokenization scheme that has specific tokens for integers, where token value = number. The model learned to work with numbers as numbers and with tokens for everything else... it was good at math. Can't find a link; it was on Hugging Face papers.

replies(1): >>44372446 #

samus ◴[25 Jun 25 00:16 UTC] No.44372446[source]▶

>>44371781 #

Shouldn't production models already do this? They already tend to use tokenizers with complex rules to deal with a lot of input that would otherwise be tokenized in a suboptimal way. I recall a bug in an inference engine (maybe llama.cpp?) because of an implementation difference in their regex engine compared to the model trainer. Which means that the tokenizer used regex-based rules to chop up the input.
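To illustrate what "regex-based rules to chop up the input" can look like, here is a minimal sketch. The pattern is hypothetical and not taken from any real model or inference engine; it only assumes a rule that splits digit runs into chunks of at most three, applied left to right.

```python
import re

# Hypothetical pre-tokenization pattern (an assumption, not any real model's):
# runs of up to three digits, runs of letters, whitespace, or other symbols.
PRETOKENIZE = re.compile(r"\d{1,3}|[A-Za-z]+|\s+|[^\sA-Za-z\d]+")


def pretokenize(text: str) -> list[str]:
    """Split text into chunks that a subword tokenizer would then encode."""
    return PRETOKENIZE.findall(text)


print(pretokenize("12345 + 678"))
# ['123', '45', ' ', '+', ' ', '678']
```

Because the digits are grouped from the left, "12345" becomes "123" and "45", so the chunks do not line up with place value. A regex engine that matches even slightly differently would produce different chunks from the same text, which is the kind of mismatch described above.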

replies(1): >>44381905 #

search_facility ◴[25 Jun 25 21:15 UTC] No.44381905[source]▶

>>44372446 #

Turns out, no. Intuitively they should do this for sure, but they don't.

UPD: Found the paper: - https://huggingface.co/papers/2502.09741 - https://fouriernumber.github.io/

In the paper, a “number” is a single sort-of “token” with a numeric value, so the network deals with numbers as real numbers, separately from their character representation. All the math happens directly on the “number value”. In the majority of current models, numbers are handled as sequences of characters.
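A rough sketch of the "number as a single token with a value" idea as described in this comment. This is not the paper's implementation; the `<num>` placeholder, the `Token` type, and the regex are all assumptions made for illustration.

```python
import re
from typing import NamedTuple, Optional


class Token(NamedTuple):
    text: str
    value: Optional[float]  # numeric value for number tokens, otherwise None


NUM = "<num>"  # hypothetical placeholder token name

# Numbers (with optional decimal part) are matched before words and symbols.
PATTERN = re.compile(
    r"(?P<num>\d+(?:\.\d+)?)|(?P<other>\w+|[^\w\s])", re.ASCII
)


def tokenize(text: str) -> list[Token]:
    """Emit one <num> token per number, carrying its value alongside."""
    tokens = []
    for match in PATTERN.finditer(text):
        if match.lastgroup == "num":
            tokens.append(Token(NUM, float(match.group())))
        else:
            tokens.append(Token(match.group(), None))
    return tokens


for token in tokenize("add 12 and 3.5"):
    print(token)
```

Here every number maps to the same `<num>` token and the value travels next to it, so a model could in principle operate on the value directly instead of on a sequence of digit characters.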


The bitter lesson is coming for tokenization