I'd like to see a math/logic bench appear for tokenization schemes that captures this. BPB/perplexity is fine, but it's not everything.
https://arxiv.org/abs/2402.14903
You tokenize digits right-to-left in groups of 3, so 1234567 becomes 1 234 567 rather than the default 123 456 7. And if you ensure all 1-3 digit groups are in the vocab, it does much better.
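To make the grouping concrete, here's a minimal Python sketch (my own illustration, not code from the paper; group_digits_r2l is a made-up helper name):

```python
# Split a digit string into groups of three from the right, so
# "1234567" -> ["1", "234", "567"] instead of the left-to-right
# default ["123", "456", "7"].
def group_digits_r2l(digits: str, size: int = 3) -> list[str]:
    head = len(digits) % size          # length of the leading short group, if any
    groups = [digits[:head]] if head else []
    groups += [digits[i:i + size] for i in range(head, len(digits), size)]
    return groups

print(group_digits_r2l("1234567"))  # ['1', '234', '567']
print(group_digits_r2l("123456"))   # ['123', '456']
```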
Both https://arxiv.org/abs/2503.13423 and https://arxiv.org/abs/2504.00178 (co-author) independently noted that you can do this just by modifying the pre-tokenization regex, without having to explicitly add commas.
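As a rough sketch of how the regex trick works (the exact pattern in those papers may differ; this is just an illustration with Python's re): a lookahead forces each 1-3 digit match to leave a multiple of three digits to its right, which gives right-to-left grouping without inserting commas.

```python
import re

# Each match is 1-3 digits, constrained so that the digits remaining
# after it come in complete groups of three.
R2L_DIGITS = re.compile(r"\d{1,3}(?=(?:\d{3})*(?!\d))")

print(R2L_DIGITS.findall("1234567"))    # ['1', '234', '567']
print(R2L_DIGITS.findall("987654321"))  # ['987', '654', '321']
```

If I remember right, in a GPT-4-style split pattern the analogous change would be to its \p{N}{1,3} piece.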
Inthesamewaythatweusepunctuation. Or even that we usually order words a certain way, oranges and apples, Ted and Bill, roundabouts and swings.
The bitter lesson is that general methods and a system that learns trump trying to manually embed/program human knowledge into the system, so clever architecture is OK and expected.
so basically reverse notation has the advantage of keeping the magnitude of the digits relative to each other constant (or at least anchored to the beginning of the number)
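A rough illustration of that point (my own toy example, not from the papers): with default left-to-right chunking, the place value a chunk carries shifts as the number gets longer, whereas with right-to-left chunking the trailing chunks always line up with the same place values.

```python
# Compare left-to-right vs right-to-left chunking for numbers of
# different lengths; only the leading chunk varies in the R2L case.
def chunk_l2r(d, size=3):
    return [d[i:i + size] for i in range(0, len(d), size)]

def chunk_r2l(d, size=3):
    head = len(d) % size
    return ([d[:head]] if head else []) + [d[i:i + size] for i in range(head, len(d), size)]

for n in ["1234567", "12345678", "123456789"]:
    print(n, chunk_l2r(n), chunk_r2l(n))
# 1234567   ['123', '456', '7']    ['1', '234', '567']
# 12345678  ['123', '456', '78']   ['12', '345', '678']
# 123456789 ['123', '456', '789']  ['123', '456', '789']
```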
doesn't attention help with this? (or, it does help, but not much? or it falls out of autoregressive methods?)