
279 points by matthewolfe | 3 comments

TokenDagger is a drop-in replacement for OpenAI’s tiktoken (the tokenizer behind Llama 3, Mistral, GPT-3.*, etc.). It’s written in C++17 with thin Python bindings, keeps exactly the same BPE vocab and special-token rules, and focuses on raw speed.
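To make “drop-in” concrete, here’s a minimal sketch assuming the package exposes a tiktoken-compatible interface; the commented import is the hypothetical swap, and the exact module name/API are assumptions, so check the repo for real usage:

```python
# Hypothetical drop-in swap; the tokendagger import line is an assumption.
import tiktoken
# import tokendagger as tiktoken  # the advertised swap: same API, faster core

enc = tiktoken.get_encoding("cl100k_base")  # same BPE vocab either way

tokens = enc.encode("TokenDagger keeps the exact same BPE vocab and special-token rules.")
print(tokens[:8], "...", len(tokens), "tokens")
assert enc.decode(tokens).startswith("TokenDagger")
```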

I’m teaching myself LLM internals by re-implementing the stack from first principles. Profiling tiktoken’s Python/Rust implementation showed that a lot of time was spent on regex matching. Most of my perf gains come from (a) using a faster JIT-compiled regex engine, and (b) simplifying the algorithm to forgo regex matching for special tokens entirely.
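To illustrate point (b): special tokens are a small, fixed set of literal strings, so they can be located with plain substring search instead of being folded into the big regex pass. A rough Python sketch of that idea (illustrative only, not TokenDagger’s actual C++ implementation):

```python
# Illustrative only: split out literal special tokens with plain string search,
# so the expensive regex/BPE pass only ever sees ordinary text.
SPECIAL_TOKENS = {"<|endoftext|>": 100257, "<|fim_prefix|>": 100258}  # example ids

def split_on_specials(text: str):
    """Yield (chunk, special_id) pairs; special_id is None for ordinary text."""
    i = 0
    while i < len(text):
        # Find the earliest special token at or after position i.
        hits = [(text.find(tok, i), tok) for tok in SPECIAL_TOKENS]
        hits = [(pos, tok) for pos, tok in hits if pos != -1]
        if not hits:
            yield text[i:], None
            return
        pos, tok = min(hits)
        if pos > i:
            yield text[i:pos], None          # ordinary text -> regex + BPE later
        yield tok, SPECIAL_TOKENS[tok]       # special token -> direct id lookup
        i = pos + len(tok)
```

Ordinary chunks then go through the usual regex pre-tokenization and BPE merges; the special tokens themselves never touch the regex engine.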

Benchmarking code is included. Notable results:

- 4x faster tokenization of code samples on a single thread.
- 2-3x higher throughput on a 1 GB natural-language text file.
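For context on how numbers like these are typically measured, a minimal single-thread timing harness might look like the sketch below; the corpus filename is a placeholder and the commented tokendagger import is an assumption, with the repo’s included benchmark scripts being the authoritative version:

```python
# Minimal single-thread throughput harness; assumes a tiktoken-compatible encoder.
import time

import tiktoken
# import tokendagger  # hypothetical: swap in the faster encoder the same way

def throughput_mb_s(encode, text: str, repeats: int = 3) -> float:
    """Return best-of-N throughput in MB/s for a single-threaded encode() call."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        encode(text)
        best = min(best, time.perf_counter() - start)
    return len(text.encode("utf-8")) / best / 1e6

text = open("sample_corpus.txt", encoding="utf-8").read()  # placeholder corpus file
enc = tiktoken.get_encoding("cl100k_base")
print(f"tiktoken: {throughput_mb_s(enc.encode, text):.1f} MB/s")
```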

npalli ◴[] No.44422888[source]
Kudos. I think (in the short term at least) there is a lot of perf optimization to be found by writing parts of the AI/ML infrastructure in C++ like this one: not as a rewrite (god no!), but as drop-in replacements that fix key bottlenecks. Anytime I see someone (Chinese engineers seem especially good at this) put something out in C++, there's a good chance solid engineering tradeoffs have been made and a dramatic improvement will follow.
replies(4): >>44424382 #>>44424572 #>>44424990 #>>44427963 #
saretup ◴[] No.44424572[source]
And while we’re at it, let’s move away from Python altogether. In the long run it doesn’t make sense to keep it just because it’s the language ML engineers are familiar with.
replies(3): >>44424608 #>>44425185 #>>44425265 #
1. bigyabai ◴[] No.44425185[source]
It makes plenty of sense. Python handles strings well, has a great package ecosystem, and is easy to write and learn for non-programmers. It can be easily embedded into a notebook (which is huge for academics) and is, at least in theory, a "write once, run anywhere" platform. It's great.

If you think Python is a bad language for AI integrations, try writing one in a compiled language.

replies(1): >>44429751 #
2. mdaniel ◴[] No.44429751[source]
> has a great package ecosystem

So great there are 8 of them. 800% better than all the rest!

> If you think Python is a bad language for AI integrations, try writing one in a compiled language.

I'll take that challenge, all day, every day, so long as I and the hypothetical 'move fast and break things' dev are held to the same "must run in prod" and "must be understandable by some other human" standards.

What type is `array`? Don't worry your pretty head about it: feed it whatever type you want and let Sentry's TypeError sort it out <https://github.com/openai/whisper/blob/v20250625/whisper/aud...>. Oh, sorry, and you wanted to know what `pad_or_trim` returns? Well, that's just, like, your opinion, man.
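For the record, the complaint is about untyped signatures: a couple of annotations would answer both questions at the call site. A hypothetical annotated version of such a helper (a sketch for illustration, not Whisper's actual code or signature):

```python
# Hypothetical annotated pad/trim helper; not Whisper's actual implementation.
import numpy as np

def pad_or_trim(array: np.ndarray, length: int) -> np.ndarray:
    """Zero-pad or trim along the last axis so it has exactly `length` samples."""
    if array.shape[-1] > length:
        return array[..., :length]
    pad_width = [(0, 0)] * (array.ndim - 1) + [(0, length - array.shape[-1])]
    return np.pad(array, pad_width)
```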

replies(1): >>44429894 #
3. bigyabai ◴[] No.44429894[source]
Tracks with me; I don't like using Python for real programming either. Try explaining any of your "Python sucks" catechisms to a second-year statistics student, though. If you'd rather teach them C++, be my guest. If you want to make them dependent on proprietary infra like Mojo or CUDA, knock yourself out.

I'm still teaching them Python.