
279 points matthewolfe | 3 comments

TokenDagger is a drop-in replacement for OpenAI’s Tiktoken (the tokenizer behind Llama 3, Mistral, GPT-3.*, etc.). It’s written in C++17 with thin Python bindings, keeps the exact same BPE vocab and special-token rules, and focuses on raw speed.
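To make "drop-in" concrete: if the claim holds, existing tiktoken code shouldn't need to change beyond the import. A hypothetical sketch (the tokendagger module name and exact API parity are assumptions based on the drop-in claim, not documented here):

```python
import tiktoken
# import tokendagger as tiktoken  # hypothetical swap; module name assumed
#                                 # from the drop-in claim

enc = tiktoken.get_encoding("cl100k_base")
text = "TokenDagger keeps the same BPE vocab and special-token rules."
tokens = enc.encode(text)
assert enc.decode(tokens) == text  # round-trip should be unchanged
```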

I’m teaching myself LLM internals by re-implementing the stack from first principles. Profiling Tiktoken’s Python/Rust implementation showed that a lot of time is spent on regex matching. Most of my perf gains come from (a) using a faster JIT-compiled regex engine, and (b) simplifying the algorithm to forgo regex matching for special tokens entirely (sketched below).
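A minimal sketch of idea (b): scan for special tokens with plain substring search instead of compiling them into a regex alternation. Illustrative only, not TokenDagger’s actual code:

```python
def split_on_special_tokens(text: str, special_tokens: list[str]):
    """Split text into (is_special, chunk) pairs using plain substring
    search instead of a compiled regex alternation over special tokens.
    Illustrative sketch, not TokenDagger's actual implementation."""
    out, i = [], 0
    while i < len(text):
        # Find the earliest occurrence of any special token at or after i.
        best = None
        for tok in special_tokens:
            j = text.find(tok, i)
            if j != -1 and (best is None or j < best[0]):
                best = (j, tok)
        if best is None:
            out.append((False, text[i:]))
            break
        j, tok = best
        if j > i:
            out.append((False, text[i:j]))  # ordinary text: regex + BPE later
        out.append((True, tok))             # special token: direct ID lookup
        i = j + len(tok)
    return out

print(split_on_special_tokens("hi<|endoftext|>bye", ["<|endoftext|>"]))
# [(False, 'hi'), (True, '<|endoftext|>'), (False, 'bye')]
```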

Benchmarking code is included. Notable results:

- 4x faster tokenization of code samples on a single thread.
- 2-3x higher throughput on a 1 GB natural-language text file.
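The shape of the throughput measurement is roughly this (a sketch, not the included benchmarking code; big_corpus.txt is a placeholder for any large text file):

```python
import time
import tiktoken

def throughput_mb_s(encode, text: str, repeats: int = 3) -> float:
    """Best-of-N single-thread throughput in MB/s for an encode callable."""
    nbytes = len(text.encode("utf-8"))
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        encode(text)
        best = min(best, time.perf_counter() - t0)
    return nbytes / best / 1e6

enc = tiktoken.get_encoding("cl100k_base")
text = open("big_corpus.txt", encoding="utf-8").read()
print(f"tiktoken: {throughput_mb_s(enc.encode, text):.1f} MB/s")
# Point the same harness at TokenDagger's encode to compare.
```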

1. kevmo314 No.44424134
Nice work! I tried something similar a while back: https://github.com/kevmo314/tokie

The takeaway I found as well was that the running cost is really dominated by pretokenization (the regex). It's cool to see that you found a faster way to run the regex, but have you tried comparing the performance of just swapping out the regex engine and leaving the actual BPE to tiktoken? I wonder if that's upstreamable?
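For a rough sense of how dominant the regex is, you can time the split against a full encode (a sketch; the pattern below is a simplified stand-in, since the real one lives in tiktoken's Rust source):

```python
import time
import regex      # third-party 'regex' package (supports \p{...} classes)
import tiktoken

# Simplified stand-in for the real split pattern in tiktoken's Rust code.
PAT = regex.compile(r"'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+")

enc = tiktoken.get_encoding("cl100k_base")
text = open("sample.txt", encoding="utf-8").read()  # any sizable text file

t0 = time.perf_counter()
pieces = PAT.findall(text)   # pretokenization only
t1 = time.perf_counter()
tokens = enc.encode(text)    # pretokenization + BPE merges
t2 = time.perf_counter()

print(f"regex split: {t1 - t0:.3f}s ({len(pieces)} pieces)")
print(f"full encode: {t2 - t1:.3f}s ({len(tokens)} tokens)")
```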

replies(2): >>44424322 >>44439086
2. matthewolfe No.44424322
Cool!

I've reached out to the guy who maintains Tiktoken to talk about this.

3. 22c No.44439086
There's already at least some awareness of the regex engine's performance:

https://github.com/openai/tiktoken/blob/main/src/lib.rs#L95-...