
279 points matthewolfe | 1 comment

TokenDagger is a drop-in replacement for OpenAI’s Tiktoken (the tokenizer behind Llama 3, Mistral, GPT-3.*, etc.). It’s written in C++17 with thin Python bindings, keeps the exact same BPE vocab and special-token rules, and focuses on raw speed.

I’m teaching myself LLM internals by re-implementing the stack from first principles. Profiling Tiktoken’s Python/Rust implementation showed that a lot of time was spent on regex matching. Most of my perf gains come from (a) using a faster JIT-compiled regex engine, and (b) simplifying the algorithm to forgo regex matching for special tokens entirely, as sketched below.
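
To make (b) concrete, here is a minimal sketch of the idea in Python: match special tokens as literal substrings first, and only run the BPE regex on the text in between. The function name and structure are my own illustration, not TokenDagger’s actual code:

    # Sketch: scan for literal special tokens with str.find() instead of
    # compiling them into the regex alternation; only the plain text
    # between them goes through the BPE regex + merge loop.
    def split_on_special(text: str, special_tokens: dict[str, int]):
        """Yield (chunk, token_id) pairs; plain-text chunks get token_id None."""
        pos = 0
        while pos < len(text):
            best = None  # earliest (start, token, id) at or after pos
            for tok, tok_id in special_tokens.items():
                i = text.find(tok, pos)
                if i != -1 and (best is None or i < best[0]):
                    best = (i, tok, tok_id)
            if best is None:
                yield text[pos:], None  # no more special tokens
                return
            start, tok, tok_id = best
            if start > pos:
                yield text[pos:start], None  # plain text: BPE-tokenize this
            yield tok, tok_id                # special token: emit id directly
            pos = start + len(tok)

With many special tokens you’d want a proper multi-pattern matcher (e.g. Aho-Corasick), but even this naive version shows the point: the hot regex loop no longer pays for alternations that almost never match.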

Benchmarking code is included. Notable results:

- 4x faster code-sample tokenization on a single thread.
- 2-3x higher throughput on a 1 GB natural-language text file.
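
For reference, a comparison harness in that spirit might look like the following. The tokendagger import and its API shape are assumptions on my part (it’s billed as drop-in, so presumably it mirrors tiktoken); only the tiktoken calls are the real library API:

    import time
    import tiktoken
    # import tokendagger  # assumed drop-in: same get_encoding()/encode() surface

    def mb_per_s(encode, text: str, repeats: int = 5) -> float:
        """Best-of-N single-thread tokenization throughput in MB/s."""
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            encode(text)
            best = min(best, time.perf_counter() - t0)
        return len(text.encode("utf-8")) / best / 1e6

    text = open("sample.txt", encoding="utf-8").read()
    enc = tiktoken.get_encoding("o200k_base")
    print(f"tiktoken:    {mb_per_s(enc.encode, text):.1f} MB/s")
    # print(f"tokendagger: {mb_per_s(dag.encode, text):.1f} MB/s")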

pama No.44423489
Cool. Would it be possible to eliminate the little vocab-format conversion step I see in the test against tiktoken? It would be nice to have a fully compatible drop-in replacement without having to think about such details. It would also be nice to have examples that work the other way around: initialize tiktoken as you normally would, including any specialized extension of the standard tokenizers, then use that initialized tokenizer to initialize a new TokenDagger and test identity of results.
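
Something like this round trip is what I mean. Here from_tiktoken is a hypothetical constructor, and the underscore attributes are tiktoken internals, so treat this purely as a sketch of the desired API:

    import tiktoken
    # from tokendagger import TokenDagger  # hypothetical import

    # Initialize tiktoken as usual, including a specialized extension.
    base = tiktoken.get_encoding("cl100k_base")
    ext = tiktoken.Encoding(
        name="cl100k_ext",
        pat_str=base._pat_str,                   # private tiktoken attrs,
        mergeable_ranks=base._mergeable_ranks,   # used here only for
        special_tokens={**base._special_tokens,  # illustration
                        "<|my_marker|>": 100300},
    )

    # Hypothetical: build a TokenDagger from the live tiktoken object,
    # with no vocab format conversion, then test identity of results.
    # dagger = TokenDagger.from_tiktoken(ext)
    text = "some text with <|my_marker|> inside"
    expected = ext.encode(text, allowed_special="all")
    # assert dagger.encode(text, allowed_special="all") == expected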
replies(2): >>44425102 >>44425489
1. matthewolfe No.44425102
Ah good catch. Updating this right now.