279 points matthewolfe | 1 comment

TokenDagger is a drop-in replacement for OpenAI’s Tiktoken (the tokenizer behind Llama 3, Mistral, GPT-3.*, etc.). It’s written in C++17 with thin Python bindings, keeps the exact same BPE vocab and special-token rules, and focuses on raw speed.

I’m teaching myself LLM internals by re-implementing the stack from first principles. Profiling Tiktoken’s Python/Rust implementation showed that a lot of time was spent on regex matching. Most of my perf gains come from a) using a faster JIT-compiled regex engine; and b) simplifying the algorithm to forgo regex matching of special tokens entirely.
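To illustrate idea (b), here is a minimal sketch of splitting text around special tokens with plain substring search instead of a regex. It is not TokenDagger's actual code: the function name `split_on_special` and the example token list are assumptions made for illustration, and a real implementation would likely use a faster multi-pattern search.

```python
def split_on_special(text, special_tokens):
    """Yield (chunk, is_special) pairs using substring search, no regex.

    Hypothetical helper for illustration; not part of TokenDagger or Tiktoken.
    """
    pos = 0
    n = len(text)
    while pos < n:
        # Find the earliest special token at or after pos.
        # On a tie in position, prefer the longest token.
        best_idx, best_tok = -1, None
        for tok in special_tokens:
            if not tok:
                continue  # an empty token would never advance pos
            idx = text.find(tok, pos)
            if idx == -1:
                continue
            if (best_tok is None or idx < best_idx
                    or (idx == best_idx and len(tok) > len(best_tok))):
                best_idx, best_tok = idx, tok
        if best_tok is None:
            # No special token left: the rest is ordinary text.
            yield text[pos:], False
            return
        if best_idx > pos:
            yield text[pos:best_idx], False
        yield best_tok, True
        pos = best_idx + len(best_tok)


chunks = list(split_on_special("hi<|endoftext|>there", ["<|endoftext|>"]))
```

Only the ordinary chunks would then go through the regex pre-tokenizer and BPE merges; each special chunk maps directly to its reserved token id.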

Benchmarking code is included. Notable results:
- 4x faster code sample tokenization on a single thread.
- 2-3x higher throughput when tested on a 1GB natural language text file.

1. justinhj No.44439670
I've been playing with tokenization too. Starting from Karpathy's Python minbpe, I set myself the task of training a tokenizer on wikitext (500 MB) in a reasonable time. I got the C++ version down to about 50 minutes, compared to an estimated several months for the original Python code.
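For context on why naive training is so slow, here is a rough sketch of a byte-level BPE training loop in the spirit of minbpe. The names `count_pairs`, `merge`, and `train` are my own for illustration and are not copied from either repository. Each merge step rescans the whole sequence, so the cost grows with corpus size times the number of merges.

```python
from collections import Counter


def count_pairs(ids):
    """Count every adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of pair with new_id."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text, num_merges):
    """Naive training: start from raw UTF-8 bytes, merge the top pair each step."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        pairs = count_pairs(ids)  # full rescan on every step
        if not pairs:
            break
        pair, _ = pairs.most_common(1)[0]
        new_id = 256 + step  # ids 0..255 are reserved for raw bytes
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


ids, merges = train("aaabdaaabac", 1)
```

Faster trainers usually avoid the full rescan by updating pair counts incrementally after each merge.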

Haven't really spent much time looking at encode and decode but I plan to incorporate these regex modifications when I do!

https://github.com/justinhj/minbpe-cc