(github.com)

279 points matthewolfe | 3 comments | 30 Jun 25 12:33 UTC

TokenDagger is a drop-in replacement for OpenAI’s Tiktoken (the tokenizer behind Llama 3, Mistral, GPT-3.*, etc.). It’s written in C++17 with thin Python bindings, keeps the exact same BPE vocab/special-token rules, and focuses on raw speed.

I’m teaching myself LLM internals by re-implementing the stack from first principles. Profiling Tiktoken’s Python/Rust implementation showed that a lot of time was spent on regex matching. Most of my perf gains come from (a) using a faster JIT-compiled regex engine and (b) simplifying the algorithm to forgo regex matching for special tokens entirely.
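To illustrate point (b), here is a minimal Python sketch of locating special tokens with plain substring search instead of a regex. It is not TokenDagger's actual code (which is C++), and the function name `split_on_special` is hypothetical. The idea is just that special tokens are fixed literal strings, so a literal search is enough to cut the input into ordinary chunks (which would still go through the BPE pre-tokenizer) and special-token chunks.

```python
def split_on_special(text, special_tokens):
    """Split text into (chunk, is_special) pairs using substring search only.

    Hypothetical helper for illustration; not part of TokenDagger or Tiktoken.
    """
    # Longest first, so that two specials starting at the same index
    # resolve to the longer one. Empty strings are ignored.
    specials = sorted((t for t in special_tokens if t), key=len, reverse=True)
    pieces = []
    pos = 0
    while pos < len(text):
        # Find the earliest occurrence of any special token at or after pos.
        best_idx, best_tok = -1, None
        for tok in specials:
            idx = text.find(tok, pos)
            if idx != -1 and (best_idx == -1 or idx < best_idx):
                best_idx, best_tok = idx, tok
        if best_tok is None:
            # No special token left: the rest is ordinary text.
            pieces.append((text[pos:], False))
            break
        if best_idx > pos:
            # Ordinary text before the special token.
            pieces.append((text[pos:best_idx], False))
        pieces.append((best_tok, True))
        pos = best_idx + len(best_tok)
    return pieces
```

A real implementation would likely use a faster multi-pattern search than this loop over `str.find`, but the control flow is the same: ordinary chunks get regex pre-tokenization and BPE, special chunks map straight to their token IDs.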

Benchmarking code is included. Notable results show:

- 4x faster code sample tokenization on a single thread.
- 2-3x higher throughput when tested on a 1GB natural language text file.
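For anyone who wants a quick sanity check of their own, a throughput number like the ones above can be approximated with a small timing harness. This is a rough sketch, not the benchmark code shipped in the repo. The name `throughput_mb_per_s` is hypothetical, and `encode` stands for any callable that takes a string, such as a tokenizer's encode method.

```python
import time


def throughput_mb_per_s(encode, text, repeats=3):
    """Return the best observed throughput of encode(text), in MB/s.

    Hypothetical helper for illustration; `encode` is any str -> tokens callable.
    """
    n_bytes = len(text.encode("utf-8"))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        encode(text)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Guard against a zero reading from timer resolution on tiny inputs.
    return n_bytes / max(best, 1e-9) / 1e6
```

Taking the best of several runs reduces noise from caches and scheduling. Comparing two tokenizers fairly still requires the same input text, the same vocabulary, and the same thread count.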

1. silentsea90 ◴[30 Jun 25 15:38 UTC] No.44424606[source]▶

>>44422480 (OP) #

"I’m teaching myself LLM internals by re-implementing the stack from first principles." - curious what resources you're using? Any books or courses, or just building it straight up? Great work!

replies(2): >>44424836 #>>44424942 #

2. ◴[30 Jun 25 16:01 UTC] No.44424836[source]▶

>>44424606 (TP) #

3. matthewolfe ◴[30 Jun 25 16:09 UTC] No.44424942[source]▶

>>44424606 (TP) #

Modal's GPU glossary is a good overview of how GPUs work [0]. Karpathy's LLM overview is a good high-level introduction to LLMs [1]. 3b1b's video (and subsequent videos) on transformers was excellent at helping me understand the math at a high level [2]. This matrix multiplication optimization worklog helped me understand how to write better CUDA (it's not a beginner intro, though) [3].

During this process I also asked ChatGPT a lot of questions.

I'm definitely open to suggestions about "how to learn" with all the new tools we have. I've found this hasn't been straightforward to figure out.

[0] https://modal.com/gpu-glossary

[1] https://www.youtube.com/watch?v=7xTGNNLPyMI

[2] https://www.youtube.com/watch?v=wjZofJX0v4M

[3] https://siboehm.com/articles/22/CUDA-MMM

Show HN: TokenDagger – A tokenizer faster than OpenAI's Tiktoken