I’m teaching myself LLM internals by re-implementing the stack from first principles. Profiling TikToken’s Python/Rust implementation showed that a lot of time is spent on regex matching. Most of my perf gains come from a) using a faster JIT-compiled regex engine, and b) simplifying the algorithm to forgo regex-matching special tokens entirely.
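For context on (b): special tokens are literal strings, so they can be located with plain substring search, and the regex only needs to run on the ordinary text between them. A minimal sketch of the idea (the names and the encode_ordinary hook are illustrative, not my library's actual API):

    # Sketch: locate special tokens with str.find instead of folding them into
    # the pre-tokenization regex; only ordinary text hits the regex/BPE path.
    SPECIAL_TOKENS = {"<|endoftext|>": 100257}  # example id from cl100k_base

    def split_on_specials(text):
        """Yield (chunk, is_special) pairs using plain substring search."""
        pos = 0
        while pos < len(text):
            # Earliest special token occurrence at or after pos, if any.
            hits = [(text.find(t, pos), t) for t in SPECIAL_TOKENS]
            hits = [(i, t) for i, t in hits if i != -1]
            if not hits:
                yield text[pos:], False
                return
            idx, tok = min(hits)
            if idx > pos:
                yield text[pos:idx], False   # ordinary text
            yield tok, True                  # special token
            pos = idx + len(tok)

    def encode(text, encode_ordinary):
        """encode_ordinary is whatever regex+BPE path handles normal text."""
        ids = []
        for chunk, is_special in split_on_specials(text):
            ids += [SPECIAL_TOKENS[chunk]] if is_special else encode_ordinary(chunk)
        return ids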
Benchmarking code is included. Notable results:
- 4x faster code sample tokenization on a single thread.
- 2-3x higher throughput when tested on a 1 GB natural language text file.
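The rough shape of the throughput measurement, for anyone who wants to reproduce it against tiktoken (the corpus path here is a stand-in; the actual harness is in the repo):

    import time
    import tiktoken  # baseline to compare against

    enc = tiktoken.get_encoding("cl100k_base")
    data = open("data/sample_1gb.txt", "rb").read()  # stand-in corpus path
    text = data.decode("utf-8")

    start = time.perf_counter()
    tokens = enc.encode(text)
    elapsed = time.perf_counter() - start
    print(f"{len(data) / elapsed / 1e6:.1f} MB/s, {len(tokens):,} tokens")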
Of the large proprietary Western AI labs (OpenAI, Anthropic, Google), only Anthropic lacks a local tokenizer for its current models (Claude 3 and newer).
[1] https://github.com/google/sentencepiece
[2] https://github.com/googleapis/python-aiplatform/blob/main/ve...
[3] https://storage.googleapis.com/deepmind-media/gemma/gemma-2-...: "We inherit from the large Gemini vocabulary (256k entries)."
[4] https://storage.googleapis.com/deepmind-media/gemma/Gemma3Re...: "We use the same tokenizer as Gemini 2.0."
For this project, I think there is value in building some of these model-specific quirks into the library. It could yield some minor performance gains and generally make the library easier to integrate with. Keeping up with newer models probably isn't too much work, since tokenizers change much less frequently than the models themselves. A rough sketch of what I mean is below the links.
[0] https://github.com/meta-llama/llama-models/blob/01dc8ce46fec...
[1] https://github.com/mistralai/mistral-common/tree/main/src/mi...
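To make the model-specific quirks idea concrete, this is roughly the shape I have in mind: a small per-model registry of pre-tokenization pattern plus special tokens, nothing elaborate. The names here are mine, the patterns are elided placeholders, and the special-token ids should be double-checked against the linked configs:

    # Illustrative per-model quirk registry; placeholder patterns, example ids.
    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class ModelQuirks:
        pretokenize_pattern: str   # model's pre-tokenization regex (elided here)
        special_tokens: dict[str, int] = field(default_factory=dict)

    REGISTRY = {
        "gpt-4": ModelQuirks(
            pretokenize_pattern="<cl100k_base pattern>",   # see tiktoken
            special_tokens={"<|endoftext|>": 100257},
        ),
        "llama-3": ModelQuirks(
            pretokenize_pattern="<llama-3 pattern>",       # see [0]
            special_tokens={"<|begin_of_text|>": 128000, "<|end_of_text|>": 128001},
        ),
    }

    def quirks_for(model):
        return REGISTRY[model]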