I’m teaching myself LLM internals by re-implementing the stack from first principles. Profiling TikToken’s Python/Rust implementation showed that a lot of time was spent on regex matching. Most of my perf gains come from (a) using a faster JIT-compiled regex engine, and (b) simplifying the algorithm to forgo regex matching of special tokens entirely.
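To make (b) concrete, here is a minimal sketch of the idea, not the actual implementation: special tokens are located with plain `str.find` calls, and the pre-tokenization regex only ever runs on the ordinary spans in between. The special-token set and the simplified ASCII pattern are stand-ins (real GPT-style patterns use Unicode classes like `\p{L}`, which need the third-party `regex` module or a JIT-compiled engine):

```python
import re

# Simplified stand-in for the BPE pre-tokenization pattern; the real
# GPT-style pattern uses Unicode classes like \p{L} and \p{N}.
WORD_RE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

# Hypothetical special-token set, for illustration only.
SPECIAL_TOKENS = ("<|endoftext|>", "<|fim_prefix|>")

def pretokenize(text: str) -> list[str]:
    """Split `text` into pre-tokens, finding special tokens with plain
    string search instead of folding them into the regex."""
    pieces: list[str] = []
    i = 0
    while i < len(text):
        # Nearest special token at or after position i, if any.
        hits = [(text.find(tok, i), tok) for tok in SPECIAL_TOKENS]
        hits = [(pos, tok) for pos, tok in hits if pos != -1]
        if not hits:
            pieces.extend(WORD_RE.findall(text[i:]))
            break
        pos, tok = min(hits)
        pieces.extend(WORD_RE.findall(text[i:pos]))  # ordinary span -> regex
        pieces.append(tok)                           # special token verbatim
        i = pos + len(tok)
    return pieces
```

The win is that the hot regex never sees special tokens at all, and the special-token scan is a handful of cheap string searches rather than extra alternations baked into the pattern.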
Benchmarking code is included. Notable results:

- 4x faster tokenization of code samples on a single thread.
- 2-3x higher throughput on a 1 GB natural-language text file.
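For reference, throughput numbers like these come from a harness roughly shaped like the sketch below (not the repo's actual code; `encode` stands in for whatever tokenizer is under test):

```python
import time

def benchmark(encode, path: str) -> None:
    """Measure single-threaded tokenization throughput in MB/s.
    `encode` is any tokenizer callable; `path` points at a large
    text file (e.g. ~1 GB of natural language)."""
    with open(path, "rb") as f:
        data = f.read()
    text = data.decode("utf-8", errors="replace")
    t0 = time.perf_counter()
    tokens = encode(text)
    elapsed = time.perf_counter() - t0
    mb = len(data) / 1e6
    print(f"{mb:.0f} MB in {elapsed:.2f} s -> {mb / elapsed:.1f} MB/s "
          f"({len(tokens)} tokens)")
```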
Iteration speed trumps all in research. Most of what Python does is launch GPU operations; if you're seeing slowdowns from Pythonland, you're doing something terribly wrong.
Python is an excellent (and yes, fast!) language for orchestrating and calling ML stuff. If C++ code is needed, call it as a module.
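As a minimal sketch of that pattern (the library name, function, and signature here are all hypothetical; `ctypes` is the stdlib route, though pybind11 or Cython are common alternatives):

```python
# Build the C++ side once, e.g.:
#   g++ -O3 -shared -fPIC fast_ops.cpp -o libfast_ops.so
# with fast_ops.cpp exposing:  extern "C" long count_tokens(const char*);
# (library and function are hypothetical, for illustration only)
import ctypes

lib = ctypes.CDLL("./libfast_ops.so")
lib.count_tokens.argtypes = [ctypes.c_char_p]
lib.count_tokens.restype = ctypes.c_long

n = lib.count_tokens("hello world".encode("utf-8"))
print(f"token count from C++: {n}")
```

The hot loop lives in compiled code; Python just hands it a buffer and reads back a number.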