I’m teaching myself LLM internals by re-implementing the stack from first principles. Profiling tiktoken’s Python/Rust implementation showed that a lot of time was spent on regex matching. Most of my performance gains come from (a) using a faster JIT-compiled regex engine, and (b) simplifying the algorithm to forgo regex matching of special tokens entirely.
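Roughly, the idea behind (b) looks like this (a simplified sketch, not the code from the repo; the names, the token id, and the byte-level `encode_ordinary` stub are illustrative stand-ins for the real BPE path):

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical special-token table: literal string -> token id (the id shown
// is cl100k_base's <|endoftext|>, used here purely as an example).
static const std::vector<std::pair<std::string, uint32_t>> kSpecialTokens = {
    {"<|endoftext|>", 100257},
};

// Stand-in for the ordinary BPE path (regex pre-split + merge loop); emits
// raw byte values so the sketch runs on its own.
std::vector<uint32_t> encode_ordinary(std::string_view text) {
    std::vector<uint32_t> ids;
    for (unsigned char c : text) ids.push_back(c);
    return ids;
}

std::vector<uint32_t> encode(std::string_view text) {
    std::vector<uint32_t> out;
    size_t pos = 0;
    while (pos < text.size()) {
        // Find the earliest special token at or after `pos` with plain
        // substring search -- no regex alternation over the special set.
        size_t best = std::string_view::npos;
        const std::pair<std::string, uint32_t>* hit = nullptr;
        for (const auto& st : kSpecialTokens) {
            size_t found = text.find(st.first, pos);
            if (found < best) { best = found; hit = &st; }
        }
        if (hit == nullptr) {  // no special token left: encode the tail, done
            auto tail = encode_ordinary(text.substr(pos));
            out.insert(out.end(), tail.begin(), tail.end());
            break;
        }
        // Ordinary text before the special token, then the special id itself.
        auto chunk = encode_ordinary(text.substr(pos, best - pos));
        out.insert(out.end(), chunk.begin(), chunk.end());
        out.push_back(hit->second);
        pos = best + hit->first.size();
    }
    return out;
}

int main() {
    auto ids = encode("hello<|endoftext|>world");
    std::printf("%zu tokens\n", ids.size());  // 5 byte ids + 1 special + 5 byte ids
    return 0;
}
```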
Benchmarking code is included. Notable results:

- 4x faster tokenization of code samples on a single thread.
- 2-3x higher throughput when tested on a 1 GB natural-language text file.
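For context, a minimal sketch of how one might measure single-threaded throughput on a large file (this is not the bundled benchmark; `encode` is assumed to be the tokenizer entry point, e.g. the sketch above, and must be provided at link time):

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>
#include <string_view>
#include <vector>

// Assumed tokenizer entry point, provided by the tokenizer translation unit.
std::vector<uint32_t> encode(std::string_view text);

int main(int argc, char** argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    // Slurp the whole file so I/O is excluded from the timed region.
    std::ifstream in(argv[1], std::ios::binary);
    std::ostringstream buf;
    buf << in.rdbuf();
    const std::string text = buf.str();

    auto t0 = std::chrono::steady_clock::now();
    auto tokens = encode(text);
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%zu tokens in %.3fs (%.1f MB/s)\n",
                tokens.size(), secs, text.size() / 1e6 / secs);
    return 0;
}
```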
GPU kernels typically dominate wall-clock time; the only exception might be very small models.
Thus the latency of tokenization can essentially be “hidden” by having the CPU prepare the next batch while the GPU finishes the current one.
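A minimal sketch of that overlap, as a double-buffered loop where a worker thread tokenizes batch N+1 while the accelerator consumes batch N (the `encode` and `run_model` bodies here are trivial stand-ins, not a real tokenizer or inference stack):

```cpp
#include <chrono>
#include <cstdint>
#include <future>
#include <string>
#include <string_view>
#include <thread>
#include <vector>

// Stand-in tokenizer: byte ids only, so the sketch compiles and runs alone.
std::vector<uint32_t> encode(std::string_view text) {
    std::vector<uint32_t> ids;
    for (unsigned char c : text) ids.push_back(c);
    return ids;
}

// Stand-in for kernel launches / a forward pass on the current batch.
void run_model(const std::vector<uint32_t>& /*tokens*/) {
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
}

// Double-buffered loop: tokenization of batch i+1 overlaps run_model(batch i).
void pipeline(const std::vector<std::string>& batches) {
    if (batches.empty()) return;
    auto next = std::async(std::launch::async,
                           [&] { return encode(batches[0]); });
    for (size_t i = 0; i < batches.size(); ++i) {
        std::vector<uint32_t> tokens = next.get();  // wait for batch i
        if (i + 1 < batches.size()) {
            // Start tokenizing batch i+1 on the CPU...
            next = std::async(std::launch::async,
                              [&, i] { return encode(batches[i + 1]); });
        }
        run_model(tokens);  // ...while the accelerator handles batch i
    }
}

int main() {
    pipeline({"first batch of text", "second batch", "third batch"});
    return 0;
}
```

The same shape applies with real workloads: as long as tokenizing batch N+1 takes less time than the forward pass on batch N, the tokenizer stays off the critical path.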
Sometimes it can overlap with work being issued to the accelerator, but the pros look at flame graphs: a CPU core running the AVX lanes hard isn't necessarily keeping the bus fed, and a million other things can get in the way. People pre-tokenize big runs all the time.
I don't know why this thread is full of "nothing to see here" responses: this obliterates the SOTA from the money-is-no-object status quo. I'd like to think better of the community than the obvious explanation, which is that C++ is threatening a modest mindshare comeback against a Rust narrative that's already under pressure from the explosion of interest in Zig. Maybe there's a better reason.