I’m teaching myself LLM internals by re-implementing the stack from first principles. Profiling TikToken’s Python/Rust implementation showed that a lot of time was spent on regex matching. Most of my perf gains come from a) using a faster JIT-compiled regex engine; and b) simplifying the algorithm to forgo regex-matching special tokens altogether.
Benchmarking code is included. Notable results:
- 4x faster code sample tokenization on a single thread.
- 2-3x higher throughput when tested on a 1GB natural language text file.
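To make point (b) concrete, here is a minimal Python sketch of one way to locate special tokens with plain substring search instead of a regex. This is not the project's actual implementation; the function name `split_on_special` and the example token list are assumptions made purely for illustration.

```python
def split_on_special(text, special_tokens):
    """Split text into (chunk, is_special) pairs using plain substring search.

    Hypothetical helper: illustrates skipping the regex engine for special
    tokens; it is not taken from any real tokenizer's code.
    """
    pieces = []
    pos = 0
    while pos < len(text):
        # Find the earliest occurrence of any special token at or after pos.
        best_idx, best_tok = -1, None
        for tok in special_tokens:
            if not tok:
                continue  # ignore empty tokens, which would never advance pos
            idx = text.find(tok, pos)
            if idx == -1:
                continue
            # Prefer the earliest match; on a tie, prefer the longer token.
            if (best_tok is None or idx < best_idx
                    or (idx == best_idx and len(tok) > len(best_tok))):
                best_idx, best_tok = idx, tok
        if best_tok is None:
            # No special token left: the remainder is ordinary text.
            pieces.append((text[pos:], False))
            break
        if best_idx > pos:
            pieces.append((text[pos:best_idx], False))
        pieces.append((best_tok, True))
        pos = best_idx + len(best_tok)
    return pieces


parts = split_on_special("hello<|endoftext|>world", ["<|endoftext|>"])
print(parts)
```

Only the chunks flagged as ordinary text would then go through the regex pre-tokenizer. With a large special-token vocabulary, a trie or Aho-Corasick matcher would likely scale better than this linear scan over the token list.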
1. Make it work. 2. Make it fast. 3. Make it pretty.
Transformers & LLMs have been developed to a point where they work quite well. I feel as though we're at a stage where most substantial progress is being made on the performance side.
My mentor used to say it is the difference between a screw and glue.
You can glue some things together and prove that it works, but eventually you learn that anytime you had to break something to fix it, you should've used a screw.
It is a trade-off in coupling: the glue binds tightly over the entire surface, but a screw concentrates the loads, so it needs maintenance to stay tight.
You only really know which is "right" if you test it to destruction.
All of that advice probably sounds dated now; even in materials science the glue might be winning (see the Tesla bumper or Lotus Elise bonding videos - every screw is extra grams).
1. Make it
2. Make it work
3. Make it work better
(Different circumstances have different nuances about what "better" means; it isn't always performance optimization. Some do substitute "faster" for "better" here, but I think it loses generality then.)
Firmitas, utilitas, venustas - Strong, useful, and beautiful.