279 points matthewolfe | 6 comments

TokenDagger is a drop-in replacement for OpenAI's tiktoken (the tokenizer behind Llama 3, Mistral, GPT-3.*, etc.). It's written in C++17 with thin Python bindings, keeps the exact same BPE vocab and special-token rules, and focuses on raw speed.
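A sketch of what "drop-in" means in practice: the tiktoken calls below are the real API, while the commented-out tokendagger import is illustrative rather than confirmed, so treat the exact module name as an assumption.

    import tiktoken                     # baseline tokenizer
    # import tokendagger as tiktoken    # hypothetical swap: same call sites

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("same BPE vocab, same special-token rules")
    assert enc.decode(ids) == "same BPE vocab, same special-token rules"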

I’m teaching myself LLM internals by re-implementing the stack from first principles. Profiling tiktoken’s Python/Rust implementation showed that a lot of time was spent on regex matching. Most of my perf gains come from (a) using a faster JIT-compiled regex engine and (b) simplifying the algorithm to forgo regex matching for special tokens entirely.
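To make (b) concrete, here's a minimal sketch of the idea in Python (illustrative only, not TokenDagger's actual C++): scan for special tokens with plain string search and only hand the spans in between to the BPE regex.

    def split_on_special(text: str, specials: list[str]):
        """Yield (chunk, is_special) pairs, matching specials by plain find()."""
        i = 0
        while i < len(text):
            # earliest occurrence of any special token at or after i
            hits = [(text.find(s, i), s) for s in specials]
            hits = [(pos, s) for pos, s in hits if pos != -1]
            if not hits:
                yield text[i:], False
                break
            pos, s = min(hits)
            if pos > i:
                yield text[i:pos], False  # ordinary text: goes to the BPE regex
            yield s, True                 # special token: emitted directly
            i = pos + len(s)

    list(split_on_special("hi<|endoftext|>bye", ["<|endoftext|>"]))
    # [('hi', False), ('<|endoftext|>', True), ('bye', False)]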

Benchmarking code is included. Notable results:

- 4x faster tokenization of code samples on a single thread.
- 2-3x higher throughput on a 1 GB natural-language text file.
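For anyone who wants a quick baseline number before running the included benchmarks, a single-thread timing harness along these lines works (my own sketch, not the repo's benchmark; it assumes a local sample.py to tokenize):

    import time
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = open("sample.py").read() * 100  # assumes a local code sample

    start = time.perf_counter()
    ids = enc.encode(text)
    elapsed = time.perf_counter() - start
    print(f"{len(ids)} tokens in {elapsed:.3f}s "
          f"({len(ids) / elapsed:,.0f} tokens/s)")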

npalli ◴[] No.44422888[source]
Kudos. I think (in the short term at least) there's a lot of perf optimization to be found by coding parts of the AI/ML infrastructure in C++ like this, not as a rewrite (god no!) but by dropping in and fixing key bottlenecks. Anytime I see someone put something out in C++ (Chinese engineers seem especially good at this), there's a good chance some solid engineering tradeoffs have been made and a dramatic improvement will follow.
replies(4): >>44424382 #>>44424572 #>>44424990 #>>44427963 #
1. saretup ◴[] No.44424572[source]
And while we’re at it, let’s move away from Python altogether. In the long run, sticking with it doesn’t make sense just because it’s the language ML engineers happen to be familiar with.
replies(3): >>44424608 #>>44425185 #>>44425265 #
2. tbalsam ◴[] No.44424608[source]
No! This is not good.

Iteration speed trumps all in research. Most of what Python does is launch GPU operations; if you're seeing slowdowns from Python-land, you're doing something terribly wrong.

Python is an excellent (and yes, fast!) language for orchestrating and calling ML stuff. If C++ code is needed, call it as a module.
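For what it's worth, "call it as a module" can be this small. A minimal ctypes sketch, where libtokenizer.so and count_tokens are hypothetical stand-ins for whatever compiled C++ bottleneck fix you've built:

    import ctypes

    # hypothetical compiled C++ library exposing: int count_tokens(const char*)
    lib = ctypes.CDLL("./libtokenizer.so")
    lib.count_tokens.argtypes = [ctypes.c_char_p]
    lib.count_tokens.restype = ctypes.c_int

    n = lib.count_tokens("hello world".encode("utf-8"))
    print(n)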

3. bigyabai ◴[] No.44425185[source]
It makes plenty of sense. Python handles strings well, has a great package ecosystem, and is easy to write and learn for non-programmers. It can be embedded into a notebook (which is huge for academics), and it's a "write once, run anywhere" platform, at least in theory. It's great.

If you think Python is a bad language for AI integrations, try writing one in a compiled language.

replies(1): >>44429751 #
4. janalsncm ◴[] No.44425265[source]
Most of that is already happening under the hood. A lot of performance-sensitive code is already written in C or Cython, e.g. NumPy, scikit-learn, and pandas. Lots of PyTorch code is C++ or CUDA underneath.

ML researchers aren’t using Python because they are dumb. They use it because what takes 8 lines in Java can be done in 2 or 3 lines of Python (including the `import json`), for example.
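A toy illustration of the brevity argument, with real stdlib calls:

    import json

    data = json.loads('{"model": "llama3", "tokens": 128}')
    print(data["tokens"])  # two or three lines, counting the import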

5. mdaniel ◴[] No.44429751[source]
> has a great package ecosystem

So great there are 8 of them. 800% better than all the rest!

> If you think Python is a bad language for AI integrations, try writing one in a compiled language.

I'll take that challenge, all day, every day, so long as I and the hypothetical "move fast and break things" author are held to the same "must run in prod" and "must be understandable by some other human" qualifiers.

What type is `array`? Don't worry your pretty head about it: feed it whatever type you want and let Sentry's TypeError sort it out <https://github.com/openai/whisper/blob/v20250625/whisper/aud...>. Oh, sorry, you wanted to know what `pad_or_trim` returns? Well, that's just, like, your opinion, man.
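For contrast, here's what that kind of signature looks like with type hints spelling out what goes in and what comes out (illustrative annotations of my own, not whisper's actual code):

    from typing import Union

    import numpy as np
    import torch

    def pad_or_trim(
        array: Union[np.ndarray, torch.Tensor],
        length: int = 480_000,
    ) -> Union[np.ndarray, torch.Tensor]:
        """Pad or trim `array` to `length` samples along the last axis."""
        ...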

replies(1): >>44429894 #
6. bigyabai ◴[] No.44429894{3}[source]
That tracks with me; I don't like using Python for real programming either. Try explaining any of your "Python sucks" catechisms to a second-year statistics student, though. If you'd rather teach them C++, be my guest. If you want to make them indebted to proprietary infra like Mojo or CUDA, knock yourself out.

I'm still teaching them Python.