I've been playing with tokenization too. Starting from Karpathy's Python minbpe, I set myself the task of training a tokenizer on wikitext (500 MB) in a reasonable time.
I got my C++ version down to about 50 minutes, compared to an estimated several months for the original Python code.
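For context, here's a minimal sketch of what one naive BPE training iteration looks like (simplified, not my actual optimized code; names are mine, not minbpe's). The whole-corpus rescan per merge is what makes the naive version so slow, and most of the speedup comes from avoiding it, e.g. by updating pair counts incrementally:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

using Token = uint32_t;
using Pair = std::pair<Token, Token>;

// Count every adjacent token pair in the corpus.
std::map<Pair, size_t> count_pairs(const std::vector<Token>& toks) {
    std::map<Pair, size_t> counts;
    for (size_t i = 0; i + 1 < toks.size(); ++i)
        ++counts[{toks[i], toks[i + 1]}];
    return counts;
}

// Replace every occurrence of the pair `p` with the new token id `id`.
std::vector<Token> merge(const std::vector<Token>& toks, Pair p, Token id) {
    std::vector<Token> out;
    out.reserve(toks.size());
    for (size_t i = 0; i < toks.size(); ++i) {
        if (i + 1 < toks.size() && toks[i] == p.first && toks[i + 1] == p.second) {
            out.push_back(id);
            ++i;  // skip the second token of the merged pair
        } else {
            out.push_back(toks[i]);
        }
    }
    return out;
}

int main() {
    // Toy corpus as raw bytes; ids 0..255 are reserved for single bytes.
    std::vector<Token> toks = {'a','a','a','b','d','a','a','a','b','a','c'};
    Token next_id = 256;
    for (int step = 0; step < 3; ++step) {
        auto counts = count_pairs(toks);  // full rescan every merge: the bottleneck
        auto best = std::max_element(counts.begin(), counts.end(),
            [](const auto& a, const auto& b) { return a.second < b.second; });
        toks = merge(toks, best->first, next_id++);
    }
    std::printf("tokens after 3 merges: %zu\n", toks.size());
}
```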
Haven't really spent much time looking at encode and decode yet, but I plan to incorporate these regex modifications when I do!