This is an interesting problem with several challenges. Most current tokenizers are trained with byte pair encoding (BPE), which repeatedly merges the most frequently occurring character sequences in the training corpus into single tokens. Since those corpora are predominantly English, most of the learned merges are English ones, so the LLM ends up with a more efficient tokenization of English than of the other languages it's trained on: non-English text gets split into more tokens per word (higher "token fertility").
Cf. https://medium.com/@biswanai92/understanding-token-fertility...
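To see the effect concretely, here's a rough sketch using the tiktoken library; the GPT-2 encoding and the sample sentences are just illustrative choices, and splitting on whitespace is only a coarse word count:

    # Rough sketch: compare token fertility (tokens per word) of the
    # GPT-2 BPE vocabulary across languages. Requires `pip install tiktoken`.
    import tiktoken

    enc = tiktoken.get_encoding("gpt2")

    samples = {
        "English": "The quick brown fox jumps over the lazy dog.",
        "German": "Der schnelle braune Fuchs springt über den faulen Hund.",
        "Hindi": "तेज़ भूरी लोमड़ी आलसी कुत्ते के ऊपर कूदती है।",
    }

    for lang, text in samples.items():
        n_tokens = len(enc.encode(text))      # BPE token count
        n_words = len(text.split())           # crude whitespace word count
        print(f"{lang}: {n_tokens} tokens / {n_words} words, "
              f"fertility = {n_tokens / n_words:.2f}")

On a vocabulary whose merges were learned mostly from English text, the English sentence comes out near one token per word, while the Hindi sentence fragments into many byte-level tokens per word, which is exactly the imbalance described in the article above.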