505 points andy99 | 2 comments
1. seydor ◴[] No.44539596[source]
I wonder if multilingual LLMs are better or worse compared to single-language models
replies(1): >>44539787 #
2. tugdual ◴[] No.44539787[source]
This is an interesting problem with several challenges. Currently, most tokenizers are trained using byte pair encoding (BPE), where the most frequently seen combinations of characters are merged into single tokens. Since the training corpora are dominated by English text, the majority of the learned merges are English mappings, meaning your LLM ends up with a more efficient tokenization of English compared to the other languages it was being trained on.
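You can see this directly by measuring token "fertility" (tokens per word) across languages. A minimal sketch, assuming the tiktoken library and a BPE vocabulary trained on English-heavy data; the sample sentences are illustrative, not from the original comment:

    # Compare token fertility (average tokens per word) across languages
    # using an English-heavy BPE vocabulary. Higher fertility means the
    # tokenizer splits words into more pieces, i.e. less efficient encoding.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    samples = {
        "English": "The weather is nice today and we are going for a walk.",
        "Finnish": "Sää on tänään mukava ja me menemme kävelylle.",
        "Turkish": "Bugün hava güzel ve yürüyüşe çıkıyoruz.",
    }

    for lang, text in samples.items():
        tokens = enc.encode(text)
        words = text.split()
        fertility = len(tokens) / len(words)
        print(f"{lang:8s} words={len(words):2d} tokens={len(tokens):2d} "
              f"fertility={fertility:.2f}")

Running something like this typically shows English sentences closer to one token per word, while morphologically rich languages get split into noticeably more tokens.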

Cf. https://medium.com/@biswanai92/understanding-token-fertility...