505 points andy99 | 2 comments
1. seydor ◴[] No.44539596[source]
I wonder if multilingual LLMs are better or worse compared to single-language models
replies(1): >>44539787 #
2. tugdual ◴[] No.44539787[source]
This is an interesting problem with several challenges. Currently, most tokenizers are trained using byte pair encoding (BPE), where the most frequently seen combinations of characters are merged into single tokens. Since the training corpora are dominated by English text, the majority of the learned merges are English mappings, meaning your LLM ends up with a more efficient tokenization of English compared to the other languages it was being trained on.
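You can see this directly by measuring token "fertility" (tokens per word) across languages. A minimal sketch, assuming the tiktoken library and a BPE vocabulary trained on English-heavy data; the sample sentences are illustrative, not from the original comment:

    # Compare token fertility (average tokens per word) across languages
    # using an English-heavy BPE vocabulary. Higher fertility means the
    # tokenizer splits words into more pieces, i.e. less efficient encoding.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    samples = {
        "English": "The weather is nice today and we are going for a walk.",
        "Finnish": "Sää on tänään mukava ja me menemme kävelylle.",
        "Turkish": "Bugün hava güzel ve yürüyüşe çıkıyoruz.",
    }

    for lang, text in samples.items():
        tokens = enc.encode(text)
        words = text.split()
        fertility = len(tokens) / len(words)
        print(f"{lang:8s} words={len(words):2d} tokens={len(tokens):2d} "
              f"fertility={fertility:.2f}")

Running something like this typically shows English sentences closer to one token per word, while morphologically rich languages get split into noticeably more tokens.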

Cf. https://medium.com/@biswanai92/understanding-token-fertility...