Teuken-7B-Base and Teuken-7B-Instruct: Towards European LLMs (2024)

For most low-resource languages, support in LLMs is trained through translation pairs between english and the other languages, because translation data is easier to come across than say, conversations about coding, history, physics, basically the kind of data that is usually used for instruct training.

This kind of training data typically looks like ChatGPT style conversations where all the prompts are all templated like “Translate the following text from X to Y: [text]” and the LLM’s expected answer is the translated text.

LLMs can generalize through transfer learning (to a certain extent) from these translation pairs to some understanding (strong) and even answering (weak) in the target language. It also means that the LLM’s actual sweet spot is in translation itself since that’s what was trained in, not just a generalization.