Teuken-7B-Base and Teuken-7B-Instruct: Towards European LLMs (2024)

(arxiv.org)

248 points doener | 2 comments | 15 Apr 25 10:17 UTC

miros_love ◴[15 Apr 25 12:14 UTC] No.43691616[source]▶

>European versions of ARC

But this is an image-like benchmark. Has anyone looked at the article about EU-ARC? What is the difference? Why can't you measure on the regular one?

I glanced through it and didn't find the answer right away, but judging by their tokenizer, they are training from scratch. In general, I don't like this approach for the task at hand. For large languages, there are already good models that they don't want to compare against. And for low-resource languages, it is very important to include more languages from the same language group, which are not necessarily EU languages.

replies(2): >>43691644 #>>43691647 #

Etheryte ◴[15 Apr 25 12:17 UTC] No.43691647[source]▶

>>43691616 #

Why would they want more languages from outside of the EU when they've clearly stated they only target the 24 official languages of the European Union?

replies(1): >>43691728 #

miros_love ◴[15 Apr 25 12:24 UTC] No.43691728[source]▶

>>43691647 #

Take Slovene, for example. There simply isn't enough data for it. But if you add all the data that is available for related languages, you will get higher quality. LLMs fail in this respect for low-resource languages.

replies(2): >>43691822 #>>43692263 #

yorwba ◴[15 Apr 25 12:35 UTC] No.43691822[source]▶

>>43691728 #

They train on 14 billion tokens in Slovene. Are you sure that's not enough?

replies(1): >>43692048 #

miros_love ◴[15 Apr 25 13:00 UTC] No.43692048[source]▶

>>43691822 #

Unfortunately, yes.

We need more tokens, more variety of topics in texts and more complexity.

replies(1): >>43693189 #

mdp2021 ◴[15 Apr 25 14:23 UTC] No.43693189[source]▶

>>43692048 (TP) #

We need one-shot learning.

(That amount is equivalent to 50000 books, which few nationals will have read.)
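
For anyone who wants to check the book comparison, here is a rough sketch of the arithmetic. It uses only the two figures quoted in this thread (14 billion Slovene tokens, 50000 books); the function names and the alternative book length of 100,000 tokens are assumptions made for illustration, not numbers from the paper.

```python
# Figures quoted in the thread.
SLOVENE_TOKENS = 14_000_000_000  # "14 billion tokens in Slovene"
BOOKS = 50_000                   # "equivalent to 50000 books"


def tokens_per_book(total_tokens: int, books: int) -> float:
    """Average book length (in tokens) implied by a corpus size and a book count."""
    return total_tokens / books


def books_for(total_tokens: int, assumed_tokens_per_book: int) -> float:
    """Number of books a corpus corresponds to, given an assumed book length."""
    return total_tokens / assumed_tokens_per_book


# Book length implied by the comparison in the comment above.
implied = tokens_per_book(SLOVENE_TOKENS, BOOKS)
print(f"Implied tokens per book: {implied:,.0f}")

# Hypothetical alternative: shorter books of 100,000 tokens each (an assumption).
alternative = books_for(SLOVENE_TOKENS, 100_000)
print(f"Books at 100,000 tokens each: {alternative:,.0f}")
```

So the 50000-book figure corresponds to roughly 280,000 tokens per book; a shorter assumed book length gives a proportionally larger book count.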
