Teuken-7B-Base and Teuken-7B-Instruct: Towards European LLMs (2024)

(arxiv.org)

248 points doener | 1 comments | 15 Apr 25 10:17 UTC

ozgune ◴[15 Apr 25 12:12 UTC] No.43691597[source]▶

I have a related but orthogonal question about multilingual LLMs.

When I ask smaller models a question in English, the model does well. When I ask the same model a question in Turkish, the answer is mediocre. When I ask the model to translate my question into English, get the answer, and translate the answer back to Turkish, the model again does well.

For example, I tried the above with Llama 3.3 70B, and asked it to plan me a 3-day trip to Istanbul. When I asked Llama to do the translations between English <> Turkish, the answer was notably better.
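To make the workflow concrete, here is a minimal sketch of the translate, answer, translate-back loop described above. The `complete` callable is a hypothetical stand-in for whatever model call you use (it is not a real library API), and the prompt wording is only an assumption:

```python
from typing import Callable


def answer_via_english(
    question: str,
    language: str,
    complete: Callable[[str], str],  # hypothetical: prompt in, model text out
) -> str:
    """Pivot through English: translate in, answer, translate back out."""
    # Step 1: translate the question into English.
    english_question = complete(
        f"Translate the following {language} text into English. "
        f"Return only the translation.\n\n{question}"
    )
    # Step 2: answer the English version of the question.
    english_answer = complete(
        f"Answer the following question in English.\n\n{english_question}"
    )
    # Step 3: translate the answer back into the original language.
    return complete(
        f"Translate the following English text into {language}. "
        f"Return only the translation.\n\n{english_answer}"
    )
```

This costs three model calls instead of one, so it trades latency and tokens for (possibly) better answers.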

Has anyone else observed similar behavior?

replies(11): >>43691620 #>>43691751 #>>43691774 #>>43692427 #>>43692596 #>>43692803 #>>43692874 #>>43693906 #>>43695475 #>>43698229 #>>43698667 #

spacebanana7 ◴[15 Apr 25 12:29 UTC] No.43691774[source]▶

>>43691597 #

I suspect this also happens in programming languages. Subjectively I get the feeling that LLMs prefer to write in Python or JS.

It would be interesting to see whether they actually score better on LeetCode questions when using Python.
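A rough sketch of how such a comparison could be tallied, assuming you already have one (language, passed) record per attempted problem. The record format is an assumption for illustration, not output from any real benchmark:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_rate_by_language(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of passed attempts for each language."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for language, passed in results:
        totals[language] += 1
        passes[language] += int(passed)  # True counts as 1, False as 0
    return {language: passes[language] / totals[language] for language in totals}
```

For a fair comparison you would want the same problems attempted in every language, so that problem difficulty does not skew the rates.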

replies(2): >>43692051 #>>43692925 #

beAbU ◴[15 Apr 25 13:00 UTC] No.43692051[source]▶

>>43691774 #

Based on my very, very limited understanding of how LLMs work, surely they don't "prefer" anything, and just use what they have been trained on?

Presumably there is a lot more public info about, and code in, JavaScript and Python, hence this "preference".

Maybe the LLM preferring English is because of a similar phenomenon: it has been trained mostly on the Western, English-speaking internet?

replies(2): >>43692287 #>>43692991 #

spacebanana7 ◴[15 Apr 25 13:19 UTC] No.43692287[source]▶

>>43692051 #

> Presumably there is a lot more public info about, and code in, JavaScript and Python, hence this "preference".

This likely plays a major, probably dominant, role.

It's interesting to think of other factors too, though. The relatively concise syntax of those languages might make them easier for LLMs to work with. If resources are in any way token-limited, then reading and writing Spring Boot apps is going to be burdensome.

Those languages also have a lot of single-file applications, which might make them easier for LLMs to learn. So much of iOS development, for example, is split across many files, and I wonder if that affects the quality of the training data.
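As a crude illustration of the verbosity point, here is a sketch that counts word and punctuation chunks in two equivalent "hello" programs. The regex is only a rough proxy I am assuming for illustration; real LLM tokenizers (BPE and similar) split text differently, so treat the numbers as directional at best:

```python
import re

# Two illustrative snippets that do the same thing (print "hello").
PYTHON_SNIPPET = 'print("hello")'
JAVA_SNIPPET = (
    'public class Main { public static void main(String[] args) '
    '{ System.out.println("hello"); } }'
)


def rough_token_count(source: str) -> int:
    """Count runs of word characters plus individual punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", source))


if __name__ == "__main__":
    print("python:", rough_token_count(PYTHON_SNIPPET))
    print("java:  ", rough_token_count(JAVA_SNIPPET))
```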

replies(1): >>43692673 #

idle_zealot ◴[15 Apr 25 13:50 UTC] No.43692673[source]▶

>>43692287 #

Also worth considering: there's a wider range of "acceptable" output programs when dealing with such forgiving scripting languages. If asked to output C, there are loads of finicky bits it could mess up: pointer accesses, writing past the end of an array, using uninitialized memory, using a value it already freed, missing a free, etc. These are all things that the language runtime handles in Python or JS. With C, there's a higher cognitive load the model needs to take on.
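A small sketch of what "the runtime handles it" looks like in Python: reading past the end of a list raises a catchable `IndexError`, whereas the equivalent out-of-bounds read in C is undefined behavior. The `read_at` helper is a made-up name for illustration:

```python
from typing import List, Optional


def read_at(values: List[int], index: int) -> Optional[int]:
    """Return values[index], or None if the index is out of range."""
    try:
        return values[index]
    except IndexError:
        # Python checks bounds on every access, so a bad index is a
        # well-defined error rather than a silent read of stray memory.
        return None
```

Memory is also reclaimed by the runtime, so there is no equivalent of a missing or double `free` for the model to get wrong.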
