Teuken-7B-Base and Teuken-7B-Instruct: Towards European LLMs (2024)

(arxiv.org)


ozgune ◴[15 Apr 25 12:12 UTC] No.43691597[source]▶

I had a related, but orthogonal question about multilingual LLMs.

When I ask smaller models a question in English, the model does well. When I ask the same model a question in Turkish, the answer is mediocre. When I ask the model to translate my question into English, get the answer, and translate the answer back to Turkish, the model again does well.

For example, I tried the above with Llama 3.3 70B, and asked it to plan me a 3-day trip to Istanbul. When I asked Llama to do the translations between English <> Turkish, the answer was notably better.

Has anyone else observed similar behavior?
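The translate, answer, translate-back workflow described above can be sketched roughly as follows. This is only an illustration, not how the experiment was actually run: `llm` is a hypothetical callable that sends one prompt to whatever model you use and returns its text reply, and the prompt wording is an assumption.

```python
from typing import Callable


def answer_via_english(
    question: str,
    source_lang: str,
    llm: Callable[[str], str],  # hypothetical: prompt in, model reply out
) -> str:
    """Answer a non-English question by pivoting through English."""
    # Step 1: have the model translate the question into English.
    english_question = llm(
        f"Translate the following {source_lang} text into English. "
        f"Reply with the translation only.\n\n{question}"
    )
    # Step 2: ask the translated question directly, in English.
    english_answer = llm(english_question)
    # Step 3: translate the English answer back into the original language.
    return llm(
        f"Translate the following English text into {source_lang}. "
        f"Reply with the translation only.\n\n{english_answer}"
    )
```

Comparing this against a single direct call, `llm(question)`, on the same prompts is one way to check whether the pivot actually helps for a given model and language.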

replies(11): >>43691620 #>>43691751 #>>43691774 #>>43692427 #>>43692596 #>>43692803 #>>43692874 #>>43693906 #>>43695475 #>>43698229 #>>43698667 #

petesergeant ◴[15 Apr 25 12:15 UTC] No.43691620[source]▶

>>43691597 #

Fascinating phenomenon. It's like a new Sapir–Whorf hypothesis. Do language models act differently in different languages due to those languages or the training materials?

replies(3): >>43691662 #>>43691782 #>>43692839 #

1. evgen ◴[15 Apr 25 12:30 UTC] No.43691782[source]▶

>>43691620 #

This is one of those subtle clues that the LLM does not actually 'know' anything. It is providing you the best consensus answer to your prompt using the data upon which the weights rest. If that data was input primarily as English, then you are going to get better results asking in English. It is still Searle's Chinese Room, except you need to first go to the 'Language X -> English' room and then deliver its output to the general query room before delivering the next result to the 'English -> Language X' room.

replies(6): >>43691852 #>>43691872 #>>43691873 #>>43692157 #>>43692672 #>>43692745 #

2. jug ◴[15 Apr 25 12:38 UTC] No.43691852[source]▶

>>43691782 (TP) #

Anthropic’s research did find that Claude seemed to have an inner, language-agnostic "language", though. It also found that the larger an LLM got, the better it could recognize the shared meaning of words across language barriers and expand its internal, non-language-specific representation.

So, part of the improved performance of models as they grow in parameter count is probably due not only to the expanded raw material they are trained on, but also to a greater ability to ultimately "realize" and connect the apparent meanings of words, so that a German speaker might benefit more and more from training material in Korean.

> These results show that features at the beginning and end of models are highly language-specific (consistent with the {de, re}-tokenization hypothesis [31]), while features in the middle are more language-agnostic. Moreover, we observe that compared to the smaller model, Claude 3.5 Haiku exhibits a higher degree of generalization, and displays an especially notable generalization improvement for language pairs that do not share an alphabet (English-Chinese, French-Chinese).

Source: https://transformer-circuits.pub/2025/attribution-graphs/bio...

However, they do see that Claude 3.5 Haiku seemed to have an English "default" with more direct connections. Is it possible that an LLM needs to take a more roundabout route via generalizations to communicate in other languages, and that this causes a larger drop-off in performance the smaller the model is?

replies(2): >>43697883 #>>43701754 #

3. justlikereddit ◴[15 Apr 25 12:41 UTC] No.43691872[source]▶

>>43691782 (TP) #

Similar to or better than the performance of most so-called humans, so I guess we're all a collection of Chinese Room switchboxes.

4. keeganpoppen ◴[15 Apr 25 12:41 UTC] No.43691873[source]▶

>>43691782 (TP) #

Searle's "Chinese Room" was as wrong then as it is now

5. vjerancrnjak ◴[15 Apr 25 13:09 UTC] No.43692157[source]▶

>>43691782 (TP) #

Exactly. I found it surprising how soon it was implied that "Imagine you're the smartest and most creative person in the world, ..." would somehow result in the most creative output.

It's been clear from the start that language modelling is not there yet. It can't reason about low-level structure (letters, syllables, rhyme, rhythm), and it can't map all languages to a single clear representation. The representation is a mushy, distributed mess out of which you get good or bad results.

It's brilliant how relevant the responses are and when they're correct, but the underlying process is driven by very weird internal representations.

6. sorenjan ◴[15 Apr 25 13:50 UTC] No.43692672[source]▶

>>43691782 (TP) #

It would be great if we could get to a point where we can use a language encoder and decoder, with a language-agnostic knowledge model in between. But since it's generally more efficient to train the whole model end to end, such modularity would probably come at a performance price, and I don't see any private (or "non-profit") companies taking that approach anytime soon.

7. TimPC ◴[15 Apr 25 13:56 UTC] No.43692745[source]▶

>>43691782 (TP) #

My supervising professor for the PhD program I left wrote a paper on the Chinese Room and argued that, to a large degree, understanding a task is the ability to compress it by many orders of magnitude. In that sense the LLMs are succeeding because, despite their supposedly massive parameter sets, they are absolutely tiny compared to the Chinese Room version.

8. numpad0 ◴[15 Apr 25 20:18 UTC] No.43697883[source]▶

>>43691852 #

The modern Standard Chinese language is almost syntactically "identical" to English, for some reason. French was a direct ancestor of the medieval British language that became modern English.

My point is, those language pairs aren't random examples. Chinese isn't a completely foreign and new thing when it comes to how it differs from English.

9. ako ◴[16 Apr 25 05:19 UTC] No.43701754[source]▶

>>43691852 #

Sounds like it is capable of thinking in abstract concepts instead of in words that are related/connected? So training material in different languages would all add to its knowledge of the same concepts?

It is like the difference between a student who is brilliant at learning by heart and repeating the words they studied, but doesn't understand the concept, and a student who actually understands the topic and can reason about the concepts.
