248 points doener | 18 comments
ozgune ◴[] No.43691597[source]
I had a related, but orthogonal question about multilingual LLMs.

When I ask smaller models a question in English, the model does well. When I ask the same model a question in Turkish, the answer is mediocre. When I ask the model to translate my question into English, get the answer, and translate the answer back to Turkish, the model again does well.

For example, I tried the above with Llama 3.3 70B, and asked it to plan me a 3-day trip to Istanbul. When I asked Llama to do the translations between English <> Turkish, the answer was notably better.
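
Concretely, the pivot workflow I compared against looks roughly like this (a minimal sketch; `llm` is a placeholder for whatever chat-completion call you use, not a specific API):

    # Minimal sketch of the pivot workflow; `llm(prompt)` is a placeholder
    # for whatever chat-completion call you use (e.g. a local Llama 3.3 70B).
    def ask_via_english_pivot(question_tr, llm):
        # 1. Translate the Turkish question into English.
        question_en = llm("Translate this Turkish text to English:\n" + question_tr)
        # 2. Answer in English, where the model tends to do best.
        answer_en = llm(question_en)
        # 3. Translate the English answer back into Turkish.
        return llm("Translate this English text to Turkish:\n" + answer_en)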

Anyone else observed a similar behavior?

replies(11): >>43691620 #>>43691751 #>>43691774 #>>43692427 #>>43692596 #>>43692803 #>>43692874 #>>43693906 #>>43695475 #>>43698229 #>>43698667 #
1. petesergeant ◴[] No.43691620[source]
Fascinating phenomenon. It's like a new Sapir–Whorf hypothesis. Do language models act differently in different languages because of the languages themselves, or because of the training materials?
replies(3): >>43691662 #>>43691782 #>>43692839 #
2. shaky-carrousel ◴[] No.43691662[source]
They absolutely do. They know more in English than in Spanish; I've seen that in all models since the beginning.
replies(1): >>43697719 #
3. evgen ◴[] No.43691782[source]
This is one of those subtle clues that the LLM does not actually 'know' anything. It is providing you the best consensus answer to your prompt using the data upon which the weights rest; if that data was input primarily as English, then you are going to get better results asking in English. It is still Searle's Chinese Room, except you need to first go to the 'Language X -> English' room and then deliver its output to the general query room before delivering that result to the 'English -> Language X' room.
replies(6): >>43691852 #>>43691872 #>>43691873 #>>43692157 #>>43692672 #>>43692745 #
4. jug ◴[] No.43691852[source]
Anthropic’s research did find that Claude seems to have an inner, language-agnostic “language”, though, and that the larger an LLM gets, the more it can realize the innate meaning of words across language barriers and expand upon its internal, language-nonspecific representation.

So part of the improved performance as LLMs grow in parameter count is probably due not only to the expanded raw material they are trained upon, but also to a greater ability to ultimately “realize” and connect the apparent meanings of words, so that a German speaker might benefit more and more from training material in Korean.

> These results show that features at the beginning and end of models are highly language-specific (consistent with the {de, re}-tokenization hypothesis [31] ), while features in the middle are more language-agnostic. Moreover, we observe that compared to the smaller model, Claude 3.5 Haiku exhibits a higher degree of generalization, and displays an especially notable generalization improvement for language pairs that do not share an alphabet (English-Chinese, French-Chinese).

Source: https://transformer-circuits.pub/2025/attribution-graphs/bio...

However, they do see that Claude 3.5 Haiku seems to have an English “default” with more direct connections. It’s possible that an LLM needs to take a more roundabout route via generalizations to communicate in other languages, and that this causes a drop-off in performance the smaller the model is?
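
If you want to poke at the “language-agnostic middle” idea yourself, a crude probe (decidedly not Anthropic’s attribution-graph method) is to compare hidden states for a translated sentence pair layer by layer. The model name, the sentence pair, and the mean-pooling below are all arbitrary assumptions, just enough to see the shape of the effect:

    # Rough probe (not Anthropic's method): compare mean-pooled hidden
    # states for the same sentence in two languages, layer by layer.
    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "Qwen/Qwen2.5-0.5B"  # placeholder; any multilingual causal LM
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name, output_hidden_states=True)

    def layer_vectors(text):
        ids = tok(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        # mean-pool over tokens -> one vector per layer
        return [h.mean(dim=1).squeeze(0) for h in out.hidden_states]

    en = layer_vectors("The opposite of small is big.")
    de = layer_vectors("Das Gegenteil von klein ist groß.")
    for i, (a, b) in enumerate(zip(en, de)):
        sim = torch.nn.functional.cosine_similarity(a, b, dim=0).item()
        print(f"layer {i:2d}: cross-lingual cosine similarity {sim:.3f}")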

replies(2): >>43697883 #>>43701754 #
5. justlikereddit ◴[] No.43691872[source]
Similar to or better than the performance of most so-called humans, so I guess we're all a collection of Chinese Room switchboxes.
6. keeganpoppen ◴[] No.43691873[source]
Searle's "Chinese Room" was as wrong then as it is now
7. vjerancrnjak ◴[] No.43692157[source]
Exactly. I found it surprising how soon it was implied that prompting "Imagine you're the smartest and most creative person in the world, ..." would somehow result in the most creative output.

It's clear from the start that language modelling is not yet there. It can't reason about low-level structure (letters, syllables, rhyme, rhythm), and it can't map all languages to a single clear representation. The representation is a mushy, distributed mess out of which you get good or bad results.

It's brilliant how relevant the responses are, and that they're often correct, but the underlying process is driven by very weird internal representations.

8. sorenjan ◴[] No.43692672[source]
It would be great if we could get to a point where we can use a language encoder and decoder, with a language-agnostic knowledge model in between. But since it's generally more efficient to train the whole model end to end, such modularity would probably come at a performance price, and I don't see any private (or "non-profit") companies taking that approach anytime soon.
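
Very roughly, the shape I have in mind is something like this sketch (made-up module sizes, nothing resembling how current models are actually built or trained):

    # Sketch of the modular idea only: language-specific encoders/decoders
    # around a shared, language-agnostic core. All dimensions are made up.
    import torch.nn as nn

    class ModularLM(nn.Module):
        def __init__(self, vocab_sizes: dict, d_model: int = 512, n_core_layers: int = 12):
            super().__init__()
            # language-specific edges: one small encoder/decoder per language
            self.encoders = nn.ModuleDict(
                {lang: nn.Embedding(v, d_model) for lang, v in vocab_sizes.items()})
            self.decoders = nn.ModuleDict(
                {lang: nn.Linear(d_model, v) for lang, v in vocab_sizes.items()})
            # shared "knowledge" core, identical regardless of input language
            layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            self.core = nn.TransformerEncoder(layer, num_layers=n_core_layers)

        def forward(self, token_ids, lang: str):
            x = self.encoders[lang](token_ids)  # language-specific entry
            x = self.core(x)                    # language-agnostic representation
            return self.decoders[lang](x)       # language-specific exit

The catch is the one above: you would have to keep the core genuinely language-agnostic during training, and the end-to-end version would probably still win on raw benchmarks.
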
9. TimPC ◴[] No.43692745[source]
My supervising professor in the PhD program I left did a paper on the Chinese Room and argued that, to a large degree, understanding of a task was the ability to compress it by many orders of magnitude. In that sense the LLMs are succeeding, because despite their supposedly massive parameter sets they are absolutely tiny compared to the Chinese Room version.
10. input_sh ◴[] No.43692839[source]
Both, but primarily due to the lack of training materials. 10 or so million native speakers of my language will never be able to generate the same amount of training material as over a billion English speakers do.

There is a steep drop in quality in any non-English language, but in general, fewer native speakers = worse results. The responses tend to have a certain "voice" which is extremely easy to spot, and the accuracy goes out the window (way worse than in English).

replies(1): >>43693716 #
11. petesergeant ◴[] No.43693716[source]
Right, but it’s interesting that that means its reasoning abilities potentially drop off when it’s speaking Thai, or its knowledge of WW2 history in the Eastern Theatre might drop off when speaking French, while the same model has no trouble with the same questions in English. My French and Thai are both rudimentary, but I’m working from the same set of facts and the same reasoning ability in both languages. Will it give different answers about what the greatest empire that ever existed was if you ask it in Mandarin vs Italian vs Mongolian?
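
That would be a cheap experiment to run, something like the sketch below, where `llm` is a placeholder for whatever completion call you have and the translations are only approximate:

    # Toy cross-language consistency check; `llm(prompt)` is a placeholder
    # for any chat-completion call, and the translations are approximate.
    QUESTIONS = {
        "Mandarin":  "历史上最伟大的帝国是哪一个？",
        "Italian":   "Qual è stato il più grande impero mai esistito?",
        "Mongolian": "Түүхэн дэх хамгийн агуу эзэнт гүрэн аль нь вэ?",
    }

    def compare_answers(llm):
        for lang, question in QUESTIONS.items():
            answer = llm(question)
            # translate back to English so the answers can be compared side by side
            back = llm("Translate the following to English:\n" + answer)
            print(f"--- {lang} ---\n{back}\n")
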
12. namaria ◴[] No.43697719[source]
They have more data in English than Spanish. LLMs don't know or reason or follow instructions. They merely render text continuations that are coherent with the expectations you set when prompting. The fact that they are not able to sustain the illusion in languages with less available training data than English should make that clear.
replies(1): >>43699837 #
13. numpad0 ◴[] No.43697883{3}[source]
Modern Standard Chinese is almost syntactically "identical" to English, for some reason. And French was a direct ancestor of the medieval British language that became modern English.

My point is, those language pairs aren't random examples. Chinese isn't something completely foreign and new when it comes to its differences from English.

14. shaky-carrousel ◴[] No.43699837{3}[source]
> They have more data in English than Spanish.

Yep, that there seems like the definition of knowing. Don't worry, your humanity isn't at risk.

replies(1): >>43701737 #
15. namaria ◴[] No.43701737{4}[source]
No, mental models matter. This has nothing to do with AGI doomerism.

Knowing implies reasoning. LLMs don't "know" things. These statistical models continuate text. Having a mental model that they "know" things, that they can "reason" or "follow instructions" is driving all sorts of poor decisions.

Software has an abstraction fetish. So much of the material available for learners is riddled with analogies and a "you don't need to know that" attitude. That is counterproductive, and I think having accurate mental models matters.

replies(2): >>43701833 #>>43710415 #
16. ako ◴[] No.43701754{3}[source]
Sounds like it is capable of thinking in abstract concepts rather than in related/connected words? So that training material in different languages would all add to knowledge of the same concepts?

It is like the difference between a student in school who is really brilliant at learning by heart and repeating the words they studied, but doesn't understand the concept, versus a student who actually understands the topic and can reason about the concepts.

17. petesergeant ◴[] No.43701833{5}[source]
> Knowing implies reasoning

That's not really clear-cut; that's simply a position you're taking. A JTB (justified true belief) account could, I reckon, say that a model's "knowledge" is justified by the training process and reward functions.

> LLMs don't "know" things. These statistical models continuate text.

I don't think it's clear to anyone at this point whether the steps taken before token selection (e.g., the journey through their high-dimensional knowledge space provided by attention) are close to or far from how our own thought processes work, but describing LLMs as "simply" continuating text reduces them to their outputs. From my perspective, as someone on the other side of a text-based web app from you, you also are an entity that simply continuates text.

You have no way of knowing whether this comment was written by a sentient entity -- with thoughts and agency -- or an LLM.

18. shaky-carrousel ◴[] No.43710415{5}[source]
I have to disagree. We've been using "knowing" for programs for decades without requiring it to imply reasoning. Just because the output now looks more realistic doesn't mean we need to suddenly get philosophical about it. That shift says more about us than about the software.

And while accurate mental models can help in certain contexts, they're not always necessary. I don't need a detailed model of how my OS handles file operations to use it effectively. A high-level understanding is usually enough. Insisting on deep internal accuracy in every case seems more like gatekeeping than good practice.