LLMs are the fast food of search. The business model of LLMs incentivizes hallucinations.
Sure, it might be true that most users use LLMs as a more flexible version of Google/Wikipedia, and would prefer a confident-but-wrong response to "I don't know".
But most users who use an LLM in this mode also wouldn't ask the really complex, out-of-distribution, hard-to-know questions that induce hallucinations.
And the people who would ask those questions are more likely to appreciate an LLM that recognizes the limits of its own knowledge and does research on a topic when appropriate.
You appear to be assuming, incorrectly, that LLMs hallucinate only on "really complex, very out-of-distribution, hard-to-know" questions. From the paper: "How many Ds are in DEEPSEEK? If you know, just say the number with no commentary. DeepSeek-V3 returned “2” or “3” in ten independent trials; Meta AI and Claude 3.7 Sonnet performed similarly, including answers as large as “6” and “7”." https://cdn.openai.com/pdf/d04913be-3f6f-4d2b-b283-ff432ef4a...
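For reference, the ground truth the paper is testing is trivial to check (a quick Python sketch of my own; the string and letter come from the quoted prompt):

    # Count occurrences of the letter D in "DEEPSEEK"
    word = "DEEPSEEK"
    print(word.count("D"))  # prints 1, versus the models' answers of 2 through 7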
It's a human characteristic to get "easy" questions right and "hard" questions wrong. But LLMs are not human and don't behave like humans.
Those LLMs weren't very aware of their tokenizer limitations, let alone aware enough to recognize or work around them in the wild.
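To illustrate the tokenizer point, here's a minimal sketch using the tiktoken library with the cl100k_base encoding as an example; the exact splits vary by model and tokenizer:

    import tiktoken

    # Models operate on subword token IDs, not individual letters,
    # so "DEEPSEEK" arrives as a few multi-character chunks.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("DEEPSEEK")
    print(tokens)                             # integer token IDs
    print([enc.decode([t]) for t in tokens])  # the chunks the model actually "sees"

Whichever way a given tokenizer splits it, the model typically never sees the eight individual letters directly, which is why letter-counting is a known weak spot.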
No, it's not. It's a trivial question in any context.
> for the early LLMs.
Early? Claude 3.7 was introduced just 6 months ago, and DeepSeek-V3 9 months ago. How is that "early"?
Please respect the HN guidelines: https://news.ycombinator.com/newsguidelines.html
What you need to explain is your claim that the cited LLMs are "early". According to the footnotes, the paper has been in the works since at least May 2025. Thus, those LLMs may have been the latest at the time, which was not that long ago.
In any case, given your guidelines violations, I won't be continuing in this thread.
LLMs are also really great at this skill when there is ample data for it. There is not a lot of data for "how many Ds are in DEEPSEEK", so they fail at that.