LLMs are the fast food of search. The business model of LLMs incentivizes hallucinations.
Sure, it might be true that most users use LLMs as a more flexible version of Google/Wikipedia, and would prefer a confident-but-wrong response to "I don't know".
But most users who use an LLM in this mode also wouldn't ask the really complex, out-of-distribution, hard-to-know questions that induce hallucinations.
And the people who would ask those questions are more likely to appreciate an LLM that recognizes the limits of its own knowledge and does research on a topic when appropriate.
You appear to be assuming, incorrectly, that LLMs hallucinate only on "really complex, very out-of-distribution, hard-to-know" questions. From the paper: "How many Ds are in DEEPSEEK? If you know, just say the number with no commentary. DeepSeek-V3 returned “2” or “3” in ten independent trials; Meta AI and Claude 3.7 Sonnet performed similarly, including answers as large as “6” and “7”." https://cdn.openai.com/pdf/d04913be-3f6f-4d2b-b283-ff432ef4a...
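For reference, the ground truth the paper is testing is trivial to check (a quick Python sketch of my own; the string and letter come from the quoted prompt):

    # Count occurrences of the letter D in "DEEPSEEK"
    word = "DEEPSEEK"
    print(word.count("D"))  # prints 1, versus the models' answers of 2 through 7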
It's a human characteristic to get "easy" questions right and "hard" questions wrong. But LLMs are not human and don't behave like humans.
Those LLMs weren't very aware of their tokenizer limitations, let alone aware enough to recognize or work around them in the wild.
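To illustrate the tokenizer point, here's a minimal sketch using the tiktoken library with the cl100k_base encoding as an example; the exact splits vary by model and tokenizer:

    import tiktoken

    # Models operate on subword token IDs, not individual letters,
    # so "DEEPSEEK" arrives as a few multi-character chunks.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("DEEPSEEK")
    print(tokens)                             # integer token IDs
    print([enc.decode([t]) for t in tokens])  # the chunks the model actually "sees"

Whichever way a given tokenizer splits it, the model typically never sees the eight individual letters directly, which is why letter-counting is a known weak spot.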
No, it's not. It's a trivial question in any context.
> for the early LLMs.
Early? Claude 3.7 was introduced just 6 months ago, and DeepSeek-V3 9 months ago. How is that "early"?
Please respect the HN guidelines: https://news.ycombinator.com/newsguidelines.html
What you need to explain is your claim that the cited LLMs are "early". According to the footnotes, the paper has been in the works since at least May 2025. Thus, those LLMs may have been the latest at the time, which was not that long ago.
In any case, given your guidelines violations, I won't be continuing in this thread.
LLMs are also really great at this skill when there is ample data for it. There is not a lot of data for "how many Ds are in DEEPSEEK", so they fail at that.