
65 points appwiz | 6 comments
simonw ◴[] No.44383691[source]
I still don't think hallucinations in generated code matter very much. They show up the moment you try to run the code, and with the current batch of "coding agent" systems it's the LLM itself that spots the error when it attempts to run the code.

I was surprised that this paper talked more about RAG solutions than tool-use based solutions. Those seem to me like a proven solution at this point.

replies(4): >>44384474 #>>44384576 #>>44387027 #>>44388124 #
imiric ◴[] No.44384474[source]
I'm surprised to read that from a prominent figure in the industry such as yourself.

The problem is that many hallucinations do not produce a runtime error, and can be very difficult to spot by a human, even if the code is thoroughly reviewed, which in many cases doesn't happen. These can introduce security issues, do completely different things from what the user asked (or didn't ask), do things inefficiently, ignore conventions and language idioms, or just be dead code.
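For instance (a contrived, hypothetical snippet), a generated "validator" can execute cleanly on every input, look plausible in review, and still be wrong:

```python
# Hypothetical illustration: this code never raises, so no agent loop
# will ever see an error to fix.
def is_valid_email(address: str) -> bool:
    # Subtle bug: only checks that '@' and '.' appear somewhere,
    # so strings like "@." pass.
    return "@" in address and "." in address

assert is_valid_email("user@example.com")  # behaves as expected
assert is_valid_email("@.")                # silently accepts garbage
```
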

For runtime errors, feeding them back to the LLM, as you say, might fix it. But even in those cases, the produced "fix" can often contain more hallucinations. I don't use agents, but I've often experienced the loop of pasting the error back to the LLM, only to get a confident yet non-working response using hallucinated APIs.

So this problem is not something external tools can solve, and requires a much deeper solution. RAG might be a good initial attempt, but I suspect an architectural solution will be needed to address the root cause. This is important because hallucination is a general problem, and doesn't affect just code generation.

replies(2): >>44384732 #>>44387733 #
1. lelele ◴[] No.44384732[source]
> The problem is that many hallucinations do not produce a runtime error [...]

Don't hallucinations mean nonexistent things? That is, in the case of code: functions, classes, etc. How could they fail to lead to a runtime error, then? The fact that LLMs can produce unreliable or inefficient code is a different problem, isn't it?

replies(2): >>44384780 #>>44385134 #
2. plausibilitious ◴[] No.44384780[source]
This argument is the reason why LLM output failing to match reality was labelled 'hallucination'. It makes it seem like the LLM only makes mistakes in a neatly verifiable manner.

The 'jpeg of the internet' argument was more apt I think. The output of LLMs might be congruent with reality and how the prompt contents represent reality. But they might also not be, and in subtle ways too.

If only all code that has any flaw in it would refuse to run. That would be truly amazing. Alas, there are several orders of magnitude more sequences of commands that can be run than should be run.
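Even a genuinely nonexistent API doesn't guarantee a prompt failure in a dynamic language: it only raises when the branch containing it is actually executed. A contrived Python sketch (the method name is invented):

```python
# A hallucinated method only raises when its branch runs.
def summarize(items: list[str], verbose: bool = False) -> str:
    if verbose:
        # 'list' has no 'to_pretty_string' method: AttributeError at runtime,
        # but only if a caller ever passes verbose=True.
        return items.to_pretty_string()
    return f"{len(items)} items"

print(summarize(["a", "b"]))  # fine: the broken branch never executes
```
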

3. imiric ◴[] No.44385134[source]
Hallucinations can manifest in different ways. Using nonexistent APIs is just one of them. The LLM could just as well hallucinate code that doesn't fix a problem, or hallucinate that a problem exists in the first place, all while using existing APIs. This might not be a major issue for tasks like programming, where humans can relatively easily verify the output, but in other scientific fields verification can be much more labor-intensive and practically infeasible, as this recent example[1] showcases. So hallucination is a problem that involves any fabricated output that isn't grounded in reality.

Which isn't to say that it is a universal problem. In some applications such as image, video or audio generation, especially in entertainment industries, hallucinations can be desirable. They're partly what we identify as "creativity", and the results can be fun and interesting. But in applications where facts and reality matter, they're a big problem.

[1]: https://news.ycombinator.com/item?id=44174965

replies(1): >>44389078 #
4. HarHarVeryFunny ◴[] No.44389078[source]
You can test every line of code in your program, but how many people actually do?

It's one thing if you are just creating a throwaway prototype, or something so simple that you will naturally exercise 100% of the code when testing it, but when you start building anything non-trivial it's easy to have many code paths/flows that are rarely executed or tested. Maybe you wrote unit tests for all the obvious corner cases, but did you consider the code's correctness when conditions A, then B, then C occur in sequence? Even 100% code coverage (every line of code tested) isn't going to help you there.
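A contrived, hypothetical example of that gap: every line below is executed by some test, yet the bug only appears for one particular sequence of calls.

```python
# Each method is covered individually, but a call-order bug still lurks.
class Counter:
    def __init__(self) -> None:
        self.total = 0
        self.closed = False

    def add(self, n: int) -> None:
        self.total += n

    def close(self) -> None:
        self.closed = True
        self.total = 0  # subtle: closing silently discards the running total

# Per-method tests pass, giving 100% line coverage...
c = Counter(); c.add(5); assert c.total == 5
c = Counter(); c.close(); assert c.closed
# ...but the sequence add -> close -> add loses state the caller relied on.
c = Counter(); c.add(5); c.close(); c.add(1)
assert c.total == 1  # a caller assuming close() preserves the total expected 6
```
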

replies(1): >>44389587 #
5. simonw ◴[] No.44389587{3}[source]
> You can test every line of code in your program, but how many people actually do?

In my mind, that's what separates genuinely excellent professional programmers from everybody else.

replies(1): >>44392437 #
6. HarHarVeryFunny ◴[] No.44392437{4}[source]
I think it's perhaps more that you learn to write code that is easy to test and debug, consisting of some minimal set of simple orthogonal components, etc. You test every function of course, but learn to intuitively design out these combinatorial complexities that could nonetheless still be lurking, and pre-emptively include assertions to try to catch anything you may have overlooked.
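As a small illustration of that habit (a hypothetical function, not from any particular codebase), pre-emptive assertions turn an overlooked combination into an immediate, debuggable failure instead of silent corruption downstream:

```python
def apply_discount(price: float, rate: float) -> float:
    # Pre-emptive assertions: if some rarely-exercised path ever feeds in
    # a bad combination, it fails loudly right here.
    assert price >= 0, "negative price reached discount logic"
    assert 0 <= rate <= 1, "discount rate out of range"
    return price * (1 - rate)

print(apply_discount(100.0, 0.2))  # 80.0
```
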