LLM Hallucinations in Practical Code Generation

(dl.acm.org)

65 points appwiz | 1 comments | 23 Jun 25 07:14 UTC

simonw ◴[26 Jun 25 02:22 UTC] No.44383691[source]▶

I still don't think hallucinations in generated code matter very much. They show up the moment you try to run the code, and with the current batch of "coding agent" systems it's the LLM itself that spots the error when it attempts to run the code.

I was surprised that this paper talked more about RAG solutions than tool-use based solutions. Those seem to me like a proven solution at this point.

replies(4): >>44384474 #>>44384576 #>>44387027 #>>44388124 #

imiric ◴[26 Jun 25 05:38 UTC] No.44384474[source]▶

>>44383691 #

I'm surprised to read that from a prominent figure in the industry such as yourself.

The problem is that many hallucinations do not produce a runtime error, and can be very difficult to spot by a human, even if the code is thoroughly reviewed, which in many cases doesn't happen. These can introduce security issues, do completely different things from what the user asked (or didn't ask), do things inefficiently, ignore conventions and language idioms, or just be dead code.
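A minimal Python sketch of the kind of silent mistake described above, assuming a hypothetical request for a "round half up" helper. Both function names are invented for illustration. The generated version runs without any error, yet it does something different from what was asked, because Python's built-in round() rounds ties to the nearest even integer.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up_generated(x: float) -> int:
    # Hypothetical LLM output: plausible and error-free, but the built-in
    # round() uses ties-to-even ("banker's rounding"), not half-up.
    return round(x)


def round_half_up_intended(x: float) -> int:
    # What the (hypothetical) user actually asked for: ties always round up.
    return int(Decimal(str(x)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


print(round_half_up_generated(2.5))  # 2
print(round_half_up_intended(2.5))   # 3
```

Nothing here fails at runtime, so only a test or a careful reviewer who knows the tie-breaking rule would notice the difference.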

For runtime errors, feeding them back to the LLM, as you say, might fix them. But even in those cases, the produced "fix" can often contain more hallucinations. I don't use agents, but I've often experienced the loop of pasting the error back to the LLM, only to get a confident yet non-working response using hallucinated APIs.
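The paste-the-error-back loop can be sketched roughly as follows. This is an illustration under stated assumptions, not how any particular agent works: `run_snippet`, `repair_loop`, and `stub_model` are hypothetical names, and `stub_model` stands in for a real LLM call. The round limit reflects the concern above that each "fix" may introduce a new hallucination.

```python
import subprocess
import sys
from typing import Callable, Optional


def run_snippet(code: str) -> Optional[str]:
    """Run code in a fresh interpreter; return stderr on failure, None on success."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=30,
    )
    return None if proc.returncode == 0 else proc.stderr


def repair_loop(code: str,
                ask_model: Callable[[str, str], str],
                max_rounds: int = 3) -> tuple[str, bool]:
    """Feed runtime errors back to the model until the code runs or rounds run out."""
    for _ in range(max_rounds):
        error = run_snippet(code)
        if error is None:
            return code, True
        code = ask_model(code, error)  # the reply may itself be wrong
    return code, run_snippet(code) is None


def stub_model(code: str, error: str) -> str:
    # Stand-in for an LLM (assumption): swaps the invented method for a real idiom.
    return code.replace('"abc".reverse()', '"abc"[::-1]')


# str has no reverse() method, so the first run fails with AttributeError.
fixed, ok = repair_loop('print("abc".reverse())', stub_model)
```

Note that `ok` only means the process exited cleanly. It says nothing about whether the code does what the user wanted.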

So this problem is not something external tools can solve, and requires a much deeper solution. RAG might be a good initial attempt, but I suspect an architectural solution will be needed to address the root cause. This is important because hallucination is a general problem, and doesn't affect just code generation.

replies(2): >>44384732 #>>44387733 #

simonw ◴[26 Jun 25 14:24 UTC] No.44387733[source]▶

>>44384474 #

If you define "hallucinations" to mean "any mistakes at all" then yes, a compiler won't catch them for you.

I define hallucinations as a particular class of mistakes where the LLM invents e.g. a function or method that does not exist. Those are solved by ensuring the code runs. I wrote more about that here: https://simonwillison.net/2025/Mar/2/hallucinations-in-code/
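As a small illustration of this narrower definition, here is a sketch in Python, assuming a model that borrows JavaScript's JSON.parse and emits `json.parse`. The example is invented for this purpose; Python's json module provides `loads`, not `parse`, so the mistake surfaces the first time the line executes.

```python
import json

try:
    # Invented API (assumed model output): json has loads(), not parse().
    json.parse('{"a": 1}')
except AttributeError as exc:
    message = str(exc)
    print(message)

# The real call works as expected.
data = json.loads('{"a": 1}')
```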

Even beyond that more narrow definition of a hallucination, tool use is relevant to general mistakes made by an LLM. The new Phoenix.new coding agent actively tests the web applications it is writing using a headless browser, for example: https://simonwillison.net/2025/Jun/23/phoenix-new/

The more tools like this come into play, the less concern I have about the big black box of matrices occasionally hallucinating up some code that is broken in obvious or subtle ways.

It's still on us as the end users to confirm that the code written for us actually does the job we set out to do. I'm fine with that too.

replies(2): >>44388994 #>>44390551 #

1. HarHarVeryFunny ◴[26 Jun 25 16:37 UTC] No.44388994[source]▶

>>44387733 #

I think the more general/useful definition of "hallucination" is any time the LLM predicts the next word based on the "least worst" (statistically) choice rather than on any closely matching samples in the training data.

The LLM has to generate some word each time it is called, and unless it recognizes soon enough that "I don't know" is the best answer (in and of itself problematic, since any such prediction would be based on the training data, not the LLM's own aggregate knowledge!), it may back itself into a corner where it has no well-grounded continuation but nonetheless has to spit out the statistically best prediction, even if that is a very bad, ungrounded prediction such as a non-existent API, a concocted "fits the profile" answer, or anything else ...
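A toy sketch of the "has to spit out something" point, assuming plain greedy decoding over a four-token vocabulary. The vocabulary and logits are made up, and real models sample from far larger distributions, often with temperature. The point is only that the selection step returns a token whether the distribution is sharply peaked or almost flat.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(vocab, logits):
    """Return the most probable token and its probability; never abstains."""
    probs = softmax(logits)
    best = max(range(len(vocab)), key=probs.__getitem__)
    return vocab[best], probs[best]


vocab = ["loads", "parse", "decode", "read"]  # hypothetical candidate tokens

# Well-grounded case: one continuation clearly dominates.
confident = greedy_pick(vocab, [9.0, 1.0, 0.5, 0.2])

# Ungrounded case: nearly flat distribution, yet a token is still emitted.
unsure = greedy_pick(vocab, [1.0, 1.1, 0.9, 1.0])

print(confident, unsure)
```

In the second case the winner carries under a third of the probability mass, which is the "least worst" situation described above.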

Of course the LLM's output builds on itself, so any ungrounded/hallucinated output doesn't need to be limited to a single word or API call, but may instead consist of a whole "just trying my best" sentence or chunk of code (better hope you have unit test code coverage to test/catch it).
