I was surprised that this paper talked more about RAG solutions than tool-use-based solutions. Tool use seems to me like a proven solution at this point.
The problem is that many hallucinations don't produce a runtime error and can be very difficult for a human to spot, even when the code is thoroughly reviewed (which in many cases it isn't). They can introduce security issues, do something completely different from what the user asked (or didn't ask for), do things inefficiently, ignore conventions and language idioms, or just be dead code.
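To make that concrete, here's a made-up miniature of the "security issue with no runtime error" case. The function names are invented for illustration, but random and secrets are both real stdlib modules:

    import random
    import secrets

    # What an LLM might plausibly emit for "generate a session token".
    # It runs without error and looks reasonable, but random is not a
    # cryptographically secure PRNG, so the tokens are predictable.
    def make_session_token() -> str:
        return "%032x" % random.getrandbits(128)

    # The version a reviewer should insist on: secrets exists for this.
    def make_session_token_safe() -> str:
        return secrets.token_hex(16)

Nothing crashes, no test fails, and unless the reviewer knows the random-vs-secrets distinction, the bug ships.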
For runtime errors, feeding them back to the LLM, as you say, might fix them. But even in those cases, the produced "fix" often contains more hallucinations. I don't use agents, but I've often experienced the loop of pasting the error back to the LLM, only to get a confident yet non-working response using hallucinated APIs.
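A made-up miniature of that loop, using Python's real json module (json.parse and json.decode genuinely don't exist; json.loads does):

    import json

    # Round 1: the model confidently calls an attribute that doesn't exist
    # (json.parse is JavaScript's JSON.parse, not Python's API):
    #   data = json.parse('{"a": 1}')   # AttributeError: module 'json' has no attribute 'parse'

    # Round 2: paste the traceback back in, and the "fix" is another
    # nonexistent attribute, delivered just as confidently:
    #   data = json.decode('{"a": 1}')  # AttributeError again

    # The API the model was circling around the whole time:
    data = json.loads('{"a": 1}')
    print(data)  # {'a': 1}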
So this problem isn't something external tools can solve; it requires a much deeper solution. RAG might be a good initial attempt, but I suspect an architectural solution will be needed to address the root cause. This is important because hallucination is a general problem, and doesn't just affect code generation.
Don't hallucinations mean nonexistent things, that is, in the case of code: functions, classes, etc.? How could they fail to lead to a runtime error, then? The fact that LLMs can produce unreliable or inefficient code is a different problem, isn't it?
That isn't to say it's a universal problem. In some applications, such as image, video or audio generation, especially in the entertainment industries, hallucinations can be desirable. They're partly what we identify as "creativity", and the results can be fun and interesting. But in applications where facts and reality matter, they're a big problem.
It's one thing if you are just creating a throwaway prototype, or something so simple that you will naturally exercise 100% of the code when testing it, but when you start building anything non-trivial it's easy to have many code paths/flows that are rarely executed or tested. Maybe you wrote unit tests for all the obvious corner cases, but did you consider whether the code is correct when conditions A, then B, then C ... occur? Even 100% code coverage (every line of code tested) isn't going to help you there, as the sketch below shows.
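A minimal sketch of that point (the function and flags are invented for illustration): two tests can touch every line, yet the one path that combines both branches is the one that breaks.

    def adjust(x: float, negate: bool = False, shift: bool = False) -> float:
        if negate:   # condition A
            x = -x
        if shift:    # condition B
            x += 1
        return 1 / x

    # These two tests execute every line of adjust(): 100% line coverage.
    assert adjust(2.0, negate=True) == -0.5
    assert adjust(2.0, shift=True) == 1 / 3

    # But the untested A-then-B path still blows up:
    try:
        adjust(1.0, negate=True, shift=True)  # x: 1 -> -1 -> 0
    except ZeroDivisionError:
        print("only the combined path fails")

Line coverage says "everything was tested"; path coverage would reveal the hole, but the number of paths grows combinatorially, which is exactly why rarely executed flows hide bugs.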