notepad0x90:
My fear is that LLM-generated code will look great to me: I won't understand it fully, but it will work. And since I didn't author it, I won't be great at finding bugs or logical flaws in it. Especially if you consider coding as piecing things together instead of implementing a well-designed plan: lots of pieces making up the whole picture, but a lot of those pieces are now put there by an algorithm making educated guesses.

Perhaps I'm just not that great of a coder, but I do have lots of code where, if someone took a look at it, it might look crazy, but it really is the best solution I could find. I'm concerned LLMs won't do that; they won't take the risks a human would, or understand the implications of a block of code beyond its application in that specific context.

Other times, I feel like I'm pretty good at figuring things out, struggling in a time-efficient manner before arriving at a solution. LLM-generated code is neat, but I still have to spend a similar amount of time, except now I'm doing more QA and cleanup work instead of debugging and figuring out new solutions, which isn't fun at all.

hakaneskici:
When it comes to relying on code that you didn't write yourself, like an npm package, do you care if it's AI code or human code? Do you think your trust toward AI code may change over time?
sfink:
Of course I care. Human-written code was written for a purpose, with a set of constraints in mind, and other related code will have been written for the same or a complementary purpose and set of constraints. There is intention in the code. It is predictable in a certain way, and divergences from the expected are either because I don't fully understand something about the context or requirements, or because there's a damn good reason. It is worthwhile to dig further until I do understand, since it will very probably have repercussions elsewhere and elsewhen.

For AI code, that's a waste of time. The generated code will be based on an arbitrary patchwork of purposes and constraints, glued together well enough to function. I'm not saying it lacks purpose or constraints, it's just that those are inherited from random sources. The parts flow together with robotic but not human concern for consistency. It may incorporate brilliant solutions, but trying to infer intent or style or design philosophy is about as useful as doing handwriting analysis on a ransom note made from pasted-together newspaper clippings.

Both sorts of code have value. AI code may be well-commented. It may use features effectively that a human might have missed. Just don't try to anthropomorphize an AI coder or a lawnmower; you'll end up inventing an intent that doesn't exist.

gunian:
what if you

- generate
- lint
- format
- fuzz
- test
- update

infinitely?
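A minimal sketch of what that loop could look like, assuming Python, with `ruff` and `pytest` standing in for the lint/format/fuzz/test steps and a hypothetical `generate_code` LLM call:

    # Hypothetical generate-lint-format-fuzz-test loop. `generate_code` is a
    # placeholder; ruff/pytest are stand-ins for whatever tooling you use.
    import subprocess

    def generate_code(feedback: str) -> str:
        """Placeholder for an LLM call returning a new candidate module."""
        raise NotImplementedError

    feedback = "initial prompt"
    while True:
        with open("candidate.py", "w") as f:
            f.write(generate_code(feedback))
        for cmd in (["ruff", "format", "candidate.py"],
                    ["ruff", "check", "candidate.py"],
                    ["pytest", "tests/"]):  # fuzz/property tests live here too
            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode != 0:
                feedback = result.stdout + result.stderr  # feed errors back
                break
        else:
            break  # everything passed; ship ("update") or keep iterating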

sfink:
Then you'll get code that passes the tests you generate, where "tests" includes whatever you feed the fuzzer to detect problems. (Just crashes? Timeouts? Comparison with a gold standard?)
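For the gold-standard case, a property-based sketch (using `hypothesis`; `fast_sort` is a hypothetical stand-in for generated code, not anything from this thread):

    # Gold-standard fuzz oracle: random inputs, trusted reference as truth.
    from hypothesis import given, settings
    from hypothesis import strategies as st

    def fast_sort(xs):
        return sorted(xs)  # imagine LLM-generated code here instead

    @given(st.lists(st.integers()))
    @settings(max_examples=1000)
    def test_matches_gold_standard(xs):
        # Python's built-in sorted() plays the gold standard
        assert fast_sort(list(xs)) == sorted(xs)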

Sorry, I'm failing to see your point.

Are you implying that the above is good enough, for a useful definition of good enough? I'm not disagreeing, and in fact that was my starting assumption in the message you're replying to.

Crap code can pass tests. Slow code can pass tests. Weird code can pass tests. Sometimes it's fine for code to be crap, slow, and/or weird. If that's your situation, then go ahead and use the code.
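A throwaway illustration of the "slow code can pass tests" point: both versions below pass the same assertion, and only one of them survives contact with n = 50.

    def fib_slow(n):
        # Exponential time: correct, passes tests, useless at scale.
        return n if n < 2 else fib_slow(n - 1) + fib_slow(n - 2)

    def fib_fast(n):
        # Linear time: same answers, very different time budget.
        a, b = 0, 1
        for _ in range(n):
            a, b = b, a + b
        return a

    assert fib_slow(10) == fib_fast(10) == 55  # a typical test: both pass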

To expand on why someone might not want such code, think of your overall codebase as having a time budget, a complexity budget, a debuggability budget, an incoherence budget, and a maintenance budget. Yes, those overlap a bunch. A pile of AI-written code has a higher chance of exceeding some of those budgets than a human-written codebase would. Yes, there will be counterexamples. But humans will at least attempt to optimize for such things. AIs mostly won't. The AI-and-AI-using-human system will optimize for making it through your lint-fuzz-test cycle successfully and little else.

Different constraints, different outputs. Only you can decide whether the difference matters to you.

intended:
Who has that much time and money when your boss is breathing down your neck?
pixelfarmer:
> Then you'll get code that passes the tests you generate

Just recently, I think here on HN, there was a discussion about how neural networks optimize toward the goal they are given, which in this case means exactly what you wrote: the code will do things in wrong ways just to pass the given tests.

Where do the tests come from? Initially from a specification of what "that thing" is supposed to do and also not supposed to do. Everyone who has had to deal with specifications in a serious way knows how insanely difficult it is to get them right: there are often things left unsaid, corner cases not covered, and so on. So the problem of correctness is just shifted, and as for the assumption that this may require less time than actually coding ... I wouldn't bet on it.
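A toy illustration of the unsaid-corner-case problem (mine, hypothetical): every spec-derived test passes, and the gap in the spec still bites.

    def average(scores):
        """Spec: return the average of the scores."""
        return sum(scores) / len(scores)

    assert average([1.0, 2.0, 3.0]) == 2.0  # spec-derived tests all pass
    assert average([10.0]) == 10.0
    # ...but the spec never said what an empty list means, so no test does
    # either: average([]) raises ZeroDivisionError in production.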

Conceptually the idea should work, though.

gunian:
what if you thought of your codebase as something similar to human DNA, the LLM as nature, and the entire process as some sort of evolutionary process? the fitness function would be no panics, no exceptions, and low latency instead of some random KPI or OKR, or who likes working with who, or who made who laugh
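Taking that fitness function literally, a minimal sketch (the `mutate` step, i.e. the LLM-as-nature part, is a placeholder, not a real API):

    # Evolutionary selection where fitness = no exceptions + low latency.
    import time

    def fitness(candidate):
        start = time.perf_counter()
        try:
            candidate()
        except Exception:
            return float("-inf")               # any "panic" disqualifies
        return -(time.perf_counter() - start)  # faster -> fitter

    def mutate(candidate):
        """Placeholder for the LLM rewriting a candidate's source."""
        return candidate

    population = [lambda: sum(range(10_000))]  # seed genome
    for _ in range(100):                       # generations
        population += [mutate(c) for c in population]
        population.sort(key=fitness, reverse=True)
        population = population[:10]           # keep the fittest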

it's what our lord and savior jesus christ uses for us humans; if it is good for him, it's good enough for me. and surely google is not laying off 25k people because it believes humans are better than its LLMs :)