
371 points ulrischa | 32 comments
1. layer8 ◴[] No.43235766[source]
> Just because code looks good and runs without errors doesn’t mean it’s actually doing the right thing. No amount of meticulous code review—or even comprehensive automated tests—will demonstrably prove that code actually does the right thing. You have to run it yourself!

I would have stated this a bit differently: No amount of running or testing can prove the code correct. You actually have to reason through it. Running/testing is merely a sanity/spot check of your reasoning.
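
To make the point concrete, here's a minimal made-up sketch: every spot check below passes, yet the code is still wrong for inputs nobody happened to run.

  #include <assert.h>

  /* Hypothetical example: each assertion below passes, but the
     function is broken (signed overflow) for large inputs. */
  static int midpoint(int lo, int hi) {
      return (lo + hi) / 2;   /* overflows when lo + hi exceeds INT_MAX */
  }

  int main(void) {
      assert(midpoint(0, 10) == 5);   /* passes */
      assert(midpoint(2, 8) == 5);    /* passes */
      /* midpoint(2000000000, 2000000000) is undefined behaviour,
         which only reasoning about the arithmetic will reveal. */
      return 0;
  }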

replies(4): >>43235828 #>>43235856 #>>43236195 #>>43236756 #
2. dmos62 ◴[] No.43235828[source]
Well, what if you run a complete test suite?
replies(5): >>43236019 #>>43236118 #>>43236949 #>>43237000 #>>43237496 #
3. nnnnico ◴[] No.43235856[source]
Not sure that human reasoning actually beats testing when checking for correctness.
replies(4): >>43235988 #>>43236031 #>>43237608 #>>43239891 #
4. ljm ◴[] No.43235988[source]
The production of such tests presumably requires an element of human reasoning.

The requirements have to come from somewhere, after all.

replies(1): >>43241552 #
5. layer8 ◴[] No.43236019[source]
There is no complete test suite, unless your code is purely functional and has a small-ish finite input domain.
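
For the rare case where a complete suite is possible, a rough sketch (square_u16 is an invented example): a pure function over a 16-bit input domain can simply be checked against every possible input.

  #include <assert.h>
  #include <stdint.h>

  /* Invented pure function with a small finite input domain. */
  static uint32_t square_u16(uint16_t x) {
      return (uint32_t)x * x;
  }

  int main(void) {
      /* Exhaustive "complete" test: enumerate all 65,536 possible inputs. */
      for (uint32_t x = 0; x <= UINT16_MAX; x++) {
          assert(square_u16((uint16_t)x) == x * x);
      }
      return 0;
  }

Anything stateful, concurrent, or with an unbounded input domain rules this out.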
replies(2): >>43236080 #>>43236539 #
6. layer8 ◴[] No.43236031[source]
Both are necessary; they complement each other.
7. suzzer99 ◴[] No.43236080{3}[source]
And even then, your code could pass all tests but be a spaghetti mess that will be impossible to maintain and add features to.
8. e12e ◴[] No.43236118[source]
You mean, for example test that your sieve finds all primes, and only primes that fit in 4096 bits?
9. Snuggly73 ◴[] No.43236195[source]
Agree - case in point - dealing with race conditions. You have to reason through the code.
replies(1): >>43239453 #
10. MattSayar ◴[] No.43236539{3}[source]
Seems to be a bit of a catch-22. No LLM can write perfect code, and no test suite can catch all bugs. Obviously, no human can write perfect code either.

If LLM-generated code has been "reasoned-through," tested, and it does the job, I think that's a net-benefit compared to human-only generated code.

replies(1): >>43236792 #
11. johnrob ◴[] No.43236756[source]
I’m not sure it’s possible to have the full reasoning in your head without authoring the code yourself - or spending a comparable amount of effort to mentally rewrite it.
replies(4): >>43236933 #>>43239075 #>>43240932 #>>43241497 #
12. unclebucknasty ◴[] No.43236792{4}[source]
>I think that's a net-benefit compared to human-only generated code.

Net-benefit in what terms though? More productive WRT raw code output? Lower error rate?

Because something about the idea of generating tons of code via LLMs, which humans then have to verify, seems less productive and more error-prone to me.

I mean, when verifying code that you didn't write, you generally have to fully reason through it, just as you would to write it (if you really want to verify it). But, reasoning through someone else's code requires an extra step to latch on to the author's line of reasoning.

OTOH, if you just breeze through it because it looks correct, you're likely to miss errors.

The latter reminds me of the whole "Full self-driving, but keep your hands on the steering wheel, just in case" setup. It's going to lull you into overconfidence and passivity.

replies(2): >>43236832 #>>43238760 #
13. jmb99 ◴[] No.43236832{5}[source]
> reasoning through someone else's code requires an extra step to latch on to the author's line of reasoning.

And, in my experience, it’s a lot easier to latch on to a real person’s real line of reasoning rather than a chatbot’s “line of reasoning”

replies(2): >>43237380 #>>43239750 #
14. layer8 ◴[] No.43236933[source]
I tend to agree, which is why I’m skeptical about large-scale LLM code generation, until AIs exhibit reliable diligence and more general attention and awareness, and probably also long-term memory about a code base and its application domain.
15. ◴[] No.43236949[source]
16. shakna ◴[] No.43237000[source]
If the complete test suite were enough, then SQLite, which famously has one of the largest and most comprehensive, would not encounter bugs. It still does.

If you employ AI, you're adding a remarkable amount of speed to a problem domain that is undecidable, because most inputs are not finite. Eventually, the sheer number of chances for things to go wrong will have you reconsidering the Gambler's Fallacy.

17. unclebucknasty ◴[] No.43237380{6}[source]
Exactly. And, if correction is required, then you either re-write it or you're stuck maintaining whatever odd way the LLM approached the problem, whether it's as optimal (or readable) as a human's or not.
18. bandrami ◴[] No.43237496[source]
Paging Dr. Turing. Dr. Turing, please report to the HN comment section.
replies(1): >>43257383 #
19. fragmede ◴[] No.43237608[source]
Human reason is fine; the problem is that human attention spans aren't great at checking for correctness. I want every corner case regression tested automatically, because there's always going to be some weird configuration that a human is going to forget to regression test.
replies(1): >>43240867 #
20. rapind ◴[] No.43238760{5}[source]
> "Full self-driving, but keep your hands on the steering wheel, just in case" setup

This is actually a trick though. No one working on self-driving expects people to babysit it for long at all. Babysitting feels worse than driving. I just saw a video on self-driving trucks in which the human driver had his hands hovering over the wheel. The goal of the video is to make you think about how amazing self-driving rigs will be, but all I could think about was what an absolutely horrible job it will be to babysit these things.

Working full-time on AI code reviews sounds even worse. Maybe if it's more of a conversation and you're collaboratively iterating on small chunks of code, then it wouldn't be so bad. In reality though, we'll just end up trusting the AI because it'll save us a ton of money, and we'll find a way to externalize the screw-ups.

21. skydhash ◴[] No.43239075[source]
Which is why everyone is so keen on standards (conventions, formatting, architecture, ...): it's less of a burden when you're just comparing expected to actual than when you're learning unknowns.
22. wfn ◴[] No.43239453[source]
> case in point - dealing with race conditions.

100%. Case in point for case in point - I was just scratching my head over some lines Claude produced for me, wondering whether I should ask what this kind entity had in mind when using specific compiler builtins (vs. <stdatomic.h>), like, "is there logic to your madness..." :D

  size_t unique_ips = __atomic_load_n(&((ip_database_t*)arg)->unique_ip_count, __ATOMIC_SEQ_CST);
I think it just likes compiler builtins because I mentioned GCC at some point...
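
For comparison, the portable <stdatomic.h> spelling of that load would look roughly like this (the ip_database_t layout and the read_unique_ips wrapper are assumptions for illustration; the counter has to be declared atomic for this API):

  #include <stdatomic.h>
  #include <stddef.h>

  /* Assumed shape of the struct referenced in the snippet above. */
  typedef struct {
      atomic_size_t unique_ip_count;
  } ip_database_t;

  size_t read_unique_ips(void *arg) {
      /* Same sequentially consistent load as __atomic_load_n(...,
         __ATOMIC_SEQ_CST), but via the C11 interface. */
      return atomic_load_explicit(
          &((ip_database_t *)arg)->unique_ip_count,
          memory_order_seq_cst);
  }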
23. Ekaros ◴[] No.43239750{6}[source]
Also, after a reasonable period, if you are stuck you can actually ask them what they were thinking, why it was written that way, and what constraints they had in mind.

And you can discuss these, with both of you hopefully having experience in the domain.

24. Gupie ◴[] No.43239891[source]
"Beware of bugs in the above code; I have only proved it correct, not tried it."

Donald E. Knuth

25. sarchertech ◴[] No.43240867{3}[source]
With any non-trivial system you can’t actually test every corner case. You depend on human reason to identify the ones most likely to cause problems.
26. theshrike79 ◴[] No.43240932[source]
Spoken by someone who hasn't had to maintain Someone Else's Code on a budget.

You can't just rewrite everything to match your style. You take what's in there and adapt to the existing style; your personal preference doesn't matter.

replies(3): >>43241158 #>>43244710 #>>43249347 #
27. layer8 ◴[] No.43241158{3}[source]
They said “mentally rewrite”, not actually rewrite.
28. tuyiown ◴[] No.43241497[source]
> spending a comparable amount of effort to mentally rewrite it.

I'm pretty sure mentally rewriting it requires _more_ effort than writing it in the first place (maybe less time, though).

29. MrMcCall ◴[] No.43241552{3}[source]
I would argue that designing and implementing a working project requires human reasoning, too, but that line of thinking seems to be falling out of fashion in favor of "best next token" guessing engines.

I know what Spock would say about this approach, and I'm with him.

30. horsawlarway ◴[] No.43244710{3}[source]
It's a giant misdirection to assume the complaint is "style".

Writing is a very solid choice as an approach to understanding a novel problem. There's a quip in academia - "The best way to know if you understand something is to try teaching it to someone else". This happens to hold true for teaching it to the compiler with code you've written.

You can't skip details or gloss over things, and you have to hold "all the parts" of the problem together in your head. It builds a very strong intuitive understanding.

Once you have an intuitive understanding of the problem, it's very easy to drop into several different implementations of the solution (regardless of the style) and reason about them.

On the other hand, if you don't understand the problem, it's nearly impossible to have a good feel for why any given solution does what it does, or where it might be getting things wrong.

---

The problem with using an AI to generate the code for you is that unless you're already familiar with the problem you risk being completely out of your depth "code reviewing" the output.

The difficulty in the review isn't just literally reading the lines of code - it's in understanding the problem well enough to make a judgement call about them.

31. np- ◴[] No.43249347{3}[source]
Someone Else’s Code was understood by at least one human at some point in time before it was committed. That means that another equally skilled human is likely to be able to get the gist of it, if not understand it perfectly.
32. dmos62 ◴[] No.43257383{3}[source]
Gave me a chuckle!