
192 points | imasl42 | 1 comment
rsynnott:
This idea that you can get good results from a bad process as long as you have good quality control seems… dubious, to say the least. “Sure, it’ll produce endless broken nonsense, but as long as someone is checking, it’s fine.” This, generally, doesn’t really work. You see people _try_ it in industry a bit; have a process which produces a high rate of failures, catch them in QA, rework (the US car industry used to be notorious for this). I don’t know of any case where it has really worked out.

Imagine that your boss came to you, the tech lead of a small team, and said “okay, instead of having five competent people, your team will now have 25 complete idiots. We expect that their random flailing will sometimes produce stuff that kinda works, and it will be your job to review it all.” Now, you would, of course, think that your boss had gone crazy. No-one would expect this to produce good results. But somehow, stick ‘AI’ on this scenario, and a lot of people start to think “hey, maybe that could work.”

xyzzy123:
I have a play project which hits these constraints a lot.

I have been messing around with getting AI to implement novel (to me) data structures from papers. They're not rocket science or anything, but there's a lot of detail. Often I don't understand the complex edge cases in the algorithms myself, so I can't even "review my way out of it". I'm also working in Go, which is usually not a great fit for implementing these things because it doesn't have sum types; the lack of sum types often adds so much interface{} bloat that it would render the data structure pointless. I'm working around that with codegen for now.
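
To give a flavour of the bloat (a toy sketch, not one of the actual structures): a node that should be "either a leaf or an internal node" ends up as an interface plus type switches, which is roughly what the codegen has to paper over:

    package main

    // Toy illustration of the missing-sum-type problem: "node" is either a
    // *leaf or an *internal, but that invariant lives in a comment rather
    // than the type system, so every traversal pays for a type switch.
    type node interface{}

    type leaf struct{ keys []int }

    type internal struct {
        keys     []int
        children []node // each access needs a type assertion
    }

    func height(n node) int {
        switch v := n.(type) {
        case *leaf:
            return 1
        case *internal:
            return 1 + height(v.children[0])
        default:
            panic("impossible node kind") // the compiler cannot rule this out
        }
    }

    func main() {
        t := &internal{children: []node{&leaf{keys: []int{1, 2, 3}}}}
        _ = height(t)
    }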

What I've had to do is demote "human review" a bit; it's a critical control, but it's expensive. Instead, I think more holistically about which "guard rails" to put where and what the acceptance criteria should be. This means that by the time I'm reviewing the code I'm reasonably confident it's functionally correct, leaving me to focus on whether I like how that correctness is being achieved. This won't work for every domain, but where it's possible to automate the controls, it feels like the way to go.
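
Concretely, an "automated control" is just a test with teeth. For example, a hard budget on bytes per stored key (the package path, the API and the 64-byte number below are all placeholders, and the budget obviously has to be tuned per structure):

    package index_test

    import (
        "runtime"
        "testing"

        "example.com/play/index" // hypothetical package under test
    )

    // heapInUse forces a GC so the before/after difference mostly reflects
    // live memory rather than garbage waiting to be collected.
    func heapInUse() uint64 {
        runtime.GC()
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        return m.HeapAlloc
    }

    // TestBytesPerKeyBudget fails the build if the structure is too fat,
    // no matter how elegant the code that produced it looks in review.
    func TestBytesPerKeyBudget(t *testing.T) {
        const n = 1 << 20
        before := heapInUse()
        idx := index.New() // hypothetical constructor
        for i := 0; i < n; i++ {
            idx.Insert(i) // hypothetical insert
        }
        after := heapInUse()
        runtime.KeepAlive(idx) // keep idx live across the second measurement
        perKey := float64(after-before) / float64(n)
        if perKey > 64 { // made-up budget
            t.Fatalf("%.1f bytes/key blows the 64 bytes/key budget", perKey)
        }
    }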

The "principled" way to do this would be to use provers etc, but being more of an engineer I have resorted to ruthless guard rails. Bench tests that automatically fail if the runtime doesn't meet requirements (e.g. is O(n) instead of O(log n)) or overall memory efficiency is too low - and enforcing 100% code coverage from both unit tests AND fuzzing. Sometimes the cli agent is running for hours chasing indexes or weird bugs; the two main tasks are preventing it from giving up, and stopping it from "punting" (wait, this isn't working, let me first create a 100% correct O(n) version...) or cheating. Also reminding it to check AGAIN for slice sharing bugs which crop up a surprising % of the time.

The other "interesting" part of my workflow right now is that I have to manually shuffle a lot between "deep research" (which goes and reads all the papers and blogs about the data structure) and the cli agent which finds the practical bugs etc but often doesn't have the "firepower" to recognise when it's stuck in a local maximum or going around in circles. Have been thinking about an MCP that lets the cli agent call out to "deep research" when it gets really stuck.