The Ironies of Automation paper is something I mention a lot; its core thesis is that making humans review / rubber-stamp automation reduces the quality of their work. People just aren't wired to do boring stuff well.
OpenAI's Codex Cloud just added a new feature for code review, and their new GPT-5-Codex model has been specifically trained for code review: https://openai.com/index/introducing-upgrades-to-codex/
Gemini and Claude both have code review features that work via GitHub Actions: https://developers.google.com/gemini-code-assist/docs/review... and https://docs.claude.com/en/docs/claude-code/github-actions
GitHub have their own version of this pattern too: https://github.blog/changelog/2025-04-04-copilot-code-review...
There are also a whole lot of dedicated code review startups like https://coderabbit.ai/ and https://www.greptile.com/ and https://www.qodo.ai/products/qodo-merge/
The ability to ignore AI and focus on solving the problems has little to do with "fun". If anything it leaves a human-auditable trail to review later and hold accountable devs who have gone off the rails and routinely ignored the sometimes genuinely good advice that comes out of AI.
If humans don't have to helicopter over developers, that's a much bigger productivity boost than letting AI take the wheel. This is a nuance missed by almost everyone who doesn't write code or care about its quality.
Fundamentally, unit tests are using the same system to write your invariants twice; it just so happens that the two are different enough that a failure in one tends to reveal a bug in the other.
You can't reasonably state this won't be the case with tools built for code review until the failure cases are examined.
Furthermore, a simple way to help get around this is to write code with one product while reviewing it with another.
Is it possible that this is just the majority, and there are plenty of folks who dislike actually starting from nothing and the endless iteration to make something that works, as opposed to having some sort of good/bad baseline to just improve upon?
I've seen plenty of people who are okay with picking up a codebase someone else wrote and working with the patterns and architecture in there, BUT when they need to either create new mechanisms in it or start an entirely new project/repo, it's like they hit a wall - part of it probably being friction, part not being familiar with it, plus other reasons.
> Why did we create tools that do the fun part and increase the non-fun part? Where are the "code-review" agents at?
Presumably because that's where the most perceived productivity gain is. As for code review, there's CodeRabbit, I think GitLab has their own thing (Duo), and more options are popping up. Conceptually, there's nothing preventing you from feeding a Git diff into RooCode and letting it review stuff, alongside reading whatever surrounding files it needs.
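To make that concrete, here is a rough sketch of the "feed a Git diff to a model and ask for a review" idea in Python; it is not how RooCode or any of those products actually work, and the model name is just a placeholder for whatever you use:

    # review_diff.py -- minimal sketch: pipe a git diff into an LLM and print its review.
    # Assumes the official openai package; the model name is a placeholder.
    import subprocess
    from openai import OpenAI

    def review_diff(base: str = "main") -> str:
        diff = subprocess.run(
            ["git", "diff", base],          # diff of the working tree against a base branch
            capture_output=True, text=True, check=True,
        ).stdout
        if not diff.strip():
            return "Nothing to review."
        client = OpenAI()                   # reads OPENAI_API_KEY from the environment
        resp = client.chat.completions.create(
            model="gpt-4o-mini",            # placeholder; use whatever model you prefer
            messages=[
                {"role": "system", "content": "You are a strict code reviewer. "
                 "Point out bugs, missing tests, and risky changes. Be concise."},
                {"role": "user", "content": f"Review this diff:\n\n{diff}"},
            ],
        )
        return resp.choices[0].message.content

    if __name__ == "__main__":
        print(review_diff())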
At least for me, what gives the most satisfaction (even though this kind of satisfaction happens very rarely) is discovering some very elegant structure behind whatever has to be implemented that changes the whole way I've thought about programming (or often even about life) for decades.
Serious question: why not?
IMO it should be.
If "progress" is making us all more miserable, then what's the point? Shouldn't progress make us happier?
It feels like the endgame of AI is that the masses slave away for the profit of a few tech overlords.
For unit tests, the parts of the system that are the same are not under test, while the parts that are different are under test.
The problem with using AI to review AI is that what you're checking is the same as what you're checking it with. Checking the output of one LLM with another brand probably helps, but they may also have a lot of similarities, so it's not clear how much.
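A toy example of the unit-test point: the implementation and the test below come from the same "system" (one programmer), but the invariant is expressed in two different forms, so a mistake in one is unlikely to be mirrored exactly in the other (hypothetical code, purely for illustration):

    # The implementation states the invariant procedurally...
    def median(xs: list[float]) -> float:
        s = sorted(xs)
        n = len(s)
        mid = n // 2
        return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

    # ...while the test states it as concrete, hand-computed expected values.
    # Same author, but a different encoding of the same invariant: an off-by-one
    # in the implementation won't silently match the expected answers.
    def test_median():
        assert median([3, 1, 2]) == 2
        assert median([4, 1, 3, 2]) == 2.5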
Senior developers love removing code.
Code review is probably my favorite part of the job, when there isn’t a deadline bearing down on me for my own tasks.
So I don’t really agree with your framing. Code reviews are very fun.
When I use an LLM to code I feel like I can go from idea to something I can work with in much less time than I would have normally.
Our codebase is more type-safe, better documented, and it's much easier to refactor messy code into the intended architecture.
Maybe I just have lower expectations of what these things can do but I don't expect it to problem solve. I expect it to be decent at gathering relevant context for me, at taking existing patterns and re-applying them to a different situation, and at letting me talk shit to it while I figure out what actually needs to be done.
I especially expect it to allow me to be lazy and not have to manually type out all of that code across different files when it can just generate it in a few seconds and I can review each change as it happens.
Not me. I enjoy figuring out the requirements, the high-level design, and the clever approach that will yield high performance, or reuse of existing libraries, or whatever it is that will make it an elegant solution.
Once I've figured all that out, the actual process of writing code is a total slog. Tracking variables, remembering syntax, trying to think through every edge case, avoiding off-by-one errors. I've gone from being an architect (fun) to slapping bricks together with mortar (boring).
I'm infinitely happier if all that can be done for me, everything is broken out into testable units, the code looks plausibly correct, and the unit tests for each function cover all cases and are demonstrably correct.
Then, after going back and forth between thinking about it and trying to build it a few times, you eventually discover the real solution.
Or at least that's how it's worked for me for a few decades, everyone might be different.
That's why you have short functions, so you don't have to track that many variables. And use symbol completion (standard in many editors).
> trying to think through every edge case, avoiding off-by-one errors.
That is designing, not coding. Sometimes I think of an edge case, but I'm already on a task that I'd like to finish, so I just add a TODO comment. Then, at least before I submit the PR, I ripgrep the project for this keyword and others.
Sometimes the best design is done by doing. The tradeoffs become clearer when you have to actually code the solution (too much abstraction, too verbose, unwieldy, ...) instead of relying on your mind alone (where everything seems simpler).
Not all of the mistakes; they generally still have a performance ceiling below human experts (though even this disclaimer is still a simplification). But this kind of self-critique is basically what gives the early "reasoning" models a leg up over simple chat models: for the first n end-of-thinking (:END:) tokens, replace them with "wait" and watch the model attempt other solutions and usually pick something better.
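As a rough sketch of that trick (sometimes called budget forcing), with generate() as a stand-in for whatever completion API or local model you have, not a real client library:

    # Sketch of "suppress the end-of-thinking marker and force another attempt".
    # `generate` is a placeholder you would wire up to your own model; not a real API.
    def generate(prompt: str, stop: str) -> str:
        """Return a completion of `prompt`, stopping before `stop` is emitted."""
        raise NotImplementedError("wire this up to your own model")

    END = "</think>"  # whatever delimiter your model uses to end its reasoning

    def think_longer(prompt: str, n_extensions: int = 2) -> str:
        text = prompt
        for _ in range(n_extensions):
            text += generate(text, stop=END)
            text += "\nWait,"          # swallow the end marker and force another attempt
        return text + generate(text, stop=END) + END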
I'm using it to get faster at building my own understanding of the problem, what needs to get done, and then just executing the rote steps I've already figured out.
Sometimes I get lucky and the feature is well defined enough just from the context-gathering step that the implementation is literally just me hitting the enter key as I read the edits it wants to make.
Sometimes I have to interrupt it and guide it a bit more as it works.
Sometimes I realize I misunderstood something as it's thinking about what it needs to do.
One-shotting or asking the LLM to think for you is the worst way to use them.
And no, off-by-one errors and edge cases are firmly part of coding, once you're writing code inside of a function. Edge cases are not "todos"; they're about correctly handling all possible states.
> Sometimes the best design is done by doing.
I mean, sure go ahead and prototype, rewrite, etc. That doesn't change anything. You can have the AI do that for you too, and then you can re-evaluate and re-design. The point is, I want to be doing that evaluation and re-designing. Not typing all the code and keeping track of loop states and variable conditions and index variables and exit conditions. That stuff is boring as hell, and I've written more than enough to last a lifetime already.
Aka the scope. And the namespace of whatever you want to access. Which is a design problem.
> And it's not about symbol completion, it's about remembering all the obscure differences in built-in function names and which does what
That's what references are for. And some IDEs bring them up right alongside the editor. If not, you have online and offline references. You remember them through usage and semantics.
> And no, off-by-one errors and edge cases are firmly part of coding, once you're writing code inside of a function.
It's not. You define the happy path and error cases as part of the specs. But specs are generally lacking in precision (full of ambiguities) and only care about the essential complexity. The accidental complexity comes with the platform and is also part of the design. Pushing those kinds of errors off as "just coding" is shortsighted.
> Not typing all the code and keeping track of loop states and variable conditions and index variables and exit conditions. That stuff is boring as hell, and I've written more than enough to last a lifetime already
That is like saying "Not typing all the text and keeping track of words and punctuation and paragraphs and signatures. English is boring as hell and I've written more than enough..."
If you don't like formality, say so. I've never had anyone describe coding as you did. No one thinks about that stuff so closely. It's like a guitar player complaining about which strings to strike with a finger, or a race driver complaining about the angle of the steering wheel and having to press the brake.
Generating 10 options with a mediocre mean and some standard deviation, and then evaluating which one is best, is much easier than reasoning deliberately enough to get a single answer right on the first try.
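Something like this, as a minimal sketch; the openai calls are real, but the model name, the prompts, and the naive answer parsing are placeholders:

    # Sketch of "sample N candidates, then have a fresh instance pick the best one".
    from openai import OpenAI

    client = OpenAI()

    def best_of_n(task: str, n: int = 10, model: str = "gpt-4o-mini") -> str:
        candidates = []
        for _ in range(n):
            resp = client.chat.completions.create(
                model=model,
                temperature=1.0,   # deliberately high so the options actually differ
                messages=[{"role": "user", "content": task}],
            )
            candidates.append(resp.choices[0].message.content)

        numbered = "\n\n".join(f"[{i}]\n{c}" for i, c in enumerate(candidates))
        judge = client.chat.completions.create(
            model=model,
            temperature=0.0,       # judging wants determinism, not creativity
            messages=[{"role": "user", "content":
                       f"Task: {task}\n\nHere are {n} candidate answers:\n{numbered}\n\n"
                       "Reply with only the number of the best one."}],
        )
        # Naive parse; a real workflow would validate the judge's reply.
        return candidates[int(judge.choices[0].message.content.strip().strip("[]"))]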
if the act of writing code is something you consider a burden rather than a joy then my friend you are in the wrong profession
The simple fact is that I find there's very little creative satisfaction to be found in writing most functions. Once you've done it 10,000 times, it's not exactly fun anymore, I mean unless you're working on some cutting-edge algorithm which is not what we're doing 99.9% of the time.
The creative part moves to the higher level of design, where it's no longer rote. This is the whole reason people move up into architecture roles, designing systems and libraries and APIs instead of writing lines of code.
The analogies with guitar players or race car drivers or writers are flawed, because nothing they do is rote. Every note matters, every turn, every phrase. They're about creativity and/or split-second decision making.
But when you're writing code, that's just not the case. For anything that's a 10- or 20- line function, there isn't usually much creativity there, 99.99% of the time. You're just translating an idea into code in a straightforward way.
So when you say "Developers like _writing_ and that gives the most job satisfaction," that's just not true. Especially not for many experienced devs. Developers like thinking, in my experience. They like designing, the creative part. Not the writing part. The writing is just the means to the end.
You can take the output of an LLM and feed it into another LLM and ask it to fact-check. Not surprisingly, these LLMs have a high false negative rate, meaning they won't always catch the error. (I think you agree with me so far.) However, the probabilities of these LLM failures are independent of each other, so long as you don't share context. The converse is that the LLM has a lower-than-we-would-like probability of detecting a hallucination, but if it does, then verification of that fact is reliable in future invocations.
Combine this together: you can ask an LLM to do X, for any X, then take the output and feed it into some number of validation instances to look for hallucinations, bad logic, poor understanding, whatever. What you get back on the first pass will look like a flip of the coin -- one agent claims it is hallucination, the other agent says it is correct; both give reasons. But feed those reasons into follow-up verifier prompts, and repeat. You will find that non-hallucination responses tend to persist, while hallucinations are weeded out. The stable point is the truth.
This works. I have workflows that make use of this, so I can attest to its effectiveness. The new-ish Claude Code sub-agent capabilities and slash commands are excellent for doing this, btw.
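The actual workflow above uses Claude Code sub-agents and slash commands; what follows is only a vendor-neutral sketch of the same iterate-until-stable idea, with ask() as a placeholder for a model call that shares no context between invocations:

    # Sketch of "verify in fresh contexts, feed the objections back, repeat until stable".
    # `ask` is a placeholder for a model call with NO shared context between invocations.
    def ask(prompt: str) -> str:
        raise NotImplementedError("wire this up to your model of choice")

    def verified_answer(task: str, n_verifiers: int = 3, max_rounds: int = 3) -> str:
        answer = ask(task)
        for _ in range(max_rounds):
            critiques = [
                ask(f"Task: {task}\n\nProposed answer:\n{answer}\n\n"
                    "Independently check this for hallucinations, bad logic, or "
                    "misunderstandings. List concrete problems, or say LGTM.")
                for _ in range(n_verifiers)
            ]
            if all("LGTM" in c for c in critiques):
                return answer      # non-hallucinated answers tend to persist
            # Otherwise revise against the objections and go around again.
            answer = ask(f"Task: {task}\n\nPrevious answer:\n{answer}\n\n"
                         "Reviewer objections:\n" + "\n---\n".join(critiques) +
                         "\n\nProduce a corrected answer.")
        return answer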
I care deeply about the code quality that goes into the projects I work on because I end up having to maintain it, review it, or fix it when it goes south, and honestly it just feels wrong to me to see bad code.
But literally typing out the characters that make up the code? I couldn't care less. I've done that already. I can do it in my sleep; there's no challenge.
At this stage in my career I'm looking for ways to take the experience I have and upskill my teams using it.
I'd be crazy not to try and leverage LLMs as much as possible. That includes spending the time to write good CLAUDE.md files, set up custom agents that work with our codebase and patterns, it also includes taking the time to explain the why behind those choices to the team so they understand them, calling out bad PRs that "work" but are AI slop and teaching them how to get better results out of these things.
Idk man the profession is pretty big and creating software is still just as fun as when I was doing it character by character in notepad. I just don't care to type more than I need to when I can focus on problem solving and building.
A number of years ago, I wrote a caching/lookup library that is probably some of the favorite code I've ever created.
After the initial configuration, the use was elegant and there was really no reason not to use it if you needed to query anything that could be cached on the server side. Super easy to wrap just about any code with it as long as the response is serializable.
CachingCore.Instance.Get(key, cacheDuration, () => { /* expensive lookup code here */ });
Under the hood, it would check the preferred caching solution (e.g., Redis/Memcache/etc), followed by less preferred options if the preferred wasn't available, followed by the expensive lookup if it wasn't found anywhere. Defaulted to in-memory if nothing else was available.
If the data was returned from cache, it would then compare the expiration to the specified duration... If it was getting close to various configurable tolerances, it would start a new lookup in the background and update the cache (some of our lookups could take several minutes*, others just a handful of seconds).
The hardest part was making sure we didn't cause a thundering-herd-type problem by looking the same thing up multiple times... in-memory flags indicating lookups in progress, so we could hold up other requests when a lookup fell through to the expensive path and then let them know once the data was available. While not the absolute worst-case scenario, you might end up making the expensive lookup once from each of the servers that use it if the shared cache isn't available.
* most of these have a separate service running on a schedule to pre-cache the data, but things have a backup with this method.
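The original was a .NET library (hence the CachingCore.Instance.Get call above). Purely to illustrate the pattern described here, a much-simplified, single-process Python sketch of the refresh-ahead and in-flight de-duplication parts; it skips the Redis/Memcache tier fallback, and every name in it is made up:

    # Simplified sketch: serve from the in-memory store, refresh in the background when a
    # hit is close to expiring, and de-duplicate concurrent lookups for the same key.
    import threading, time
    from typing import Any, Callable

    class CacheSketch:
        def __init__(self, refresh_margin: float = 30.0):
            self._store: dict[str, tuple[Any, float]] = {}    # key -> (value, expires_at)
            self._in_flight: dict[str, threading.Event] = {}  # key -> "a lookup is running"
            self._lock = threading.Lock()
            self._refresh_margin = refresh_margin

        def get(self, key: str, ttl: float, lookup: Callable[[], Any]) -> Any:
            now = time.time()
            hit = self._store.get(key)
            if hit is not None and hit[1] > now:
                value, expires_at = hit
                # Refresh-ahead: if the entry is close to expiring, rebuild it in the
                # background so callers keep getting the (slightly stale) cached value.
                if expires_at - now < self._refresh_margin:
                    threading.Thread(
                        target=self._fill, args=(key, ttl, lookup), daemon=True
                    ).start()
                return value
            return self._fill(key, ttl, lookup)

        def _fill(self, key: str, ttl: float, lookup: Callable[[], Any]) -> Any:
            with self._lock:
                event = self._in_flight.get(key)
                leader = event is None
                if leader:
                    event = threading.Event()
                    self._in_flight[key] = event
            if not leader:
                event.wait()                        # someone else is already doing the lookup
                hit = self._store.get(key)
                return hit[0] if hit else lookup()  # the other lookup failed: do it ourselves
            try:
                value = lookup()                    # the expensive part
                self._store[key] = (value, time.time() + ttl)
                return value
            finally:
                with self._lock:
                    self._in_flight.pop(key, None)
                event.set()                         # wake up anyone who piled up behind us

The call shape then mirrors the one-liner above: cache.get(key, 300, lambda: expensive_lookup()).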
This isn't true. Every instantiation of the LLM is different. Oversimplifying a little, but hallucination emerges when low-probability next words are selected. True explanations, on the other hand, act as attractors in state-space. Once stumbled upon, they are consistently preserved.
So run a bunch of LLM instances in parallel with the same prompt. The built-in randomness and temperature settings will ensure you get many different answers, some quite crazy. Evaluate them in new LLM instances with fresh context. In just 1-2 iterations you will home in on state-space attractors, which are chains of reasoning well supported by the training set.
An LLM can do it in two minutes while I fetch coffee, then I can proceed to add the complex bits (if there are any)
For me, it's exactly the opposite:
I love to build things from "nothing" (if I had the opportunity, I would even like to write my own kernel in a novel programming language developed by me :-) ).
On the other hand, when I pick up someone else's codebase, I nearly always (unless it was written by some insanely smart programmer) immediately find it badly written. In nearly all cases I tend to be right in my judgement (my boss agrees), but I am very sensitive to bad code, and I often ask myself how the programmer who wrote the original code has not yet committed seppuku, considering how much of a shame the code is.
Thus: you can in my opinion only enjoy picking up a codebase someone else wrote if you are incredibly tolerant of bad code.
The creativity in implementing (e.g. an indexed array that, when it grows too large, gets reformatted into a less performant hashmap) is what I imagine being lost, and it's what brings people satisfaction. Pulling that off in a clean and not overly complex way... well, there is a certain reward in that. I don't have any long-term proof, but I also hypothesize it helps with maintainability.
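One possible reading of that example, as a toy sketch (the threshold and names are made up): a container that scans a flat list of pairs while it's small and switches representation to a hashmap once it grows past a threshold:

    class SmallMap:
        """Linear-scan list of (key, value) pairs while small; a real dict once it grows."""
        THRESHOLD = 16  # made-up cutoff for switching representations

        def __init__(self):
            self._pairs: list[tuple[object, object]] | None = []  # small representation
            self._dict: dict | None = None                        # large representation

        def __setitem__(self, key, value):
            if self._dict is not None:
                self._dict[key] = value
                return
            for i, (k, _) in enumerate(self._pairs):
                if k == key:
                    self._pairs[i] = (key, value)
                    return
            self._pairs.append((key, value))
            if len(self._pairs) > self.THRESHOLD:   # grew too large: re-shape into a dict
                self._dict = dict(self._pairs)
                self._pairs = None

        def __getitem__(self, key):
            if self._dict is not None:
                return self._dict[key]
            for k, v in self._pairs:
                if k == key:
                    return v
            raise KeyError(key)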
But I also see your point, sometimes I need a tool that does a function and I don't care to write it and giving the agent requirements and having it implemented is enough. But typically these tools are used and discarded.
The way I see it these tools allow me to use my actual brainpower mostly on those problems. Because all the rote work can now be workably augmented away, I can choose which problems to actually focus on "by hand" as it were. I'd never give those problems to an LLM to solve. I might however ask it to search the web for papers or articles or what have you that have solved similar problems and go from there.
If someone is giving that up then I'd question why they're doing that... No one is forcing them to.
It's the problem solving itself that is fun, the "layer" that it's in doesn't really make a difference to me.
I therefore think it makes the most sense to just feed it requirements and issues, and tell it to provide a solution.
Also, unless you're starting a new project or a big feature with a lot of boilerplate, in my experience it's almost never necessary to create a lot of files with a lot of text in them at once.
i don't disagree with you but if "adding one more CRUD endpoint" and similar rote tasks represent any significant amount of your engineering hours, especially in the context of business impact, then something is fundamentally broken in your team, engineering org, or company overall
time spent typing code into an editor is usually (hopefully!) approximately 0% of overall engineering time