192 points imasl42 | 12 comments
its-kostya ◴[] No.45311805[source]
Code review is part of the job, but one of the least enjoyable parts. Developers like _writing_ and that gives the most job satisfaction. AI tools are helpful, but they inherently increase the amount of code we have to review, and that code needs more scrutiny than my colleagues' because of how unpredictable, yet convincing, it can be. Why did we create tools that do the fun part and increase the non-fun part? Where are the "code-review" agents at?
replies(9): >>45311852 #>>45311876 #>>45311926 #>>45312027 #>>45312147 #>>45312307 #>>45312348 #>>45312499 #>>45362757 #
simonw ◴[] No.45311926[source]
> Where are the "code-review" agents at?

OpenAI's Codex Cloud just added a new feature for code review, and their new GPT-5-Codex model has been specifically trained for code review: https://openai.com/index/introducing-upgrades-to-codex/

Gemini and Claude both have code review features that work via GitHub Actions: https://developers.google.com/gemini-code-assist/docs/review... and https://docs.claude.com/en/docs/claude-code/github-actions

GitHub have their own version of this pattern too: https://github.blog/changelog/2025-04-04-copilot-code-review...

There are also a whole lot of dedicated code review startups like https://coderabbit.ai/ and https://www.greptile.com/ and https://www.qodo.ai/products/qodo-merge/

replies(1): >>45311984 #
1. vrighter ◴[] No.45311984[source]
you can't use a system with the exact same hallucination problem to check the work of another one just like it. Snake oil
replies(4): >>45312016 #>>45312370 #>>45313235 #>>45319240 #
2. bcrosby95 ◴[] No.45312016[source]
I don't think it's that simple.

Fundamentally, unit tests are using the same system to write your invariants twice; it just so happens that they're different enough that a failure in one tends to reveal a bug in the other.

You can't reasonably state this won't be the case with tools built for code review until the failure cases are examined.

Furthermore a simple way to help get around this is by writing code with one product while reviewing the code with another.

replies(1): >>45312268 #
3. jmull ◴[] No.45312268[source]
> unit tests are using the same system to write your invariants twice

For unit tests, the parts of the system that are the same are not under test, while the parts that are different are under test.

The problem with using AI to review AI is that what you're checking is the same as what you're checking it with. Checking the output of one LLM with another brand probably helps, but they may also have a lot of similarities, so it's not clear how much.

replies(3): >>45313098 #>>45317952 #>>45325316 #
4. simonw ◴[] No.45312370[source]
It's snake oil that works surprisingly well.
5. Demiurge ◴[] No.45313098{3}[source]
What if you use a different AI model? Sometimes just a different seed generates a different result. I notice there is a benefit to seeing and contrasting the different answers. The improvement is gradual, it’s not a binary.
replies(1): >>45319250 #
6. ben_w ◴[] No.45313235[source]
Weirdly, you can not only do this, it somehow does actually catch some of its own mistakes.

Not all of the mistakes: they generally still have a performance ceiling below that of human experts (though even this disclaimer is a simplification), but this kind of self-critique is basically what gave the early "reasoning" models their edge over simple chat models: for the first n :END: tokens, replace the token with "wait" and watch the model attempt other solutions and pick something usually better.

replies(1): >>45315068 #
7. vrighter ◴[] No.45315068[source]
the "pick something usually better" sounds a lot like "and then draw the rest of the f*** owl"
replies(1): >>45315737 #
8. ben_w ◴[] No.45315737{3}[source]
Turned out that for a lot of things (not all things, Transformers have a lot of weaknesses), using a neural network to score an output is, if not "fine", then at least "ok".

Generating 10 options with mediocre mean and some standard deviation, and then evaluating which is best, is much easier than deliberative reasoning to just get one thing right in the first place more often.
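
A toy simulation of the generate-then-score idea above. This is only a sketch under stated assumptions: `generate` and `score` are hypothetical stand-ins (Gaussian noise, not a real model or neural scorer), and the means and noise levels are made-up numbers chosen for illustration.

```python
import random
import statistics

def generate(rng):
    """Hypothetical generator: candidate quality is noisy around a mediocre mean."""
    return rng.gauss(0.5, 0.2)

def score(quality, rng, noise=0.1):
    """Hypothetical scorer: sees the true quality plus its own noise."""
    return quality + rng.gauss(0.0, noise)

def best_of_n(n, rng):
    """Generate n candidates and return the true quality of the top-scored one."""
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=lambda q: score(q, rng))

def mean_quality(n, trials=2000, seed=0):
    """Average true quality of the picked candidate over many trials."""
    rng = random.Random(seed)
    return statistics.mean(best_of_n(n, rng) for _ in range(trials))

single = mean_quality(1)   # one shot, no selection
best10 = mean_quality(10)  # pick the top-scored of 10
print(f"single: {single:.3f}  best-of-10: {best10:.3f}")
```

Even with an imperfect scorer, the picked candidate is on average better than a single draw, as long as the scorer's noise is not much larger than the spread between candidates.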

9. bcrosby95 ◴[] No.45317952{3}[source]
The system is the human writing the code.
10. adastra22 ◴[] No.45319240[source]
Yes you can, and this shouldn't be surprising.

You can take the output of an LLM and feed it into another LLM and ask it to fact-check. Not surprisingly, these LLMs have a high false negative rate, meaning that they won't always catch the error. (I think you agree with me so far.) However, the probabilities of these LLM failures are independent of each other, so long as you don't share context. The converse is that the LLM has a less-than-we-would-like probability of detecting a hallucination, but if it does then verification of that fact is reliable in future invocations.
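
To make the independence argument concrete, here is the arithmetic as a small sketch. The 0.5 miss rate is an illustrative number, not a measurement, and independence between verifier runs is the assumption being claimed in this comment; if the runs share blind spots, the real miss probability shrinks more slowly than this.

```python
def miss_probability(per_check_miss: float, checks: int) -> float:
    """Chance that every one of `checks` independent verifiers misses an error.

    Assumes independence between checks, which is a claim, not a measured fact.
    """
    return per_check_miss ** checks

# Illustrative number only: a verifier that misses half of all errors.
for k in (1, 3, 5):
    print(k, miss_probability(0.5, k))
```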

Combine this together: you can ask an LLM to do X, for any X, then take the output and feed it into some number of validation instances to look for hallucinations, bad logic, poor understanding, whatever. What you get back on the first pass will look like a flip of the coin -- one agent claims it is a hallucination, the other agent says it is correct; both give reasons. But feed those reasons into follow-up verifier prompts, and repeat. You will find that non-hallucination responses tend to persist, while hallucinations are weeded out. The stable point is the truth.

This works. I have workflows that make use of this, so I can attest to its effectiveness. The new-ish Claude Code sub-agent capabilities and slash commands are excellent for doing this, btw.

11. adastra22 ◴[] No.45319250{4}[source]
You don't need to use a different model, generally. In my experience a fresh context window is all you need, the vast majority of the time.
12. adastra22 ◴[] No.45325316{3}[source]
> The problem with using AI to review AI is that what you're checking is the same as what you're checking it with.

This isn't true. Every instantiation of the LLM is different. Oversimplifying a little, but hallucination emerges when low-probability next words are selected. True explanations, on the other hand, act as attractors in state-space. Once stumbled upon, they are consistently preserved.

So run a bunch of LLM instances in parallel with the same prompt. The built-in randomness & temperature settings will ensure you get many different answers, some quite crazy. Evaluate them in new LLM instances with fresh context. In just 1-2 iterations you will home in on state-space attractors, which are chains of reasoning well supported by the training set.
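
A minimal sketch of the fan-out step, assuming two caller-supplied functions that are hypothetical here: `sample_fn` (one sampled answer per call) and `judge_fn` (a fresh-context evaluation of one answer). Returning the most common surviving answer is a simple tie-break added for this example, not something stated above; the stubs at the bottom only simulate varied answers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def fan_out_and_filter(prompt, sample_fn, judge_fn, n=8):
    """Sample n answers in parallel, keep those a fresh judge accepts,
    and return the most common survivor (or None if none survive)."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        # pool.map preserves input order, so answers and verdicts line up.
        answers = list(pool.map(lambda i: sample_fn(prompt, i), range(n)))
        accepted = list(pool.map(lambda a: judge_fn(prompt, a), answers))
    survivors = [a for a, ok in zip(answers, accepted) if ok]
    if not survivors:
        return None
    return Counter(survivors).most_common(1)[0][0]

# Stubs standing in for real model calls (assumptions for illustration).
def stub_sample(prompt, i):
    return "5" if i % 3 == 0 else "4"   # some samples are wrong

def stub_judge(prompt, answer):
    return answer == "4"                # the judge rejects the wrong ones

result = fan_out_and_filter("What is 2 + 2?", stub_sample, stub_judge)
print(result)
```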