
192 points by imasl42 | 3 comments
its-kostya No.45311805
Code review is part of the job, but one of the least enjoyable parts. Developers like _writing_ code, and that's what gives the most job satisfaction. AI tools are helpful, but they inherently increase the amount of code we have to review, and with more scrutiny than code from my colleagues gets, because of how unpredictable - yet convincing - they can be. Why did we create tools that do the fun part and increase the non-fun part? Where are the "code-review" agents at?
simonw No.45311926
> Where are the "code-review" agents at?

OpenAI's Codex Cloud just added a new feature for code review, and their new GPT-5-Codex model has been specifically trained for code review: https://openai.com/index/introducing-upgrades-to-codex/

Gemini and Claude both have code review features that work via GitHub Actions: https://developers.google.com/gemini-code-assist/docs/review... and https://docs.claude.com/en/docs/claude-code/github-actions

GitHub have their own version of this pattern too: https://github.blog/changelog/2025-04-04-copilot-code-review...

There are also a whole lot of dedicated code review startups like https://coderabbit.ai/ and https://www.greptile.com/ and https://www.qodo.ai/products/qodo-merge/
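
The underlying pattern is simple enough to sketch: collect a diff, hand it to a model with review instructions, and surface the comments. Here's a minimal local version of that loop, assuming the openai Python SDK and an illustrative model name (the hosted products run this as a GitHub Action on each pull request instead):

    import subprocess
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # The diff of the current branch against main is the review input.
    diff = subprocess.run(
        ["git", "diff", "main...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout

    response = client.chat.completions.create(
        model="gpt-5-codex",  # illustrative; substitute whichever model you use
        messages=[
            {"role": "system", "content": (
                "You are a code reviewer. Flag bugs, risky changes, and unclear "
                "code in this diff, citing file and hunk. If you find nothing, "
                "say 'looks fine'; do not invent issues."
            )},
            {"role": "user", "content": diff},
        ],
    )
    print(response.choices[0].message.content)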

vrighter No.45311984
You can't use a system with the exact same hallucination problem to check the work of another one just like it. Snake oil.
ben_w No.45313235
Weirdly, not only can you do this, it actually does catch some of its own mistakes.

It doesn't catch all of them; these models generally still have a performance ceiling below human experts (though even that disclaimer is a simplification). But this kind of self-critique is basically what gave the early "reasoning" models their edge over plain chat models: for the first n times the model emits its end-of-thinking token, replace that token with "wait", and watch it attempt other solutions and usually settle on something better.
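
A minimal sketch of that trick, sometimes called budget forcing. A scripted FakeModel stands in for the real thing; with an actual LLM, generate_until_stop would be a decoding call that streams tokens until the end-of-thought marker. The marker and the canned "reasoning" are illustrative:

    END_OF_THOUGHT = "</think>"  # illustrative; the real marker varies by model

    class FakeModel:
        """Scripted stand-in: each chunk ends where the model would emit END_OF_THOUGHT."""

        def __init__(self):
            self.chunks = iter([
                "x = 7 looks right... done.",
                ", hold on: 7 fails the second check. Try x = 5.",
                ", verified: x = 5 passes both checks. Final answer: 5.",
            ])

        def generate_until_stop(self, transcript: str) -> str:
            return next(self.chunks)

    def reason_with_budget_forcing(model, prompt: str, forced_continuations: int = 2) -> str:
        transcript = prompt
        for _ in range(forced_continuations):
            transcript += model.generate_until_stop(transcript)
            # The model tried to stop thinking here; splice in "Wait" instead
            # of END_OF_THOUGHT so it re-reads and re-checks its own work.
            transcript += " Wait"
        return transcript + model.generate_until_stop(transcript) + END_OF_THOUGHT

    print(reason_with_budget_forcing(FakeModel(), "<think>"))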

vrighter No.45315068
the "pick something usually better" sounds a lot like "and then draw the rest of the f*** owl"
ben_w No.45315737
It turned out that for a lot of things (not all things; Transformers have plenty of weaknesses), using a neural network to score an output is, if not "fine", then at least "okay".

Generating ten candidates whose quality has a mediocre mean but some spread, then evaluating which one is best, is much easier than reasoning deliberately enough to get a single answer right the first time.
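
A toy numeric sketch of why that works (the distributions and noise levels are assumptions for illustration): draw n candidate qualities from a distribution with a mediocre mean, let an imperfect scorer pick one, and the expected quality of the pick still lands well above the mean.

    import random

    def candidate_quality() -> float:
        # One generation: true quality drawn from a mediocre-mean distribution.
        return random.gauss(0.0, 1.0)

    def noisy_score(true_quality: float) -> float:
        # An imperfect evaluator: correlated with true quality, but noisy.
        return true_quality + random.gauss(0.0, 0.5)

    def best_of_n_quality(n: int = 10) -> float:
        qualities = [candidate_quality() for _ in range(n)]
        # Select with the NOISY scorer, but report the TRUE quality we got.
        return max(qualities, key=noisy_score)

    trials = 10_000
    single = sum(candidate_quality() for _ in range(trials)) / trials
    best10 = sum(best_of_n_quality(10) for _ in range(trials)) / trials
    print(f"single sample: {single:+.2f}")   # about 0.0
    print(f"best of 10:    {best10:+.2f}")   # roughly +1.4 at this noise level

The scorer only has to rank candidates better than chance for the selection to beat a single draw; it never has to produce a correct answer itself, which is the asymmetry ben_w is pointing at.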