
559 points Gricha | 2 comments
postalcoder No.46233143
One of my favorite personal evals for llms is testing their stability as reviewers.

The basic gist of it is to give the llm some code to review and have it assign a grade multiple times. How much variance is there in the grade?

Then, prompt the same llm to be a "critical" reviewer with the same code multiple times. How much does that average critical grade change?

A low variance of grades across many generations and a low delta between "review this code" and "review this code with a critical eye" is a major positive signal for quality.
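
A minimal sketch of what an eval along these lines could look like, assuming a 1-10 grading scale and the OpenAI Python SDK; the model name, prompts, and grade-parsing regex are illustrative placeholders, not the commenter's actual harness:

    # Stability eval sketch: grade the same code N times with a neutral review
    # prompt and N times with a "critical" prompt, each in a fresh context,
    # then compare the within-condition variance and the neutral-vs-critical delta.
    import re
    import statistics
    from openai import OpenAI

    client = OpenAI()

    NEUTRAL = "Review this code. Assign a grade and end with 'Grade: N/10'."
    CRITICAL = "Review this code with a critical eye. Assign a grade and end with 'Grade: N/10'."

    def grade_once(code: str, system_prompt: str, model: str = "gpt-5.1") -> int:
        # Each call starts a brand-new conversation, so every grade is independent.
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": code},
            ],
        )
        match = re.search(r"Grade:\s*(\d+)\s*/\s*10", resp.choices[0].message.content)
        return int(match.group(1)) if match else 0

    def stability_eval(code: str, n: int = 10) -> dict:
        neutral = [grade_once(code, NEUTRAL) for _ in range(n)]
        critical = [grade_once(code, CRITICAL) for _ in range(n)]
        return {
            "neutral_mean": statistics.mean(neutral),
            "neutral_stdev": statistics.stdev(neutral),
            "critical_mean": statistics.mean(critical),
            "critical_stdev": statistics.stdev(critical),
            "critical_delta": statistics.mean(neutral) - statistics.mean(critical),
        }

Low stdevs in both conditions plus a small critical_delta correspond to the "stable, non-obsequious reviewer" signal described above.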

I've found that gpt-5.1 produces remarkably stable evaluations, whereas Claude is all over the place. Furthermore, Claude will completely [and comically] change the tenor of its evaluation when asked to be critical, whereas gpt-5.1 stays directionally the same while tightening the screws.

You could also interpret these results as a proxy for obsequiousness.

Edit: One major part of the eval I left out is "can an llm converge on an 'A'?" Let's say the llm gives the code a 6/10 (or B-). When you implement its suggestions and then provide the improved code in a new context, does the grade go up? Furthermore, can it eventually give the improved code an A, and do so consistently?
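
A rough sketch of that convergence check, reusing grade_once() and the client from the sketch above; automating the rewrite step and treating 9+/10 as an "A" are assumptions for illustration (the commenter implements the suggestions by hand and resubmits in a fresh context):

    # Convergence sketch: grade the code, ask the model to apply its own
    # suggestions, then regrade the rewritten code in a brand-new context,
    # repeating up to `rounds` times or until it awards an "A".
    REWRITE = "Review this code, then return only an improved version that applies your suggestions."

    def converges_to_a(code: str, rounds: int = 5, model: str = "gpt-5.1") -> list[int]:
        grades = []
        for _ in range(rounds):
            grade = grade_once(code, NEUTRAL, model=model)
            grades.append(grade)
            if grade >= 9:  # treat 9+/10 as an "A"
                break
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": REWRITE + "\n\n" + code}],
            )
            code = resp.choices[0].message.content
        return grades

A stable, convergent reviewer should produce a rising list of grades that settles at the top of the scale; a noisy one keeps bouncing around even after its own suggestions have been applied.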

It's honestly impressive how good, stable, and convergent gpt-5.1 is. Claude is not great. I have yet to test it on Gemini 3.

replies(4): >>46233792 #>>46233975 #>>46234427 #>>46234966 #
1. guluarte No.46233792
My experience reviewing PRs is that sometimes it says the PR is perfect with some nitpicks, and other times that the same PR is trash and needs a lot of work.
replies(1): >>46233901 #
2. No.46233901