576 points by Gricha | 7 comments
postalcoder ◴[] No.46233143[source]
One of my favorite personal evals for llms is testing their stability as reviewers.

The basic gist of it is to give the llm some code to review and have it assign a grade multiple times. How much variance is there in the grade?

Then, prompt the same llm to be a "critical" reviewer of the same code multiple times. How much does the average grade shift once it's told to be critical?

A low variance of grades across many generations and a low delta between "review this code" and "review this code with a critical eye" are both strong positive signals for quality.
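
In sketch form, assuming a hypothetical grade_code(prompt, code) helper that wraps whatever model you're testing and parses a numeric 1-10 grade out of its reply, the eval is just:

    import statistics

    NEUTRAL = "Review this code and assign a grade from 1-10."
    CRITICAL = "Review this code with a critical eye and assign a grade from 1-10."

    def stability_eval(grade_code, code, n=10):
        # Grade the same code repeatedly under both prompts.
        neutral = [grade_code(NEUTRAL, code) for _ in range(n)]
        critical = [grade_code(CRITICAL, code) for _ in range(n)]
        return {
            "neutral_mean": statistics.mean(neutral),
            "neutral_stdev": statistics.stdev(neutral),  # low = stable reviewer
            # small delta = the "critical" framing doesn't swing the verdict
            "critical_delta": statistics.mean(critical) - statistics.mean(neutral),
        }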

I've found that gpt-5.1 produces remarkably stable evaluations whereas Claude is all over the place. Furthermore, Claude will completely [and comically] change the tenor of its evaluation when asked to be critical whereas gpt-5.1 is directionally the same while tightening the screws.

You could also interpret these results as a proxy for obsequiousness.

Edit: One major part of the eval I left out is "can an llm converge on an 'A'?" Let's say the llm gives the code a 6/10 (or B-). When you implement its suggestions and then provide the improved code in a new context, does the grade go up? Furthermore, can it eventually award an A, and do so consistently?
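
That part is just a loop: grade, apply the suggestions, re-grade in a fresh context, and check whether the grade actually climbs. A rough sketch, again with hypothetical review() and apply_suggestions() helpers standing in for the model calls:

    # review(code) -> (grade, suggestions) from the model, fresh context each call.
    # apply_suggestions(code, suggestions) -> revised code (by hand or via the model).
    def converge(review, apply_suggestions, code, target=9, max_rounds=5):
        grade = None
        for round_ in range(max_rounds):
            grade, suggestions = review(code)
            if grade >= target:              # treat 9+/10 as an "A"
                return round_, grade         # converged
            code = apply_suggestions(code, suggestions)
        return max_rounds, grade             # never got there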

It's honestly impressive how good, stable, and convergent gpt-5.1 is. Claude is not great. I have yet to test it on Gemini 3.

replies(4): >>46233792 #>>46233975 #>>46234427 #>>46234966 #
1. OsrsNeedsf2P ◴[] No.46234427[source]
How is this different from testing the temperature?
replies(2): >>46235040 #>>46238450 #
2. smt88 ◴[] No.46235040[source]
It isn't, and it reflects how deeply LLMs are misunderstood, even by technical people.
replies(3): >>46241319 #>>46241466 #>>46241736 #
3. itishappy ◴[] No.46238450[source]
How does temperature explain the variance in response to the inclusion of the word "critical"?
4. swid ◴[] No.46241319[source]
It surely is different. If you set the temp to 0 and run the test with slightly different wording, there is no guarantee at all that the scores will be consistent.

And if an LLM is consistent, even at a high temp, it could give the same PR the same grade while choosing different words to say it.

The grade tokens are still sampled from the model's output distribution, so if the model puts a high enough probability on the same grade, that grade will keep being chosen regardless of the temperature setting.
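Concretely, temperature only rescales the logits before sampling, so a grade token that already dominates stays dominant at any reasonable temperature. A toy illustration with made-up logits:

    import numpy as np

    def softmax_with_temp(logits, temp):
        z = np.array(logits, dtype=float) / temp
        z -= z.max()                      # numerical stability
        p = np.exp(z)
        return p / p.sum()

    # Made-up logits for the grade tokens "6", "7", "8": the model strongly prefers "7".
    logits = [2.0, 6.0, 1.0]
    for t in (0.5, 1.0, 1.5):
        print(t, softmax_with_temp(logits, t).round(3))
    # Even at temp 1.5, "7" keeps roughly 90% of the probability mass,
    # so repeated samples mostly return the same grade.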

replies(1): >>46243038 #
5. stevenhuang ◴[] No.46241466[source]
The irony is strong here.
6. postalcoder ◴[] No.46241736[source]
gpt-5* reasoning models do not have an adjustable temperature parameter. It seems like we may have a different understanding of these models.

And, as the other commenter said, temperature may change the next-token distribution, but the reasoning step tends to reel that variance back in, which is why reasoning models are notoriously poor at creative writing.

You are free to run these experiments for yourself. Perhaps, with your deeper understanding, you'll shed new light on this behavior.

7. smt88 ◴[] No.46243038{3}[source]
I think you're restating (in a longer and more accurate way) what I understood the original criticism to be: that this grading test isn't testing what it's supposed to, partly because a grade is too few tokens.

The model could "assess" the code qualitatively the same and still give slightly different letter grades.