
559 points by Gricha | 2 comments
postalcoder
One of my favorite personal evals for llms is testing their stability as reviewers.

The basic gist of it is to give the llm some code to review and have it assign a grade multiple times. How much variance is there in the grade?

Then prompt the same llm to be a "critical" reviewer of the same code, again multiple times. How much does the average grade shift under the critical prompt?

Low variance in grades across many generations and a low delta between "review this code" and "review this code with a critical eye" are major positive signals for quality.
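Roughly, in code. This is a minimal sketch, and everything model-specific is hypothetical: `complete` stands in for whatever function sends a prompt to your llm and returns its reply, and the regex assumes you've told the model to end with a parseable grade.

    import re
    import statistics

    def grade(code: str, critical: bool, complete) -> float:
        """Ask the model for a review ending in a numeric grade.
        `complete` is the hypothetical prompt -> reply function."""
        instruction = (
            "Review this code with a critical eye" if critical
            else "Review this code"
        )
        reply = complete(f"{instruction}. End with 'Grade: N/10'.\n\n{code}")
        match = re.search(r"Grade:\s*(\d+(?:\.\d+)?)\s*/\s*10", reply)
        return float(match.group(1))

    def stability_eval(code: str, complete, n: int = 10) -> dict:
        """Grade the same code n times per prompt style and compare."""
        neutral = [grade(code, critical=False, complete=complete) for _ in range(n)]
        critical = [grade(code, critical=True, complete=complete) for _ in range(n)]
        return {
            "neutral_mean": statistics.mean(neutral),
            "neutral_variance": statistics.pvariance(neutral),  # stability
            "critical_delta": statistics.mean(neutral) - statistics.mean(critical),
        }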

I've found that gpt-5.1 produces remarkably stable evaluations, whereas Claude is all over the place. Furthermore, Claude will completely [and comically] change the tenor of its evaluation when asked to be critical, whereas gpt-5.1 stays directionally the same while tightening the screws.

You could also read these results as a proxy for obsequiousness.

Edit: One major part of the eval I left out is "can an llm converge on an 'A'?" Let's say the llm gives the code a 6/10 (or B-). When you implement its suggestions and then provide the improved code in a new context, does the grade go up? Furthermore, can it eventually give itself an A, and do so consistently?
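The convergence part could be sketched as a loop like the one below, reusing the hypothetical `grade`/`complete` helpers from the sketch above. Here the model rewrites the code per its own review, as a stand-in for implementing the suggestions by hand.

    def convergence_eval(code: str, complete, target: float = 9.0,
                         max_rounds: int = 5) -> list[float]:
        """Grade, apply the model's own suggestions, re-grade in a
        fresh context; return the grade trajectory."""
        history = [grade(code, critical=False, complete=complete)]
        for _ in range(max_rounds):
            if history[-1] >= target:
                break  # converged on an 'A'
            code = complete(
                "Rewrite this code, applying every improvement you would "
                f"suggest in a review. Return only the code.\n\n{code}"
            )
            history.append(grade(code, critical=False, complete=complete))
        return history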

It's honestly impressive how good, stable, and convergent gpt-5.1 is. Claude is not great. I have yet to test it on Gemini 3.

OsrsNeedsf2P
How is this different from testing the temperature?
smt88
It isn't, and it reflects how deeply LLMs are misunderstood, even by technical people.
swid
It surely is different. If you set the temp to 0 and run the test with slightly different wording, there is no guarantee at all that the scores will be consistent.

And an LLM that is consistent even at a high temp can give the same PR the same grade while wording the review differently each time.

Tokens are still sampled from the model's distribution; temperature only reshapes that distribution, so if the same grade carries enough probability mass, it will keep being chosen regardless of the temp you set.
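A quick numeric illustration of that (made-up logits, nothing model-specific): temperature rescales logits before the softmax, so it flattens or sharpens the distribution over grade tokens but never reorders it, and a sufficiently dominant grade stays dominant.

    import math

    def softmax_at_temp(logits, temp):
        """Softmax over logits scaled by 1/temp (numerically stabilized)."""
        scaled = [x / temp for x in logits]
        m = max(scaled)
        exps = [math.exp(x - m) for x in scaled]
        total = sum(exps)
        return [e / total for e in exps]

    grades = ["5", "6", "7", "8", "9"]   # candidate grade tokens
    logits = [1.0, 4.0, 2.0, 0.5, 0.1]   # hypothetical: "6" strongly preferred

    for temp in (0.2, 0.7, 1.5):
        probs = softmax_at_temp(logits, temp)
        top = grades[probs.index(max(probs))]
        print(f"temp={temp}: top grade {top}, p={max(probs):.2f}")
    # temp=0.2: top grade 6, p=1.00
    # temp=0.7: top grade 6, p=0.92
    # temp=1.5: top grade 6, p=0.64
    # The mode never changes; a higher temp only makes sampling a different
    # grade more likely, which is why stability across samples is informative.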

smt88
I think you're restating (in a longer, more accurate way) what I understood the original criticism to be: that this grading test isn't testing what it's supposed to, partly because a grade is too few tokens.

The model could "assess" the code qualitatively the same and still give slightly different letter grades.