It doesn't look like the code anonymizes usernames when sending the thread for grading. This likely biases the grades toward past/current prevailing opinions of certain users. It would be interesting to see the whole thing done again, but this time randomly re-assigning usernames to measure the bias, and also with procedurally generated pseudonyms to see whether the bias can be removed that way.
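Something like this sketch is what I have in mind for the pseudonym variant. The thread format (a list of `(username, text)` pairs) and the word lists are assumptions; the point is just that each user gets one consistent procedurally generated alias, and @-mentions inside comment bodies are rewritten too:

```python
import re
import random

# Assumed word lists for building throwaway pseudonyms.
ADJECTIVES = ["quiet", "amber", "brisk", "mellow", "vivid"]
NOUNS = ["falcon", "harbor", "thistle", "quartz", "meadow"]

def pseudonymize(thread, seed=0):
    """Replace usernames in a thread with consistent generated aliases.

    thread: list of (username, comment_text) pairs (an assumed format).
    Returns the rewritten thread plus the username -> alias mapping.
    """
    rng = random.Random(seed)
    # Collect unique usernames in order of first appearance.
    seen = dict.fromkeys(user for user, _ in thread)
    # Index suffix guarantees aliases are unique even if words collide.
    mapping = {
        user: f"{rng.choice(ADJECTIVES)}_{rng.choice(NOUNS)}_{i}"
        for i, user in enumerate(seen)
    }
    out = []
    for user, text in thread:
        # Rewrite @-mentions of any known participant in the body.
        for original, alias in mapping.items():
            text = re.sub(rf"@{re.escape(original)}\b", f"@{alias}", text)
        out.append((mapping[user], text))
    return out, mapping
```

Running the grader on the pseudonymized output (and comparing against a run with shuffled real usernames) would separate "reputation bias" from whatever the model picks up from the text itself.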
I'd expect de-biasing to deflate grades for well-known users.
It might also be interesting to use a search-grounded model that provides citations for its grading claims. Gemini models have access to this via their API, for example.