
615 points | __rito__ | 1 comment

Related from yesterday: Show HN: Gemini Pro 3 imagines the HN front page 10 years from now - https://news.ycombinator.com/item?id=46205632
LeroyRaz No.46223959
I am surprised the author thought the project passed quality control. The LLM reviews seem mostly false.

Looking at the comment reviews on the actual website, the LLM seems to have mostly judged whether it agreed with the takes, not whether they came true. It appears to have an incredibly poor grasp of its actual task of assessing whether the comments were predictive or not.

The LLM's comment reviews are often statements like "correctly characterized [programming language] as [opinion]."

This dynamic means the website mostly grades people on having the most conformist take (the take most likely to dominate the training data and to be selected for in the LLM's RL tuning toward pleasing the average user).

replies(3): >>46224135 >>46224138 >>46224958
andy99 No.46224958
I haven’t looked at the output yet, but came here to say: LLM grading is crap. They miss things, ignore instructions, bring in their own views, have no calibration, and in general are extremely poorly suited to this task. “Good” LLM-as-a-judge products (and none are great) use LLMs to make binary decisions - “do these atomic facts match, yes/no” type stuff - and aggregate them into a score.
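
A minimal sketch of that pattern, under the assumption it works roughly as described above: decompose a prediction into atomic facts, get one yes/no verdict per fact, then aggregate. All names here (AtomicCheck, judge_fact, score_prediction) are hypothetical, and the per-fact LLM call is stubbed out with a naive keyword match so the sketch runs standalone.

    # Hypothetical sketch of "binary atomic checks, then aggregate" grading.
    from dataclasses import dataclass

    @dataclass
    class AtomicCheck:
        claim: str     # one atomic fact extracted from the original prediction
        evidence: str  # what actually happened, per some ground-truth source

    def judge_fact(check: AtomicCheck) -> bool:
        # In a real system this would be a single yes/no LLM call, e.g.
        # "Does the evidence confirm the claim? Answer yes or no."
        # Stubbed with substring matching so the example is self-contained.
        return check.claim.lower() in check.evidence.lower()

    def score_prediction(checks: list[AtomicCheck]) -> float:
        # Aggregate many binary verdicts into one score in [0, 1].
        if not checks:
            return 0.0
        return sum(judge_fact(c) for c in checks) / len(checks)

    checks = [
        AtomicCheck("Rust adoption grew", "Surveys show Rust adoption grew steadily."),
        AtomicCheck("Perl 6 dominated web dev", "Perl 6 was renamed Raku and stayed niche."),
    ]
    print(score_prediction(checks))  # 0.5

The point of the decomposition is that each individual judgment stays narrow enough for the model to answer reliably, and the calibration comes from the aggregation step rather than from asking the LLM for a holistic grade.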

I understand this is just a fun exercise, so it’s basically what LLMs are good at - generating plausible-sounding stuff without regard for correctness. I would not extrapolate this to their utility on real evaluation tasks.