Auto-grading decade-old Hacker News discussions with hindsight

I am surprised the author thought the project passed quality control. The LLM reviews seem mostly false.

Looking at the comment reviews on the actual website, the LLM seems to have mostly judged whether it agreed with the takes, not whether they came true. It seems to have an incredibly poor grasp of its actual task: assessing whether the comments were predictive or not.

The LLM's comment reviews are often statements like "correctly characterized [programming language] as [opinion]."

This dynamic means the website mostly grades people on having the most conformist take (the take most likely to dominate the training data, and to be selected for by the LLM's RL tuning process of pleasing the average user).

Example: tptacek gets an 'A' for his comment on DF, with the LLM claiming that the user "captured DF's unforgiving nature, where 'can't do x or it crashes is just another feature to learn' which remained true until it was fixed on ..."

Link to LLM review: https://karpathy.ai/hncapsule/2015-12-02/index.html#article-....

So the LLM is praising a comment for describing DF as unforgiving, which is a characterization of the present at the time, not a statement about the future. Worse, if anything tptacek seems to imply the opposite of what happened: that x would continue to crash, when by the LLM's own account it was eventually fixed.

Here is the original comment (tptacek, Dec 2, 2015):

"If you're not the kind of person who can take flaws like crashes or game-stopping frame-rate issues and work them into your gameplay, DF is not the game for you. It isn't a friendly game. It can take hours just to figure out how to do core game tasks. "Don't do this thing that crashes the game" is just another task to learn."

Note: I am paraphrasing the LLM review, as the website is also poorly designed: you can't even select the text of the LLM review!

N.b., this choice of comment review is not overly cherry-picked. I just scanned the "best commentators" list and tptacek was number two, with this particular LLM summary, egregiously unrelated to prediction, given as the justification for his #2 rating.