Interesting experiment. Using modern LLMs to retroactively grade decade-old HN discussions is a clever way to measure how well our collective predictions age. It’s impressive how little time and compute it now takes to analyze something that would’ve required days of manual reading. My only caution is that hindsight grading can overvalue outcomes instead of reasoning — good reasoning can still lead to wrong predictions. But as a tool for calibrating forecasting and identifying real signal in discussions, this is a very cool direction.