
265 points ctoth | 2 comments
1. skybrian No.43745216
What’s clear is that AI is unreliable in general and must be tested on specific tasks. That might mean human review of a single output or some kind of task-specific evaluation.

It’s bad luck for those of us who want to talk about how good or bad these models are in general. Summary statistics won’t tell us much more than a reasonable guess as to whether a new model is worth trying on a task we actually care about.

replies(1): >>43745234
2. simonw No.43745234
Right: we effectively all need our own evals for the tasks that matter to us... but writing those evals remains one of the least well-documented parts of using LLMs effectively.
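
For what it's worth, the core of such an eval can be very small. Here's a minimal sketch, assuming the openai Python package and an API key in the environment; the model name, cases, and pass checks are placeholder assumptions, not anyone's recommended harness:

    # A minimal task-specific eval: run a fixed set of cases through a
    # model and score each output against a simple, task-specific check.
    # Assumes `pip install openai` and OPENAI_API_KEY set in the environment.
    from openai import OpenAI

    client = OpenAI()

    CASES = [
        # (prompt, check) -- each check encodes what "good" means for this task
        ("Extract the year from: 'Founded in 1998 in Menlo Park.'",
         lambda out: "1998" in out),
        ("Reply with only YES or NO: is 17 prime?",
         lambda out: out.strip().upper() == "YES"),
    ]

    def run_evals(model: str) -> float:
        """Return the fraction of cases the given model passes."""
        passed = 0
        for prompt, check in CASES:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            out = resp.choices[0].message.content or ""
            passed += check(out)
        return passed / len(CASES)

    # Compare candidate models on the task you actually care about:
    # print(run_evals("gpt-4o-mini"))

The check functions are the part that matters: each one pins down what counts as success for your specific task, which is exactly what no general-purpose benchmark or summary statistic can supply.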