
265 points ctoth | 2 comments
1. skybrian No.43745216
What’s clear is that AI is unreliable in general and must be tested on specific tasks. That might mean human review of a single output or some kind of task-specific evaluation.

It’s bad luck for those of us who want to talk about how good or bad these models are in general. Summary statistics won’t tell us much more than a reasonable guess as to whether a new model is worth trying on a task we actually care about.

replies(1): >>43745234
2. simonw No.43745234
Right: we effectively all need our own evals for the tasks that matter to us... but writing those evals remains one of the least well-documented parts of using LLMs effectively.
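
For what it's worth, the core of such an eval can be very small. Here's a minimal sketch, assuming the openai Python package and an API key in the environment; the model name, cases, and pass checks are placeholder assumptions, not anyone's recommended harness:

    # A minimal task-specific eval: run a fixed set of cases through a
    # model and score each output against a simple, task-specific check.
    # Assumes `pip install openai` and OPENAI_API_KEY set in the environment.
    from openai import OpenAI

    client = OpenAI()

    CASES = [
        # (prompt, check) -- each check encodes what "good" means for this task
        ("Extract the year from: 'Founded in 1998 in Menlo Park.'",
         lambda out: "1998" in out),
        ("Reply with only YES or NO: is 17 prime?",
         lambda out: out.strip().upper() == "YES"),
    ]

    def run_evals(model: str) -> float:
        """Return the fraction of cases the given model passes."""
        passed = 0
        for prompt, check in CASES:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            out = resp.choices[0].message.content or ""
            passed += check(out)
        return passed / len(CASES)

    # Compare candidate models on the task you actually care about:
    # print(run_evals("gpt-4o-mini"))

The check functions are the part that matters: each one pins down what counts as success for your specific task, which is exactly what no general-purpose benchmark or summary statistic can supply.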