What’s clear is that AI is unreliable in general and must be tested on specific tasks. That might be human review of a single output or some kind of task-specific evaluation.
It’s bad luck for those of us who want to talk about how good or bad these models are in general. Summary statistics give us little more than a reasonable guess as to whether a new model is worth trying on a task we actually care about.
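For anyone wondering what "task-specific evaluation" can look like in practice, here's a minimal sketch: a handful of hand-written cases for the one task you care about and a pass/fail check. `call_model`, the cases, and the grading rule are all placeholders I made up for illustration, not any particular framework's API.

```python
def call_model(prompt: str) -> str:
    """Placeholder: swap in whatever model or API you're actually testing."""
    raise NotImplementedError

# Hand-written cases for the specific task you care about.
CASES = [
    {"prompt": "Extract the invoice total from: 'Total due: $412.50'", "expected": "412.50"},
    {"prompt": "Extract the invoice total from: 'Amount payable 88 USD'", "expected": "88"},
]

def grade(output: str, expected: str) -> bool:
    # Task-specific check; here, whether the expected value appears verbatim.
    return expected in output

def run_eval() -> float:
    passed = sum(grade(call_model(c["prompt"]), c["expected"]) for c in CASES)
    return passed / len(CASES)

if __name__ == "__main__":
    print(f"pass rate: {run_eval():.0%}")
```

A pass rate on a harness like this says far more about whether a new model helps on your task than any leaderboard summary statistic.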