AI agent benchmarks are broken

(ddkang.substack.com)
181 points | by neehao | 1 comment
ttoinou ◴[] No.44532669[source]
What makes LLMs amazing (fuzzy input, fuzzy output) is exactly why they are hard to benchmark. If they could be benchmarked easily, they wouldn't be powerful by definition. I have no idea what's going on in the minds of people benchmarking LLMs for fuzzy tasks, and in the minds of people relying on benchmarks to make decisions about LLMs, I never looked at them. People doing benchmarks have to prove what they do is useful, not us public proving them they're doing it wrong.

Of course, there are tasks for which we could benchmark them:

* arithmetic (why would you use an LLM for that?)

* correct JSON syntax, correct command lines etc.

* looking for specific information in a text

* looking for missing information in a text

* language logic (if/then/else cases where we know the answer in advance)

But by Goodhart's Law, LLMs trained to succeed on those benchmarks might lose power on the other tasks where we really need them (fuzzy inputs, fuzzy outputs).
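One of the tasks listed above, checking for correct JSON syntax, is a rare case where a crisp pass/fail benchmark really does exist. A minimal sketch of such a check (the model outputs below are hypothetical stand-ins for real completions):

```python
import json

def is_valid_json(output: str) -> bool:
    """Pass/fail check: does the model's raw output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

# Hypothetical model completions standing in for real ones.
outputs = [
    '{"name": "Ada", "age": 36}',  # valid object
    '{name: Ada}',                 # unquoted keys: invalid JSON
    '[1, 2, 3]',                   # valid array
]

# The "benchmark score" is just the fraction of outputs that pass.
score = sum(is_valid_json(o) for o in outputs) / len(outputs)
```

The point of the thread stands, though: this kind of binary check only works because the task has an unambiguous ground truth, which the genuinely useful fuzzy tasks lack.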

replies(2): >>44533120 #>>44536936 #
1. th0ma5 ◴[] No.44536936[source]
Since when do people like the fuzziness of outputs? I think you make an interesting point, but it also seems to imply that benchmarking will never truly be possible, which I think is true unless we can also make these models observable, which, as you say, gives up the mystique.