AI agent benchmarks are broken

(ddkang.substack.com)
181 points | by neehao | 1 comment
ttoinou ◴[] No.44532669[source]
What makes LLMs amazing (fuzzy input, fuzzy output) is exactly why they are hard to benchmark. If they could be benchmarked easily, they wouldn't be powerful by definition. I have no idea what's going on in the minds of people benchmarking LLMs for fuzzy tasks, and in the minds of people relying on benchmarks to make decisions about LLMs, I never looked at them. People doing benchmarks have to prove what they do is useful, not us public proving them they're doing it wrong.

Of course, there are tasks for which we could benchmark them:

* arithmetic (why would you use an LLM for that?)

* correct JSON syntax, correct command lines etc.

* looking for specific information in a text

* looking for missing information in a text

* language logic (if/then/else cases where we know the answer in advance)

But by Goodhart's Law, LLMs trained to succeed on those benchmarks might lose power on the other tasks where we really need them (fuzzy inputs, fuzzy outputs).
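One of the tasks listed above, checking for correct JSON syntax, is a rare case where a crisp pass/fail benchmark really does exist. A minimal sketch of such a check (the model outputs below are hypothetical stand-ins for real completions):

```python
import json

def is_valid_json(output: str) -> bool:
    """Pass/fail check: does the model's raw output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

# Hypothetical model completions standing in for real ones.
outputs = [
    '{"name": "Ada", "age": 36}',  # valid object
    '{name: Ada}',                 # unquoted keys: invalid JSON
    '[1, 2, 3]',                   # valid array
]

# The "benchmark score" is just the fraction of outputs that pass.
score = sum(is_valid_json(o) for o in outputs) / len(outputs)
```

The point of the thread stands, though: this kind of binary check only works because the task has an unambiguous ground truth, which the genuinely useful fuzzy tasks lack.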

replies(2): >>44533120 #>>44536936 #
1. th0ma5 ◴[] No.44536936[source]
Since when do people like the fuzziness of outputs? I think you make an interesting point, but it also seems to imply that benchmarking will never truly be possible, which I think is true unless we can also make these models observable, which, as you say, gives up the mystique.