
AI agent benchmarks are broken

(ddkang.substack.com)
181 points by neehao | 2 comments
jerf ◴[] No.44532037[source]
When I was being a bad HN reader and just reacting to the title, my initial impulse was to be placating and observe that they are probably just immature. After all, for all that has happened, this is still only a couple of years' worth of development, and it does tend to take a long time to develop good benchmarks.

However, the article does seem to be pointing out some fundamental issues. I'm particularly annoyed by using LLMs to evaluate the output of LLMs. Anyone with enough experience to be writing benchmarks of this sort in the first place ought to know that's a no-go. It isn't even just using "AI to evaluate AI" per se: using a judge with the same architecture as the thing being judged maximizes the probability that the benchmark fails to be valid, because the judge has the exact same blind spots as the thing under test. Since we currently lack a diversity of AI architectures that can play on the same level as LLMs, the only other known intelligence architecture, human brains, simply has to be in the loop for now, however many other difficulties that introduces into the testing procedures.

Tests that a "do nothing" AI can pass aren't intrinsically invalid, but they should make up only a very small fraction of the tests. I'd go with a low-single-digit percentage, not 38%. But I would say it should be above zero; we do want to test for the AI being excessively biased toward "doing something", which is a valid failure state.
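A cheap sanity check for this is to run a literal do-nothing agent through the same grader and see how much of the benchmark it "solves". A minimal sketch, assuming a made-up task/grader interface rather than any real harness:

```python
# Minimal sketch of a benchmark sanity check: run a trivial "do nothing"
# agent against every task and flag the benchmark if too many tasks pass.
# The Task/evaluate interface here is hypothetical, not any real harness.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    name: str
    evaluate: Callable[[str], bool]  # takes the agent's output, returns pass/fail

def noop_agent(task: Task) -> str:
    """An agent that does nothing: returns an empty output."""
    return ""

def null_agent_pass_rate(tasks: list[Task], threshold: float = 0.05) -> float:
    passed = sum(1 for t in tasks if t.evaluate(noop_agent(t)))
    rate = passed / len(tasks)
    if rate > threshold:
        print(f"WARNING: do-nothing agent passes {rate:.0%} of tasks "
              f"(> {threshold:.0%}); the benchmark likely under-specifies success.")
    return rate

# Example: a grader that only checks for the absence of an error string
# will happily pass an empty output -- exactly the failure mode to catch.
tasks = [Task("no-crash check", evaluate=lambda out: "Error" not in out)]
null_agent_pass_rate(tasks)
```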

replies(9): >>44532155 #>>44532406 #>>44532411 #>>44532530 #>>44532765 #>>44532967 #>>44533182 #>>44533517 #>>44535537 #
sdenton4 ◴[] No.44532411[source]
When I was working in audio compression, evaluation was very painful because we had no programmatic way to measure how good some reconstructed audio sounds to a human. Any metric you could come up with was gameable, and direct optimization would lead to artifacts.

As a result, we always had a two-step evaluation process. We would use a suite of metrics to guide development progress (validation), but the final evaluation reported in a paper always involved subjective human listening experiments. This was expensive, but the only way to show that the codecs were actually improving.

Similarly, here it seems fine to use LLMs to judge your work in progress, but we should require human evaluation for 'final' results.
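For concreteness, that two-step split can be as simple as the sketch below: a cheap objective suite run on every development iteration, and human listening scores reserved for the final report. The specific metrics (SNR, log-spectral distance) and function names here are illustrative assumptions, not the actual codec pipeline:

```python
# Illustrative two-stage evaluation split: cheap objective metrics for
# day-to-day development, expensive human listening tests for final results.
# The specific metrics and interfaces are assumptions, not the real pipeline.
import numpy as np

def snr_db(reference: np.ndarray, decoded: np.ndarray) -> float:
    """Signal-to-noise ratio of the reconstruction, in dB."""
    noise = reference - decoded
    return 10 * np.log10(np.sum(reference**2) / (np.sum(noise**2) + 1e-12))

def log_spectral_distance(reference: np.ndarray, decoded: np.ndarray, n_fft=512) -> float:
    """Rough spectral-distortion proxy: RMS difference of log magnitude spectra."""
    ref_spec = np.abs(np.fft.rfft(reference, n_fft)) + 1e-12
    dec_spec = np.abs(np.fft.rfft(decoded, n_fft)) + 1e-12
    return float(np.sqrt(np.mean((np.log(ref_spec) - np.log(dec_spec)) ** 2)))

def validation_report(reference, decoded):
    """Stage 1: run on every development iteration; gameable, so never final."""
    return {"snr_db": snr_db(reference, decoded),
            "lsd": log_spectral_distance(reference, decoded)}

def final_report(listening_test_scores: list[float]):
    """Stage 2: mean opinion score from human listeners; what actually gets published."""
    return {"mos": float(np.mean(listening_test_scores)),
            "n_listeners": len(listening_test_scores)}

# Development-time usage on a toy signal.
ref = np.sin(np.linspace(0, 100, 4096))
print(validation_report(ref, ref + 0.01 * np.random.randn(4096)))
```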

replies(2): >>44532690 #>>44533710 #
ttoinou ◴[] No.44532690[source]
Wouldn't that process prevent you from finding a codec that sounds better subjectively but doesn't improve the typical metrics (PSNR etc.)? An alternative process would be to first build a metric that tries to approximate the subjective experience of human listeners, and then use that metric to create audio codecs optimized against it.
replies(2): >>44533413 #>>44534636 #
1. sdenton4 ◴[] No.44534636[source]
There are two answers to that...

The first is: how do you know the subjective optimization you're making is actually any good? You're just moving the problem back one layer of abstraction.

The second is: we did that, eventually, by training models to predict subjective listening scores from the giant pile of subjective test data we had collected over the years (ViSQoL). It's great, but we still don't trust it for end-of-the-day, cross-codec comparison, because we don't want to reward overfitting to the trained model.

https://arxiv.org/abs/2004.09584
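Stripped to its essentials, the approach is: collect (reference, degraded, mean opinion score) data from listening tests, then fit a model that predicts MOS from objective features. The sketch below uses toy features and placeholder data just to show the shape of it; ViSQoL's real features and model are described in the paper linked above:

```python
# Sketch of the general idea behind learned quality metrics like ViSQoL:
# fit a regressor that maps reference/degraded audio features to the mean
# opinion scores collected in listening tests. This illustrates the concept
# only; it is not ViSQoL's actual feature set or model.
import numpy as np
from sklearn.svm import SVR

def similarity_features(reference: np.ndarray, degraded: np.ndarray) -> np.ndarray:
    """Toy features: band-wise log-spectral differences between reference and degraded."""
    ref = np.abs(np.fft.rfft(reference, 512)) + 1e-12
    deg = np.abs(np.fft.rfft(degraded, 512)) + 1e-12
    diff = np.abs(np.log(ref) - np.log(deg))
    bands = np.array_split(diff, 8)          # collapse to a few band statistics
    return np.array([b.mean() for b in bands])

# X: one feature vector per (reference, degraded) pair from past listening tests.
# y: the mean opinion score (1-5) human listeners gave that pair.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                # placeholder for real feature vectors
y = rng.uniform(1.0, 5.0, size=200)          # placeholder for real MOS labels

model = SVR().fit(X, y)                      # learned metric: features -> predicted MOS

# Development-time use: score a new codec output without running a listening test.
feats = similarity_features(rng.normal(size=4096), rng.normal(size=4096))
predicted_mos = model.predict(feats.reshape(1, -1))
print(predicted_mos)
```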

replies(1): >>44535052 #
2. ttoinou ◴[] No.44535052[source]
Nice

Well, yeah, you would still need human testing.