AI agent benchmarks are broken

(ddkang.substack.com)

181 points neehao | 3 comments | 11 Jul 25 13:06 UTC

jerf ◴[11 Jul 25 13:41 UTC] No.44532037[source]▶

When I was being a bad HN reader and just reacting to the title, my initial impulse was to be placating, and observe that they are probably just immature. After all, for all that has happened, this is still only a couple of years' worth of development, and it does tend to take a long time to develop good benchmarks.

However the article does seem to be pointing out some fundamental issues. I'm particularly annoyed by using LLMs to evaluate the output of LLMs. Anyone with enough experience to be writing benchmarks of this sort in the first place ought to know that's a no-go. It isn't even just using "AI to evaluate AI" per se, but using a judge of the same architecture as the thing being judged maximizes the probability of fundamental failure of the benchmark to be valid due to the judge having the exact same blind spots as the thing under test. As we, at the moment, lack a diversity of AI architectures that can play on the same level as LLMs, it is simply necessary for the only other known intelligence architecture, human brains, to be in the loop for now, however many other difficulties that may introduce into the testing procedures.

Tests that a "do nothing" AI can pass aren't intrinsically invalid but they should certainly be only a very small number of the tests. I'd go with low-single-digit percentage, not 38%. But I would say it should be above zero; we do want to test for the AI being excessively biased in the direction of "doing something", which is a valid failure state.
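One way to keep an eye on that number is to run a trivial baseline through the same grader as everything else. A minimal sketch, assuming a hypothetical task format (a dict with an "expected" string) and an exact-match grader; neither comes from the article or from any particular benchmark:

```python
from typing import Callable, Dict, List

# Hypothetical task shape: a prompt plus the expected final answer.
Task = Dict[str, str]


def noop_agent(task: Task) -> str:
    """Baseline that takes no action and returns an empty answer."""
    return ""


def exact_match(task: Task, output: str) -> bool:
    """Assumed grader: pass only if the output equals the expected answer."""
    return output.strip() == task["expected"].strip()


def noop_pass_rate(tasks: List[Task], grader: Callable[[Task, str], bool]) -> float:
    """Fraction of tasks that the do-nothing baseline passes."""
    if not tasks:
        return 0.0
    passed = sum(1 for task in tasks if grader(task, noop_agent(task)))
    return passed / len(tasks)


# Illustrative tasks only. The first one is correctly solved by doing nothing.
tasks = [
    {"prompt": "Cancel the order that does not exist", "expected": ""},
    {"prompt": "Add 45 and 8 minutes", "expected": "53 minutes"},
    {"prompt": "Rename a.txt to b.txt", "expected": "renamed"},
    {"prompt": "Book a flight", "expected": "booked"},
]

rate = noop_pass_rate(tasks, exact_match)
print(f"do-nothing pass rate: {rate:.0%}")
```

If the reported rate is far above a few percent, the task mix or the grader probably needs another look before any agent scores are compared.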

replies(9): >>44532155 #>>44532406 #>>44532411 #>>44532530 #>>44532765 #>>44532967 #>>44533182 #>>44533517 #>>44535537 #

jstummbillig ◴[11 Jul 25 14:48 UTC] No.44532765[source]▶

>>44532037 #

> using a judge of the same architecture as the thing being judged maximizes the probability of fundamental failure of the benchmark to be valid due to the judge having the exact same blind spots as the thing under test.

That's what humans do all the time. What's the fundamental difference? Or are you saying that's also broken?

replies(4): >>44532931 #>>44533017 #>>44533421 #>>44533789 #

rsynnott ◴[11 Jul 25 15:09 UTC] No.44533017[source]▶

>>44532765 #

... I mean, when evaluating "45 + 8 minutes" where the expected answer was "63 minutes", as in the article, a competent human reviewer does not go "hmm, yes, that seems plausible, it probably succeeded, give it the points".

I know LLM evangelists love this "humans make mistakes too" line, but, really, only an _exceptionally_ incompetent human evaluator would fall for that one.
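For answers like this one, a deterministic check needs no judgment at all. A minimal sketch, assuming answers are short strings that contain the word "minutes" and use only addition; the function names are hypothetical and this is not the grader used by any benchmark in the article:

```python
import re
from typing import Optional


def total_minutes(text: str) -> Optional[int]:
    """Sum every integer in a string such as '45 + 8 minutes'.

    Returns None when the text has no number or does not mention minutes.
    Assumes addition only; anything more complex needs a real parser.
    """
    numbers = re.findall(r"\d+", text)
    if not numbers or "minute" not in text:
        return None
    return sum(int(n) for n in numbers)


def grade(answer: str, expected: str) -> bool:
    """Pass only when both sides evaluate to the same number of minutes."""
    got = total_minutes(answer)
    want = total_minutes(expected)
    return got is not None and got == want


print(total_minutes("45 + 8 minutes"))        # 53
print(grade("45 + 8 minutes", "63 minutes"))  # False
print(grade("63 minutes", "63 minutes"))      # True
```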

replies(1): >>44536189 #

1. brookst ◴[11 Jul 25 19:48 UTC] No.44536189[source]▶

>>44533017 #

Have you ever hired human evaluators at scale? They make all sorts of mistakes. The probability is relatively low, so it’s a noise factor, but I have yet to meet the human who is 100% accurate at simple tasks done thousands of times.

replies(1): >>44537202 #

2. Jensson ◴[11 Jul 25 22:01 UTC] No.44537202[source]▶

>>44536189 (TP) #

Which is why you hire them at scale, as you say; then they are very reliable. LLMs at scale are not.

The problem with these AI models is there is no such point where you can just scale them up and they can solve problems as accurately as a group of humans. They add too much noise and eventually go haywire when left to their own devices.
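The "reliable at scale" argument rests on majority voting over independent raters. A minimal sketch of that arithmetic, assuming every rater has the same error rate p and that errors are independent; both are assumptions, and the independence one is exactly what is in dispute when the raters share a blind spot:

```python
from math import comb


def majority_error(p: float, n: int) -> float:
    """Probability that a majority of n independent raters is wrong.

    p is the per-rater error rate; n must be odd so there are no ties.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd")
    k_min = n // 2 + 1  # smallest number of wrong raters that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


for n in (1, 3, 5):
    print(n, majority_error(0.05, n))
```

With a 5% per-rater error rate, three independent raters are wrong together about 0.7% of the time. If the errors are correlated, adding raters does not buy this improvement.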

replies(1): >>44542808 #

3. brookst ◴[12 Jul 25 15:38 UTC] No.44542808[source]▶

>>44537202 #

I haven’t found that to be the case. Both LLMs and humans produce outputs that cannot be blindly trusted to be accurate.
