AI agent benchmarks are broken

(ddkang.substack.com)

181 points neehao | 2 comments | 11 Jul 25 13:06 UTC


jerf ◴[11 Jul 25 13:41 UTC] No.44532037[source]▶

When I was being a bad HN reader and just reacting to the title, my initial impulse was to be placating, and observe that they are probably just immature. After all, for all that has happened, this is still only a couple of years' worth of development, and it does tend to take a long time to develop good benchmarks.

However, the article does seem to be pointing out some fundamental issues. I'm particularly annoyed by using LLMs to evaluate the output of LLMs. Anyone with enough experience to be writing benchmarks of this sort in the first place ought to know that's a no-go. The problem isn't even just using "AI to evaluate AI" per se: using a judge of the same architecture as the thing being judged maximizes the probability that the benchmark fundamentally fails to be valid, because the judge has the exact same blind spots as the thing under test. Since we currently lack a diversity of AI architectures that can play on the same level as LLMs, it is simply necessary for the only other known intelligence architecture, human brains, to be in the loop for now, however many other difficulties that may introduce into the testing procedures.
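To make the shared-blind-spot point concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration: the item indices, the idea that a "blind spot" is a fixed set of items, and the rule that a judge approves any answer on an item inside its own blind spot. It is not a model of any real benchmark or LLM judge.

```python
def measured_accuracy(n_items, model_blind, judge_blind):
    """Fraction of items the judge approves.

    Toy assumptions (hypothetical, not from the article):
    - the model answers wrongly exactly on items in model_blind
    - the judge cannot detect errors on items in judge_blind,
      so it approves whatever it sees there
    """
    approved = 0
    for i in range(n_items):
        model_correct = i not in model_blind
        judge_approves = model_correct or (i in judge_blind)
        approved += judge_approves
    return approved / n_items


N = 100
model_blind = set(range(0, 20))       # model is wrong on items 0..19
same_arch_judge = set(range(0, 20))   # identical blind spots
other_judge = set(range(15, 35))      # only partly overlapping blind spots

true_acc = (N - len(model_blind)) / N
same_acc = measured_accuracy(N, model_blind, same_arch_judge)
other_acc = measured_accuracy(N, model_blind, other_judge)

print(f"true accuracy:            {true_acc:.2f}")
print(f"same-architecture judge:  {same_acc:.2f}")
print(f"partly independent judge: {other_acc:.2f}")
```

Under these assumptions the same-architecture judge hides every one of the model's errors, while a judge with different blind spots only hides the errors that fall in the overlap.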

Tests that a "do nothing" AI can pass aren't intrinsically invalid, but they should certainly be only a very small fraction of the tests. I'd go with a low-single-digit percentage, not 38%. But I would say it should be above zero; we do want to test for the AI being excessively biased in the direction of "doing something", which is a valid failure state.
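One way to check this on a given suite is to run a null agent through it and see what fraction of tasks it passes. The sketch below is a minimal illustration, assuming each task is a prompt plus a checker function; `null_agent`, `null_pass_rate`, and the toy tasks are hypothetical names and data, not taken from any real benchmark.

```python
from typing import Callable, List, Tuple

# A task is a prompt plus a checker that decides pass/fail from the output.
Task = Tuple[str, Callable[[str], bool]]


def null_agent(prompt: str) -> str:
    """Hypothetical 'do nothing' agent: ignores the prompt, returns nothing."""
    return ""


def null_pass_rate(tasks: List[Task]) -> float:
    """Fraction of tasks that the do-nothing agent passes."""
    if not tasks:
        raise ValueError("task list is empty")
    passed = sum(1 for prompt, check in tasks if check(null_agent(prompt)))
    return passed / len(tasks)


# Toy suite: only the second task is one where inaction is the right answer.
toy_tasks: List[Task] = [
    ("Book a flight to Paris", lambda out: "PAR" in out),
    ("Refund a non-refundable ticket", lambda out: "refund issued" not in out),
    ("Compute 2 + 2", lambda out: out.strip() == "4"),
    ("Summarize the log", lambda out: len(out) > 0),
]

rate = null_pass_rate(toy_tasks)
print(f"do-nothing pass rate: {rate:.0%}")
```

A suite owner could treat this rate as a sanity metric: above zero, so that over-eagerness is still tested, but far below the 38% mentioned above.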

replies(9): >>44532155 #>>44532406 #>>44532411 #>>44532530 #>>44532765 #>>44532967 #>>44533182 #>>44533517 #>>44535537 #

1. xnx ◴[11 Jul 25 15:21 UTC] No.44533182[source]▶

>>44532037 #

> I'm particularly annoyed by using LLMs to evaluate the output of LLMs

This does seem a little crazy on its face, but it is yielding useful and improving tools.

replies(1): >>44533529 #

2. jerf ◴[11 Jul 25 15:47 UTC] No.44533529[source]▶

>>44533182 (TP) #

It's not about it being crazy and it's not about personal opinions about AI. It's about chaos mathematics. Iterating with the same system like that has certain easy-to-understand failure states. It's why I phrased it specifically in terms of using the same architecture to validate itself. If we had two radically different AI architectures that were capable of evaluating each other, firing them at each other for evaluation purposes would be much, much less susceptible to this sort of problem than firing either of them at themselves. That will never be a good idea.

See also a cousin comment of mine observing that human brains are absolutely susceptible to the same effect. We're just so used to it that it is the water we swim through. (And arguably human brains are more diverse than current AI systems functioning at this level. No bet on how long that will be true for, though.)

Such composite systems would still have their own characteristics and certainly wouldn't be guaranteed to be perfect or anything, but at least they would not tend to iteratively magnify their own individual flaws.

Perhaps someday we will have such diverse architectures. We don't today have anything that can evaluate LLMs other than human brains, though.
