
AI agent benchmarks are broken

(ddkang.substack.com)
181 points neehao | 3 comments
jerf ◴[] No.44532037[source]
When I was being a bad HN reader and just reacting to the title, my initial impulse was to be placating and observe that the benchmarks are probably just immature. After all, for all that has happened, this is still only a couple of years' worth of development, and it does tend to take a long time to develop good benchmarks.

However, the article does seem to be pointing out some fundamental issues. I'm particularly annoyed by using LLMs to evaluate the output of LLMs. Anyone with enough experience to be writing benchmarks of this sort in the first place ought to know that's a no-go. It isn't even just using "AI to evaluate AI" per se, but using a judge of the same architecture as the thing being judged maximizes the probability of fundamental failure of the benchmark to be valid due to the judge having the exact same blind spots as the thing under test. Since we currently lack a diversity of AI architectures that can play on the same level as LLMs, it is simply necessary for the only other known intelligence architecture, human brains, to be in the loop for now, however many other difficulties that may introduce into the testing procedures.

Tests that a "do nothing" AI can pass aren't intrinsically invalid, but they should certainly be only a very small fraction of the tests. I'd go with a low-single-digit percentage, not 38%. But I would say it should be above zero; we do want to test for the AI being excessively biased in the direction of "doing something", which is a valid failure state.
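
To make that check concrete, here is a minimal sketch of auditing a benchmark against a do-nothing baseline. Everything in it is an assumption for illustration: the `Checker` type, the `do_nothing_agent`, `null_pass_rate`, and `audit` helpers, and the 5% ceiling standing in for "low-single-digit percentage" are all hypothetical and not taken from the article or from any real benchmark harness.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical task representation: each task is reduced to a checker that
# receives the agent's final output and returns True if the task passes.
Checker = Callable[[str], bool]


def do_nothing_agent(_prompt: str) -> str:
    """Baseline agent that takes no action and returns an empty answer."""
    return ""


def null_pass_rate(checkers: Iterable[Checker]) -> float:
    """Fraction of tasks that the do-nothing baseline passes."""
    results = [bool(check(do_nothing_agent(""))) for check in checkers]
    if not results:
        raise ValueError("benchmark has no tasks")
    return sum(results) / len(results)


def audit(checkers: Iterable[Checker], max_rate: float = 0.05) -> Tuple[float, bool]:
    """Return (rate, ok).

    ok is True only when the baseline passes some tasks (so a bias toward
    always "doing something" is still tested) but no more than max_rate.
    The 0.05 default is an assumed threshold, not a published standard.
    """
    rate = null_pass_rate(checkers)
    return rate, 0.0 < rate <= max_rate
```

A harness could run `audit` over its task checkers before publishing scores and treat a `False` result as a sign that the task mix needs review, either because too many tasks are trivially passable or because none of them reward correctly doing nothing.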

jstummbillig ◴[] No.44532765[source]
> using a judge of the same architecture as the thing being judged maximizes the probability of fundamental failure of the benchmark to be valid due to the judge having the exact same blind spots as the thing under test.

That's what humans do all the time. What's the fundamental difference? Or are you saying that's also broken?

qsort ◴[] No.44532931[source]
We want machines that are better than humans, otherwise what purpose do they serve?
1. xnx ◴[] No.44533165[source]
A machine with human-level "AI" is still useful if it can run 24/7 and you can spin up 1M instances.
2. einrealist ◴[] No.44535311[source]
And boil the planet.
3. fragmede ◴[] No.44536137[source]
And they don't have family members who get sick or die, and they don't come into work hungover, go off on political tangents and cause HR issues, want to take vacations, or complain about bad working conditions.