
AI agent benchmarks are broken

(ddkang.substack.com)
181 points by neehao | 12 comments
jerf ◴[] No.44532037[source]
When I was being a bad HN reader and just reacting to the title, my initial impulse was to be placating, and observe that they are probably just immature. After all, for all that has happened, this is still only a couple years' worth of development, and it does tend to take a long time to develop good benchmarks.

However, the article does seem to be pointing out some fundamental issues. I'm particularly annoyed by using LLMs to evaluate the output of LLMs. Anyone with enough experience to be writing benchmarks of this sort in the first place ought to know that's a no-go. It isn't even just using "AI to evaluate AI" per se; using a judge with the same architecture as the thing being judged maximizes the probability that the benchmark fails to be valid, because the judge has the exact same blind spots as the thing under test. Since we currently lack a diversity of AI architectures that can play on the same level as LLMs, the only other known intelligence architecture, human brains, simply has to be in the loop for now, however many other difficulties that may introduce into the testing procedures.

Tests that a "do nothing" AI can pass aren't intrinsically invalid, but they should certainly be only a very small fraction of the suite. I'd go with a low-single-digit percentage, not 38%. But I would say it should be above zero; we do want to test for the AI being excessively biased toward "doing something", which is a valid failure state.
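For illustration, a minimal sketch of auditing a suite against a do-nothing baseline; `tasks` and `score` are hypothetical placeholders for whatever the benchmark harness actually provides, not taken from any particular one:

```python
# Hypothetical sketch: "tasks" and "score" stand in for the benchmark's own plumbing.

def null_agent(task):
    """An agent that takes no actions and returns an empty answer."""
    return ""

def do_nothing_pass_rate(tasks, score):
    """Fraction of tasks that mark an empty answer as a success."""
    passed = [t for t in tasks if score(t, null_agent(t))]
    rate = len(passed) / len(tasks)
    print(f"{len(passed)}/{len(tasks)} tasks ({rate:.0%}) pass with a do-nothing agent")
    return rate
```

If that number comes back anywhere near the 38% the article reports rather than in the low single digits, the suite is mostly measuring nothing.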

replies(9): >>44532155 #>>44532406 #>>44532411 #>>44532530 #>>44532765 #>>44532967 #>>44533182 #>>44533517 #>>44535537 #
1. jstummbillig ◴[] No.44532765[source]
> using a judge of the same architecture as the thing being judged maximizes the probability of fundamental failure of the benchmark to be valid due to the judge having the exact same blind spots as the thing under test.

That's what humans do all the time. What's the fundamental difference? Or are you saying that's also broken?

replies(4): >>44532931 #>>44533017 #>>44533421 #>>44533789 #
2. qsort ◴[] No.44532931[source]
We want machines that are better than humans, otherwise what purpose do they serve?
replies(1): >>44533165 #
3. rsynnott ◴[] No.44533017[source]
... I mean, when evaluating "45 + 8 minutes" where the expected answer was "63 minutes", as in the article, a competent human reviewer does not go "hmm, yes, that seems plausible, it probably succeeded, give it the points".

I know LLM evangelists love this "humans make mistakes too" line, but, really, only an _exceptionally_ incompetent human evaluator would fall for that one.
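For a case like that, a plain programmatic check is enough. A rough sketch (the answer format here is assumed for illustration, not taken from the benchmark):

```python
import re

def minutes_from_answer(text):
    """Pull an integer number of minutes out of an answer string,
    evaluating a simple 'a + b' expression if that's what the agent produced."""
    expr = re.search(r"(\d+)\s*\+\s*(\d+)", text)
    if expr:
        return int(expr.group(1)) + int(expr.group(2))
    num = re.search(r"\d+", text)
    return int(num.group()) if num else None

# "45 + 8 minutes" evaluates to 53, not the expected 63 -- an exact check fails it,
# where an LLM judge asked "does this look right?" apparently let it through.
assert minutes_from_answer("45 + 8 minutes") == 53
assert minutes_from_answer("45 + 8 minutes") != 63
```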

replies(1): >>44536189 #
4. xnx ◴[] No.44533165[source]
A machine with human-level "AI" is still useful if it can run 24/7 and you can spin up 1M instances.
replies(2): >>44535311 #>>44536137 #
5. jerf ◴[] No.44533421[source]
Yes, humans evaluating humans also causes human foibles to be magnified.

I cite the entire current education system. Substantiating that claim would take more than an HN comment allows, though I think most people can probably get the drift of what I'm talking about, even if we'd disagree about the details. Humans are absolutely not immune to this.

I also cite the entire concept of "fallacies", many of which are things that human brains both tend to produce and tend to evaluate poorly. An alien species might find some of our fallacies absolutely transparent, and have entirely different fallacies of their own that none of us would find convincing in the slightest, because of fundamentally different brain architectures.

I don't think AIs are ready for this yet, and I don't expect LLMs ever will be, but in the future, getting an outsider perspective from them in a sort of Mixture of Experts architecture could be valuable for life decisions. (I'm looking toward future AI architectures in which LLMs are just a component, not the whole.)

6. jacobr1 ◴[] No.44533789[source]
The equivalent would be having the _same human_ review their own work. We require others with different experience and fresh eyes for secondary review, and for the most important tasks, multiple people.

To some extent the same LLM with a fresh context history and a different prompt is sorta like that... but it's still much weaker than using a different system entirely.
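To make the distinction concrete, a rough sketch; `call_model(name, prompt)` is a placeholder for whatever client you'd actually use, and the model names are made up:

```python
# Hypothetical sketch: call_model(model_name, prompt) is a placeholder, not a real API.

def self_review(task, answer, call_model):
    # The same model family re-reads its own output in a fresh context:
    # cheaper, and better than nothing, but it shares the generator's blind spots.
    return call_model("family-A-judge", f"Task: {task}\nAnswer: {answer}\nCorrect? yes/no")

def cross_review(task, answer, call_model):
    # A judge from a different family (or, better, a deterministic checker)
    # is less likely to fail in exactly the same way the generator did.
    return call_model("family-B-judge", f"Task: {task}\nAnswer: {answer}\nCorrect? yes/no")
```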

replies(1): >>44536172 #
7. einrealist ◴[] No.44535311{3}[source]
And boil the planet.
8. fragmede ◴[] No.44536137{3}[source]
and they don't have family that gets sick or dies, or come into work hungover, or go off on political tangents and cause HR issues, or want to take vacations, or complain about bad working conditions.
9. brookst ◴[] No.44536172[source]
How do you feel about o3 reviewing 4o-mini?
10. brookst ◴[] No.44536189[source]
Have you ever hired human evaluators at scale? They make all sorts of mistakes. Relatively low probability, so it's a noise factor, but I have yet to meet the human who is 100% accurate at simple tasks done thousands of times.
replies(1): >>44537202 #
11. Jensson ◴[] No.44537202{3}[source]
Which is why you hire them at scale, as you say; then they are very reliable. LLMs at scale are not.

The problem with these AI models is that there is no point where you can just scale them up and have them solve problems as accurately as a group of humans. They add too much noise and eventually go haywire when left to their own devices.
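A toy simulation of that difference, with made-up error rates: independent raters' mistakes wash out under majority vote, while raters who share blind spots hit a floor that no amount of scaling fixes.

```python
import random

def majority_error_rate(n_raters, p_error, p_shared_blind_spot, trials=50_000):
    """Estimate how often a majority of raters misjudges an item.
    With p_shared_blind_spot > 0, some items hit a failure mode every rater shares."""
    wrong = 0
    for _ in range(trials):
        shared = random.random() < p_shared_blind_spot
        errors = sum(1 for _ in range(n_raters)
                     if shared or random.random() < p_error)
        if errors > n_raters // 2:
            wrong += 1
    return wrong / trials

# Nine independent raters, each wrong 10% of the time: the majority is almost never wrong.
print(majority_error_rate(9, p_error=0.10, p_shared_blind_spot=0.0))   # ~0.001
# Same raters, but 5% of items hit a shared blind spot: the error floor stays near 5%.
print(majority_error_rate(9, p_error=0.10, p_shared_blind_spot=0.05))  # ~0.05
```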

replies(1): >>44542808 #
12. brookst ◴[] No.44542808{4}[source]
I haven’t found that to be the case. Both LLMs and humans produce outputs that cannot be blindly trusted to be accurate.