
AI agent benchmarks are broken

(ddkang.substack.com)
181 points by neehao | 12 comments
jerf ◴[] No.44532037[source]
When I was being a bad HN reader and just reacting to the title, my initial impulse was to be placating, and observe that they are probably just immature. After all, for all that has happened, this is still only a couple years' worth of development, and it does tend to take a long time to develop good benchmarks.

However, the article does seem to be pointing out some fundamental issues. I'm particularly annoyed by using LLMs to evaluate the output of LLMs. Anyone with enough experience to be writing benchmarks of this sort in the first place ought to know that's a no-go. It isn't even just using "AI to evaluate AI" per se; using a judge with the same architecture as the thing being judged maximizes the probability that the benchmark fails to be valid, because the judge has the exact same blind spots as the thing under test. Since we currently lack a diversity of AI architectures that can play on the same level as LLMs, the only other known intelligence architecture, human brains, simply has to be in the loop for now, however many other difficulties that may introduce into the testing procedures.

Tests that a "do nothing" AI can pass aren't intrinsically invalid, but they should certainly be only a very small fraction of the suite. I'd go with a low-single-digit percentage, not 38%. But I would say it should be above zero; we do want to test for the AI being excessively biased toward "doing something", which is a valid failure state.
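For illustration, a minimal sketch of auditing a suite against a do-nothing baseline; `tasks` and `score` are hypothetical placeholders for whatever the benchmark harness actually provides, not taken from any particular one:

```python
# Hypothetical sketch: "tasks" and "score" stand in for the benchmark's own plumbing.

def null_agent(task):
    """An agent that takes no actions and returns an empty answer."""
    return ""

def do_nothing_pass_rate(tasks, score):
    """Fraction of tasks that mark an empty answer as a success."""
    passed = [t for t in tasks if score(t, null_agent(t))]
    rate = len(passed) / len(tasks)
    print(f"{len(passed)}/{len(tasks)} tasks ({rate:.0%}) pass with a do-nothing agent")
    return rate
```

If that number comes back anywhere near the 38% the article reports rather than in the low single digits, the suite is mostly measuring nothing.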

replies(9): >>44532155 #>>44532406 #>>44532411 #>>44532530 #>>44532765 #>>44532967 #>>44533182 #>>44533517 #>>44535537 #
1. jstummbillig ◴[] No.44532765[source]
> using a judge of the same architecture as the thing being judged maximizes the probability of fundamental failure of the benchmark to be valid due to the judge having the exact same blind spots as the thing under test.

That's what humans do all the time. What's the fundamental difference? Or are you saying that's also broken?

replies(4): >>44532931 #>>44533017 #>>44533421 #>>44533789 #
2. qsort ◴[] No.44532931[source]
We want machines that are better than humans, otherwise what purpose do they serve?
replies(1): >>44533165 #
3. rsynnott ◴[] No.44533017[source]
... I mean, when evaluating "45 + 8 minutes" where the expected answer was "63 minutes", as in the article, a competent human reviewer does not go "hmm, yes, that seems plausible, it probably succeeded, give it the points".

I know LLM evangelists love this "humans make mistakes too" line, but, really, only an _exceptionally_ incompetent human evaluator would fall for that one.
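For a case like that, a plain programmatic check is enough. A rough sketch (the answer format here is assumed for illustration, not taken from the benchmark):

```python
import re

def minutes_from_answer(text):
    """Pull an integer number of minutes out of an answer string,
    evaluating a simple 'a + b' expression if that's what the agent produced."""
    expr = re.search(r"(\d+)\s*\+\s*(\d+)", text)
    if expr:
        return int(expr.group(1)) + int(expr.group(2))
    num = re.search(r"\d+", text)
    return int(num.group()) if num else None

# "45 + 8 minutes" evaluates to 53, not the expected 63 -- an exact check fails it,
# where an LLM judge asked "does this look right?" apparently let it through.
assert minutes_from_answer("45 + 8 minutes") == 53
assert minutes_from_answer("45 + 8 minutes") != 63
```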

replies(1): >>44536189 #
4. xnx ◴[] No.44533165[source]
A machine with human-level "AI" is still useful if it can run 24/7 and you can spin up 1M instances.
replies(2): >>44535311 #>>44536137 #
5. jerf ◴[] No.44533421[source]
Yes, humans evaluating humans also causes human foibles to be magnified.

I cite the entire current education system. Substantiating that claim would take more than an HN comment allows, though I think most people can probably get the drift of what I'm talking about, even if we'd disagree about the details. Humans are absolutely not immune to this.

I also cite the entire concept of "fallacies", many of which are things that human brains both tend to produce and tend to evaluate poorly. An alien species might find some of our fallacies absolutely transparent, and have entirely different fallacies of their own that none of us would find convincing in the slightest, because of fundamentally different brain architectures.

I don't think AIs are ready for this yet, and I don't expect LLMs ever will be, but in the future, getting an outsider perspective from them in a sort of Mixture of Experts architecture could be valuable for life decisions. (I'm looking toward future AI architectures in which LLMs are just a component, not the whole.)

6. jacobr1 ◴[] No.44533789[source]
The equivalent would be having the _same human_ review their own work. We require others with different experience and fresh eyes for secondary review, and for the most important tasks, multiple people.

To some extent the same LLM with a fresh context history and a different prompt is sorta like that... but it's still much weaker than using a different system entirely.
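To make the distinction concrete, a rough sketch; `call_model(name, prompt)` is a placeholder for whatever client you'd actually use, and the model names are made up:

```python
# Hypothetical sketch: call_model(model_name, prompt) is a placeholder, not a real API.

def self_review(task, answer, call_model):
    # The same model family re-reads its own output in a fresh context:
    # cheaper, and better than nothing, but it shares the generator's blind spots.
    return call_model("family-A-judge", f"Task: {task}\nAnswer: {answer}\nCorrect? yes/no")

def cross_review(task, answer, call_model):
    # A judge from a different family (or, better, a deterministic checker)
    # is less likely to fail in exactly the same way the generator did.
    return call_model("family-B-judge", f"Task: {task}\nAnswer: {answer}\nCorrect? yes/no")
```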

replies(1): >>44536172 #
7. einrealist ◴[] No.44535311{3}[source]
And boil the planet.
8. fragmede ◴[] No.44536137{3}[source]
and they don't have family that gets sick or dies, or come into work hungover, or go off on political tangents and cause HR issues, or want to take vacations, or complain about bad working conditions.
9. brookst ◴[] No.44536172[source]
How do you feel about o3 reviewing 4o-mini?
10. brookst ◴[] No.44536189[source]
Have you ever hired human evaluators at scale? They make all sorts of mistakes. Relatively low probability, so it's a noise factor, but I have yet to meet the human who is 100% accurate at simple tasks done thousands of times.
replies(1): >>44537202 #
11. Jensson ◴[] No.44537202{3}[source]
Which is why you hire them at scale, as you say; then they are very reliable. LLMs at scale are not.

The problem with these AI models is that there is no point where you can just scale them up and have them solve problems as accurately as a group of humans. They add too much noise and eventually go haywire when left to their own devices.
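A toy simulation of that difference, with made-up error rates: independent raters' mistakes wash out under majority vote, while raters who share blind spots hit a floor that no amount of scaling fixes.

```python
import random

def majority_error_rate(n_raters, p_error, p_shared_blind_spot, trials=50_000):
    """Estimate how often a majority of raters misjudges an item.
    With p_shared_blind_spot > 0, some items hit a failure mode every rater shares."""
    wrong = 0
    for _ in range(trials):
        shared = random.random() < p_shared_blind_spot
        errors = sum(1 for _ in range(n_raters)
                     if shared or random.random() < p_error)
        if errors > n_raters // 2:
            wrong += 1
    return wrong / trials

# Nine independent raters, each wrong 10% of the time: the majority is almost never wrong.
print(majority_error_rate(9, p_error=0.10, p_shared_blind_spot=0.0))   # ~0.001
# Same raters, but 5% of items hit a shared blind spot: the error floor stays near 5%.
print(majority_error_rate(9, p_error=0.10, p_shared_blind_spot=0.05))  # ~0.05
```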

replies(1): >>44542808 #
12. brookst ◴[] No.44542808{4}[source]
I haven’t found that to be the case. Both LLMs and humans produce outputs that cannot be blindly trusted to be accurate.