
AI agent benchmarks are broken

(ddkang.substack.com)
181 points by neehao | 1 comment
jerf ◴[] No.44532037[source]
When I was being a bad HN reader and just reacting to the title, my initial impulse was to be placating and observe that these benchmarks are probably just immature. After all, for all that has happened, this is still only a couple of years' worth of development, and it does tend to take a long time to develop good benchmarks.

However, the article does seem to be pointing out some fundamental issues. I'm particularly annoyed by using LLMs to evaluate the output of LLMs. Anyone with enough experience to be writing benchmarks of this sort in the first place ought to know that's a no-go. It isn't even "using AI to evaluate AI" per se that's the problem: using a judge of the same architecture as the thing being judged maximizes the probability that the benchmark is invalid, because the judge has the exact same blind spots as the thing under test. Since we currently lack a diversity of AI architectures that can play on the same level as LLMs, the only other known intelligence architecture, human brains, simply has to be in the loop for now, however many other difficulties that introduces into the testing procedure.
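To make that concrete, here's a rough Python sketch of the two evaluation styles; call_llm, the judge prompt, and the exact-match check are hypothetical placeholders, not any particular benchmark's API:

    # Hypothetical sketch of the two evaluation approaches discussed above.
    # call_llm and the prompt format are placeholders, not a real API.

    def call_llm(model: str, prompt: str) -> str:
        """Stand-in for a call to a hosted LLM."""
        raise NotImplementedError

    def llm_judge(task: str, agent_output: str) -> bool:
        # The judge is itself an LLM, often the same architecture as the agent
        # under test, so its blind spots are correlated with the agent's:
        # a plausible-but-wrong answer can fool both at once.
        verdict = call_llm(
            model="judge-model",
            prompt=f"Task: {task}\nAgent answer: {agent_output}\nReply PASS or FAIL.",
        )
        return verdict.strip().upper().startswith("PASS")

    def ground_truth_judge(agent_output: str, expected: str) -> bool:
        # A programmatic (or human-reviewed) check shares no blind spots with
        # the agent, but requires the expensive curation work discussed below.
        return agent_output.strip() == expected.strip()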

Tests that a "do nothing" AI can pass aren't intrinsically invalid, but they should make up only a very small fraction of the suite. I'd go with a low-single-digit percentage, not 38%. It should be above zero, though; we do want to test for the AI being excessively biased toward "doing something", which is a valid failure state.
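One cheap sanity check is to run the suite against a no-op baseline and see what it scores. A minimal sketch, assuming each task exposes a checker(output) -> bool; real harness interfaces will differ:

    # Hypothetical audit: what fraction of tasks does a "do nothing" agent pass?
    # The Task shape below is illustrative, not a specific benchmark's schema.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Task:
        prompt: str
        checker: Callable[[str], bool]  # returns True if the output passes

    def noop_agent(task: Task) -> str:
        """An agent that takes no action and returns an empty result."""
        return ""

    def noop_pass_rate(tasks: List[Task]) -> float:
        passed = sum(1 for t in tasks if t.checker(noop_agent(t)))
        return passed / len(tasks)

    # If this comes back around 0.38 instead of a low single-digit fraction,
    # the suite is rewarding inaction rather than agent competence.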

replies(9): >>44532155 #>>44532406 #>>44532411 #>>44532530 #>>44532765 #>>44532967 #>>44533182 #>>44533517 #>>44535537 #
potatolicious ◴[] No.44532155[source]
> "I'm particularly annoyed by using LLMs to evaluate the output of LLMs."

+1, and IMO part of a general trend where we're just not serious about making sure this shit works. Higher scores make stonks go up, who cares if it actually leads to reliably working products.

But more importantly, it's starting to expose the fact that we haven't solved one of ML's core challenges: data collection and curation. On the training side we have obviated this somewhat (by ingesting the whole internet, for example), but on the eval side it feels like we're increasingly just going "actually constructing rigorous evaluation data, especially at this scale, would be very expensive... so let's not".

I was at a local tech meetup recently where a recruiting firm was proudly showing off the LLM-based system they're using to screen candidates. They... did not evaluate the end-to-end efficacy of their system. At all. This seems like a theme within our industry: we're deploying these systems based purely on vibes, without any real quantification of efficacy.

Or in this case, we're quantifying efficacy... poorly.

replies(1): >>44533045 #
rsynnott ◴[] No.44533045[source]
> +1, and IMO part of a general trend where we're just not serious about making sure this shit works.

I suspect quite a lot of the industry is actively _opposed_ to that, because it could be damaging for the "this changes everything" narrative.