
AI agent benchmarks are broken

(ddkang.substack.com)
181 points neehao | 2 comments
jerf ◴[] No.44532037[source]
When I was being a bad HN reader and just reacting to the title, my initial impulse was to be placating, and observe that they are probably just immature. After all, for all that has happened, this is still only a couple of years' worth of development, and it does tend to take a long time to develop good benchmarks.

However, the article does seem to be pointing out some fundamental issues. I'm particularly annoyed by using LLMs to evaluate the output of LLMs. Anyone with enough experience to be writing benchmarks of this sort in the first place ought to know that's a no-go. It isn't even just using "AI to evaluate AI" per se; using a judge of the same architecture as the thing being judged maximizes the probability that the benchmark is fundamentally invalid, because the judge has the exact same blind spots as the thing under test. As we, at the moment, lack a diversity of AI architectures that can play on the same level as LLMs, it is simply necessary for the only other known intelligence architecture, human brains, to be in the loop for now, however many other difficulties that may introduce into the testing procedures.

Tests that a "do nothing" AI can pass aren't intrinsically invalid but they should certainly be only a very small number of the tests. I'd go with low-single-digit percentage, not 38%. But I would say it should be above zero; we do want to test for the AI being excessively biased in the direction of "doing something", which is a valid failure state.
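To make that concrete, here is a minimal sketch of the sanity check I have in mind: run an agent that does nothing over the task set and see what fraction it passes. Everything in it (the `Task` type, the `null_agent` function, the toy checkers) is made up for illustration and is not taken from the article or from any real benchmark harness.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    name: str
    # Hypothetical checker: returns True if the agent's output passes the task.
    check: Callable[[str], bool]


def null_agent(task: Task) -> str:
    """A 'do nothing' agent: it always returns an empty answer."""
    return ""


def null_pass_rate(tasks: List[Task]) -> float:
    """Fraction of tasks that the do-nothing agent passes."""
    if not tasks:
        return 0.0
    passed = sum(1 for t in tasks if t.check(null_agent(t)))
    return passed / len(tasks)


# Toy task set (assumed for illustration): one task where doing nothing is
# the right answer, three where it is not.
tasks = [
    Task("refuse_impossible_request", lambda out: out == ""),
    Task("add_numbers", lambda out: out.strip() == "4"),
    Task("fix_bug", lambda out: "return" in out),
    Task("summarize", lambda out: len(out) > 10),
]

rate = null_pass_rate(tasks)
print(f"do-nothing pass rate: {rate:.0%}")
```

On this toy set the do-nothing agent passes one task out of four. On a real benchmark I'd want that number in the low single digits, and I'd want to read every task it does pass to confirm that "do nothing" really is the correct answer there.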

replies(9): >>44532155 #>>44532406 #>>44532411 #>>44532530 #>>44532765 #>>44532967 #>>44533182 #>>44533517 #>>44535537 #
DonHopkins ◴[] No.44533517[source]
It's like using steel to produce steel. What else are you going to use? Bamboo?
replies(2): >>44533642 #>>44534526 #
1. dmbche ◴[] No.44533642[source]
I'm not sure if I'm dense, but we don't use steel to make steel (whether crucibles or "feed material").

The first person to make steel made it without steel, didn't they?

Did I miss something?

Edit0: fun tidbit - Wootz steel was made with crucibles of clay with rice husks mixed in (husks would carbonize quickly and introduce air layers to better insulate) and many seemingly random objects (fruits, vegetation) were added to the crucible to control carbon content.

I highly recommend A Collection of Unmitigated Pedantry's series on steel (it's a blog, just search "ACOUP steel").

replies(1): >>44536194 #
2. dmbche ◴[] No.44536194[source]
Second fun tidbit: Bamboo was used as the fuel source in some furnaces - they did indeed use bamboo, like the parent comment mentioned.