AI agent benchmarks are broken

(ddkang.substack.com)
181 points by neehao | 3 comments
jerf No.44532037
When I was being a bad HN reader and just reacting to the title, my initial impulse was to be placating and observe that the benchmarks are probably just immature. After all, for all that has happened, this is still only a couple of years' worth of development, and it does tend to take a long time to develop good benchmarks.

However, the article does seem to be pointing out some fundamental issues. I'm particularly annoyed by using LLMs to evaluate the output of LLMs. Anyone with enough experience to be writing benchmarks of this sort in the first place ought to know that's a no-go. The problem isn't even "AI evaluating AI" per se; it's that using a judge with the same architecture as the thing being judged maximizes the probability that the benchmark is invalid, because the judge shares the exact same blind spots as the thing under test. Since we currently lack a diversity of AI architectures that can play on the same level as LLMs, the only other known intelligence architecture, human brains, simply has to be in the loop for now, however many other difficulties that introduces into the testing procedures.

Tests that a "do nothing" AI can pass aren't intrinsically invalid, but they should make up only a very small share of the suite; I'd go with a low-single-digit percentage, not 38%. That share should be above zero, though: we do want to test for the AI being excessively biased toward "doing something", which is a valid failure state.
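
To make that baseline concrete, here's a minimal sketch, using a toy task format and checkers rather than any real benchmark's API, of measuring how many tasks a do-nothing agent passes:

    def noop_agent(task):
        # Takes no actions; just hands back the environment's initial state.
        return task["initial_state"]

    def noop_pass_rate(tasks):
        passed = sum(1 for t in tasks if t["check"](noop_agent(t)))
        return passed / len(tasks)

    # Two toy tasks: the first has a sloppy checker that a no-op agent satisfies.
    tasks = [
        {"initial_state": {"files": []},
         "check": lambda state: "error" not in state},                   # vacuously true
        {"initial_state": {"files": []},
         "check": lambda state: "report.txt" in state.get("files", [])}, # needs real work
    ]

    print(f"do-nothing agent passes {noop_pass_rate(tasks):.0%} of tasks")  # 50%

If that number comes out anywhere near the article's 38% rather than a few percent, the checkers are too loose.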

replies(9): >>44532155 #>>44532406 #>>44532411 #>>44532530 #>>44532765 #>>44532967 #>>44533182 #>>44533517 #>>44535537 #
alextheparrot No.44532406
LLMs evaluating LLM outputs really isn’t that dire…

Discriminating good answers is easier than generating them. Good evaluations write test sets for the discriminators to show when this is or isn't true. Evaluating outputs as the user would see them is also more representative than having your generator do multiple tasks at once (e.g. solve a math query and format the output as a multiple-choice answer).

Also, human labels are good but have problems of their own; it isn't as if using a "different intelligence architecture" elides all the possible errors. Good instructions to the evaluation model often translate directly into better human labeling results, which shows a correlation between these two sources of sampling intelligence.
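
As a sketch of what a "test set for the discriminator" can look like: score the judge against a small set of human-labeled outputs before trusting it. The judge here is a stand-in callable; a real one would wrap a model call with a rubric.

    def judge_agreement(judge, labeled_examples):
        # Fraction of the human verdicts that the judge reproduces.
        hits = sum(1 for ex in labeled_examples
                   if judge(ex["question"], ex["answer"]) == ex["human_label"])
        return hits / len(labeled_examples)

    # Toy labeled set: known-good and known-bad answers with human verdicts.
    labeled = [
        {"question": "2+2?", "answer": "4", "human_label": "pass"},
        {"question": "2+2?", "answer": "5", "human_label": "fail"},
        {"question": "Capital of France?", "answer": "Paris", "human_label": "pass"},
    ]

    def toy_judge(question, answer):
        # Stand-in discriminator; a real one would prompt an LLM.
        return "pass" if answer in {"4", "Paris"} else "fail"

    print(f"judge matches human labels {judge_agreement(toy_judge, labeled):.0%} of the time")

Only once that agreement is high enough for the decision at hand does it make sense to run the judge over unlabeled outputs.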

replies(5): >>44532598 #>>44533069 #>>44533673 #>>44533848 #>>44534579 #
1. tempfile No.44533848
> Discriminating good answers is easier than generating them.

This is actually very wrong. Consider, for instance, that the people who grade your tests in school are typically more talented, capable, and trained than the people taking the test. This is true even when an answer key exists.

> Also, human labels are good but have problems of their own,

Granted, but...

> it isn’t like by using a “different intelligence architecture” we elide all the possible errors

Nobody is claiming this. We elide the specific, obvious problem that using a system to test itself gives you no reliable information. You need a control.
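
One concrete shape the control can take, sketched here with made-up verdicts: audit a random sample of the self-judge's decisions against an independent reference, whether human labels or at least a judge from a different model family, and track the disagreement rate.

    import random

    def audit(self_judge_verdicts, reference_labels, sample_size=50, seed=0):
        # Compare the in-family judge against an independent reference on a random sample.
        rng = random.Random(seed)
        ids = rng.sample(sorted(self_judge_verdicts), min(sample_size, len(self_judge_verdicts)))
        disagreements = [i for i in ids if self_judge_verdicts[i] != reference_labels[i]]
        return len(disagreements) / len(ids), disagreements

    # Toy verdicts: the self-judge waves through two answers the reference rejects.
    self_judge = {1: "pass", 2: "pass", 3: "fail", 4: "pass"}
    reference  = {1: "pass", 2: "fail", 3: "fail", 4: "fail"}
    rate, flagged = audit(self_judge, reference, sample_size=4)
    print(f"disagreement rate: {rate:.0%}, flagged ids: {flagged}")

A disagreement rate well above noise is exactly the shared-blind-spot failure the benchmark can't detect on its own.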

replies(2): >>44536083 #>>44536481 #
2. rf15 No.44536083
Trading control for convenience has been the defining tradeoff of the recent AI hype cycle, and it's the reason so many people like to use ChatGPT.
3. alextheparrot No.44536481
It isn't actually very wrong. Your example is tangential: graders in school have multiple roles, teaching the content and grading. That's an implementation detail, not a counter to the premise.

I don't think we should assume answering a test would be easy for a Scantron machine just because it is very good at grading one, either.