Of course, for such tasks we could benchmark them (a minimal check harness is sketched after the list):
* arithmetic (why would we use an LLM for that?)
* correct JSON syntax, correct command lines, etc.
* looking for specific information in a text
* looking for missing information in a text
* language logic (if/then/else cases where we know the answer in advance)
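
As a rough illustration, here is a minimal sketch of what such checks could look like in Python. This is not a real benchmark suite: `ask_llm` is a hypothetical placeholder for whatever client call you actually use, and the scoring is deliberately naive.

```python
# Minimal sketch of mechanical checks like those listed above.
# `ask_llm` is a hypothetical stand-in for your actual model call;
# everything else uses only the standard library.
import json
import re


def ask_llm(prompt: str) -> str:
    # Placeholder: replace with a real call to your model of choice.
    raise NotImplementedError


def check_arithmetic(a: int, b: int) -> bool:
    """Ask for a sum and compare against the exact answer."""
    reply = ask_llm(f"What is {a} + {b}? Answer with the number only.")
    match = re.search(r"-?\d+", reply)
    return match is not None and int(match.group()) == a + b


def check_json_syntax(prompt: str) -> bool:
    """The output must parse as JSON, nothing more."""
    reply = ask_llm(prompt + "\nRespond with valid JSON only.")
    try:
        json.loads(reply)
        return True
    except json.JSONDecodeError:
        return False


def check_needle(text: str, question: str, expected: str) -> bool:
    """Looking for a specific piece of information planted in a text."""
    reply = ask_llm(f"{text}\n\nQuestion: {question}")
    return expected.lower() in reply.lower()
```

Each of these checks has an unambiguous pass/fail criterion, which is exactly what makes such tasks easy to benchmark in the first place.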
But by Goodhart's Law, LLMs trained to succeed on those benchmarks might lose capability on the tasks where we really need them (fuzzy inputs, fuzzy outputs).