
AI agent benchmarks are broken

(ddkang.substack.com)
181 points by neehao | 4 comments
1. deepdarkforest ◴[] No.44531923[source]
It's very funny how many layers of abstraction we are going through. We have a limited understanding of how LLMs work exactly and why. We now do post-training with RL, which, again, we don't perfectly understand either. Then you stack LLM calls and random tools, and you have agents, and you are attempting to benchmark those. (And this excludes voice, computer-use agents, etc.)
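
A minimal sketch, purely to make the "stack LLM calls and random tools" point concrete: the llm() stub, the TOOLS registry, and the loop below are hypothetical placeholders, not any particular vendor's or framework's API.

    # Hypothetical sketch of the "LLM calls + tools = agent" stack described above.
    def llm(messages: list[dict]) -> dict:
        """Stand-in for a chat-completion call; here it asks for a tool once, then answers."""
        if any(m["role"] == "tool" for m in messages):
            return {"content": f"Answer based on: {messages[-1]['content']}"}
        return {"tool": "search", "args": {"query": messages[-1]["content"]}}

    TOOLS = {"search": lambda query: f"top results for {query!r}"}  # toy tool registry

    def run_agent(user_goal: str, max_steps: int = 5) -> str:
        """Loop: call the model, execute any requested tool, feed the result back."""
        messages = [{"role": "user", "content": user_goal}]
        for _ in range(max_steps):
            reply = llm(messages)
            if "tool" in reply:
                result = TOOLS[reply["tool"]](**reply["args"])
                messages.append({"role": "tool", "content": result})
            else:
                return reply["content"]
        return "gave up after max_steps"

    print(run_agent("Find the latest agent benchmark papers"))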

It's all just vibes; there is no good general benchmark for agents, and I think one is just impossible: there are way too many degrees of freedom to achieve anything useful. Agents are just a complicated tool to achieve things. It's like trying to make a general-purpose benchmark for a stack of 10 microservices together. It does not make sense; it just depends on your use case and your own metrics.
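
A minimal sketch of what "your own use case and your own metrics" could look like in practice: the tasks, checks, and agent interface are hypothetical, just to show that the pass criterion is use-case-specific rather than a general benchmark score.

    # Hypothetical use-case-specific eval harness: you define the tasks and the checks.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Task:
        prompt: str                      # what we ask the agent to do
        check: Callable[[str], bool]     # our own domain-specific success test

    def evaluate(agent: Callable[[str], str], tasks: list[Task]) -> float:
        """Return the fraction of tasks the agent passes under *our* metric."""
        passed = sum(task.check(agent(task.prompt)) for task in tasks)
        return passed / len(tasks)

    # Two toy tasks whose checks only make sense for this particular use case.
    tasks = [
        Task("Summarize the incident report in one sentence.",
             check=lambda out: len(out.split(".")) <= 2),
        Task("Extract the invoice total as a number.",
             check=lambda out: out.strip().replace(".", "", 1).isdigit()),
    ]

    echo_agent = lambda prompt: "42"   # placeholder "agent" for the demo
    print(f"pass rate: {evaluate(echo_agent, tasks):.0%}")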

replies(2): >>44531975 #>>44536131 #
2. bwfan123 ◴[] No.44531975[source]
I can hear echoes of an earlier era.

There were Yahoo Pipes and web-services frameworks, which rhyme with MCP and agentic frameworks.

replies(1): >>44536916 #
3. rf15 ◴[] No.44536131[source]
> We have a limited understanding of how LLMs work exactly and why.

Blatantly untrue, and as a concept it is only useful to those who want to sell AI as this "magical thing" that "just works".

4. th0ma5 ◴[] No.44536916[source]
Pipes and services in general were reliable, but the issues were social and economic. Getting everyone to agree was seen as a great way to have your users poached and to give up control, plus there were the usual problems with open-world vs. closed-world assumptions. Thanks for mentioning this!