AI agent benchmarks are broken

(ddkang.substack.com)
181 points by neehao | 1 comment
deepdarkforest ◴[] No.44531923[source]
It's very funny how many layers of abstraction we are going through. We have limited understanding of how LLMs work exactly and why. We then do post-training with RL, which we don't perfectly understand either. Then you stack LLM calls and random tools on top, and you have agents, and you are attempting to benchmark those (and this excludes voice agents, computer-use agents, etc.).

It's all just vibes; there is no good general benchmark for agents, and I think one is just impossible: there are way too many degrees of freedom to measure anything useful. Agents are just a complicated tool for getting things done. It's like trying to write a general-purpose benchmark for a stack of 10 microservices: it does not make sense, because it depends on your use case and your own metrics (a sketch of what that might look like is below).
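To make the "your own metrics" point concrete, here is a minimal sketch of a use-case-specific eval harness. Task, run_eval, and stub_agent are hypothetical names for illustration, not any real framework's API:

    # Hypothetical sketch: a use-case-specific agent eval where you
    # define your own tasks and your own pass/fail metrics, instead of
    # relying on a general-purpose agent benchmark.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Task:
        prompt: str
        check: Callable[[str], bool]  # your own metric for this use case

    def run_eval(agent: Callable[[str], str], tasks: list[Task]) -> float:
        """Run the agent on every task; return the fraction that pass."""
        passed = sum(1 for t in tasks if t.check(agent(t.prompt)))
        return passed / len(tasks)

    # Stub standing in for a real LLM + tool-calling pipeline.
    def stub_agent(prompt: str) -> str:
        return "4" if "2+2" in prompt else "I'll look into that."

    tasks = [
        Task("What is 2+2?", check=lambda out: out.strip() == "4"),
        Task("Handle a refund request", check=lambda out: "refund" in out.lower()),
    ]

    print(f"pass rate: {run_eval(stub_agent, tasks):.0%}")  # -> pass rate: 50%

Nothing here generalizes: swap in your own tasks and checks and the "benchmark" becomes yours, which is the point.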

replies(2): >>44531975 #>>44536131 #
1. rf15 ◴[] No.44536131[source]
> We have limited understanding of how LLM's work exactly and why.

Blatantly untrue, and as a framing it's only useful to those who want to sell AI as some "magical thing" that "just works".