It's very funny how many layers of abstraction we are going through. We have limited understanding of how LLMs work exactly and why. We then do post-training with RL, which, again, we don't perfectly understand either. Then you stack LLM calls and assorted tools on top, call it an agent, and attempt to benchmark that. (And this excludes voice, computer-use agents, etc.)
It's all just vibes; there is no good general benchmark for agents, and I think one is impossible: there are simply too many degrees of freedom for it to tell you anything useful. Agents are just a complicated tool for getting things done. It's like trying to build a general-purpose benchmark for a stack of 10 microservices: it doesn't make sense. What matters depends on your use case and your own metrics.
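
To make "your own metrics" concrete, here's a minimal sketch of a use-case-specific eval in Python. run_agent is a hypothetical stand-in for whatever stack of LLM calls and tools you actually run, and the scenarios and checks are invented examples; the point is that you define both, rather than borrowing a general benchmark's.

    # Minimal sketch of a use-case-specific agent eval, not a general benchmark.
    # run_agent is a hypothetical stand-in for your own agent stack; the
    # scenarios and pass/fail checks are the part you own.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Scenario:
        name: str
        task: str
        check: Callable[[str], bool]  # your own pass/fail metric

    def evaluate(run_agent: Callable[[str], str],
                 scenarios: list[Scenario]) -> float:
        passed = 0
        for s in scenarios:
            output = run_agent(s.task)
            ok = s.check(output)
            passed += ok
            print(f"{s.name}: {'PASS' if ok else 'FAIL'}")
        return passed / len(scenarios)

    # Example: two made-up scenarios drawn from one team's workload.
    scenarios = [
        Scenario("refund lookup",
                 "Find order #123 and state the refund status.",
                 lambda out: "refunded" in out.lower()),
        Scenario("escalation",
                 "Customer is angry; draft a handoff to a human.",
                 lambda out: "handoff" in out.lower()),
    ]

Twenty scenarios like this, checked against your own definition of success, will tell you more about your agent than any leaderboard.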