AI agent benchmarks are broken

(ddkang.substack.com)
181 points | neehao | 1 comment
1. anupj No.44531868
AI agent benchmarks are starting to feel like the self-driving car demos of 2016: impressive until you realize the test track has speed bumps labeled "success".