(ddkang.substack.com)

181 points neehao | 1 comments | 11 Jul 25 13:06 UTC

1. anupj [11 Jul 25 13:25 UTC] No.44531868

AI agent benchmarks are starting to feel like the self-driving car demos of 2016: impressive until you realize the test track has speed bumps labeled "success".

AI agent benchmarks are broken