(ddkang.substack.com)

181 points neehao | 2 comments | 11 Jul 25 13:06 UTC

1. KTibow [11 Jul 25 17:05 UTC] No.44534575

This is more or less a funnel to their Agentic Benchmark Checklist: https://arxiv.org/abs/2507.02825

replies(1): >>44536725 #

Finally, a benchmark for benchmarks. And what's great is that they already benchmarked their benchmark benchmark.

(Apologies for the benchmark snark. I'm glad people are doing this research, thanks for sharing it.)

AI agent benchmarks are broken