AI agent benchmarks are broken
(ddkang.substack.com)
181 points
neehao
| 1 comments |
11 Jul 25 13:06 UTC
camdenreslink
11 Jul 25 13:59 UTC
No.
44532235
>>44531697 (OP)
The current benchmarks are good for comparing models against each other, but not for measuring absolute ability.
replies(3): >>44532298 >>44532615 >>44533085
1.
qsort
11 Jul 25 14:04 UTC
No.
44532298
>>44532235
Not even that, see LMArena. They vaguely gesture in the general direction of the model being good, but between contamination and issues with scoring they're little more than a vibe check.