GPT-5.2

(openai.com)
1019 points atgctg | 1 comments
mattas ◴[] No.46235111[source]
Are benchmarks the right way to measure LLMs? Not because benchmarks can be gamed, but because the most useful outputs of models aren't things that can be bucketed into "right" and "wrong." Tough problem!
replies(2): >>46235164 #>>46235214 #
olliepro ◴[] No.46235214[source]
Do you have a better way to measure LLMs? Measurement implies quantitative evaluation... which is the same as benchmarks.
replies(1): >>46236704 #
1. Wowfunhappy ◴[] No.46236704[source]
I don’t have a good way to measure them, but I think they should be evaluated more like how we evaluate movies, or restaurants. Namely, experienced critics try them and write reviews.