
GPT-5.2

(openai.com)
1019 points by atgctg | 4 comments
1. mattas ◴[] No.46235111[source]
Are benchmarks the right way to measure LLMs? Not because benchmarks can be gamed, but because the most useful outputs of models aren't things that can be bucketed into "right" and "wrong." Tough problem!
replies(2): >>46235164 #>>46235214 #
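
For concreteness, a minimal sketch of the right/wrong bucketing being questioned here, assuming a hypothetical ask_model callable and a labeled dataset of (prompt, expected) pairs:

    # A minimal sketch of right/wrong bucketing, the classic benchmark style.
    # `ask_model` and `dataset` are hypothetical placeholders, not a real API.

    def exact_match_accuracy(ask_model, dataset):
        """Score (prompt, expected) pairs; every answer becomes a 0 or a 1."""
        correct = sum(
            ask_model(prompt).strip().lower() == expected.strip().lower()
            for prompt, expected in dataset
        )
        return correct / len(dataset)

Open-ended outputs (an essay, a code review, a plan) have no single expected string to match against, which is the objection.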
2. Sir_Twist ◴[] No.46235164[source]
Not an expert in LLM benchmarks, but I generally think of benchmarks as being good particularly for measuring usefulness for certain use cases. Even if measuring LLMs is not as straightforward as, say, comparing read/write speeds across different SSDs, if a certain model's responses are consistently measured as being higher quality / more useful, surely that means something, right?
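
One way people make "consistently measured as higher quality" concrete is a pairwise win rate across two models. A sketch, where prefer is a hypothetical stand-in for whatever preference signal you trust (human raters, a judge model):

    # Sketch: pairwise comparison, assuming a hypothetical `prefer(prompt, a, b)`
    # that returns "a", "b", or "tie" (e.g. a human rater or a judge model).

    def win_rate(model_a, model_b, prompts, prefer):
        """Fraction of prompts on which model_a's response is preferred."""
        wins = ties = 0
        for prompt in prompts:
            verdict = prefer(prompt, model_a(prompt), model_b(prompt))
            wins += verdict == "a"
            ties += verdict == "tie"
        return (wins + 0.5 * ties) / len(prompts)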
3. olliepro ◴[] No.46235214[source]
Do you have a better way to measure LLMs? Measurement implies quantitative evaluation... which is what a benchmark is.
replies(1): >>46236704 #
4. Wowfunhappy ◴[] No.46236704[source]
I don’t have a good way to measure them, but I think they should be evaluated more like how we evaluate movies, or restaurants. Namely, experienced critics try them and write reviews.
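
A rough sketch of what that critic-style evaluation might look like if automated, with ask_judge as a hypothetical stand-in for an experienced reviewer; the output is a review, not a leaderboard number:

    # Sketch: critic-style evaluation, closer to a review than a score.
    # `ask_judge` is a hypothetical stand-in for an experienced reviewer
    # (human or model); nothing here is a real library API.

    RUBRIC = (
        "Write a short review of this response: what it gets right, "
        "what it gets wrong, and whether you'd recommend it. "
        "End with a 1-5 rating."
    )

    def critic_review(ask_judge, prompt, response):
        """Return free-form prose; the rating is secondary to the review."""
        return ask_judge(f"{RUBRIC}\n\nPrompt: {prompt}\n\nResponse: {response}")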