(openai.com)

https://platform.openai.com/docs/guides/latest-model

System card: https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944...


josalhor ◴[11 Dec 25 18:24 UTC] No.46235005[source]▶

>>46234788 (OP) #

From GPT 5.1 Thinking:

ARC AGI v2: 17.6% -> 52.9%

SWE Verified: 76.3% -> 80%

That's pretty good!

replies(7): >>46235062 #>>46235070 #>>46235153 #>>46235160 #>>46235180 #>>46235421 #>>46236242 #

verdverm ◴[11 Dec 25 18:28 UTC] No.46235062[source]▶

>>46235005 #

We're also in benchmark saturation territory. I've heard it speculated that Anthropic emphasizes benchmarks less in their publications because internally they don't care about them nearly as much as making a model that works well day-to-day.

replies(5): >>46235126 #>>46235266 #>>46235466 #>>46235492 #>>46235583 #

1. Mistletoe ◴[11 Dec 25 18:43 UTC] No.46235266[source]▶

>>46235062 #

How do you measure whether it works better day to day without benchmarks?

replies(3): >>46235305 #>>46235348 #>>46235398 #

2. standardUser ◴[11 Dec 25 18:46 UTC] No.46235305[source]▶

>>46235266 (TP) #

Subscriptions.

replies(1): >>46236136 #

3. bulbar ◴[11 Dec 25 18:48 UTC] No.46235348[source]▶

>>46235266 (TP) #

Manually labeling answers, maybe? A lot of infrastructure has been built around it, since it's been heavily used for two decades, and it's relatively cheap.

That's still benchmarking of course, but not utilizing any of the well-known / public ones.

4. verdverm ◴[11 Dec 25 18:51 UTC] No.46235398[source]▶

>>46235266 (TP) #

Internal evals. Big AI certainly has good proprietary training and eval data; it's one reason why their models are better.

replies(1): >>46235532 #

5. aydyn ◴[11 Dec 25 18:58 UTC] No.46235532[source]▶

>>46235398 #

Then publish the results of those internal evals. Public benchmark saturation isn't an excuse to be un-quantitative.

replies(1): >>46235607 #

6. verdverm ◴[11 Dec 25 19:03 UTC] No.46235607{3}[source]▶

>>46235532 #

How would published numbers be useful without knowing what underlying data is being used to test and evaluate the models? The data is proprietary for a reason.

To think that Anthropic is not being intentional and quantitative in their model building, because they care less for the saturated benchmaxxing, is to miss the forest for the trees

replies(1): >>46236582 #

7. mrguyorama ◴[11 Dec 25 19:39 UTC] No.46236136[source]▶

>>46235305 #

Ah yes, humans are famously empirical in their behavior and we definitely do not have direct evidence of the "best" sports players being much more likely than the average to be superstitious or do things like wear "lucky underwear" or buy right into scam bracelets that "give you more balance" using a holographic sticker.

8. aydyn ◴[11 Dec 25 20:17 UTC] No.46236582{4}[source]▶

>>46235607 #

Do you know everything that exists in public benchmarks?

They can give a description of what their metrics are without giving away anything proprietary.

replies(1): >>46238542 #

9. verdverm ◴[11 Dec 25 23:00 UTC] No.46238542{5}[source]▶

>>46236582 #

I'd recommend watching the video Nathan Lambert dropped yesterday on Olmo 3 Thinking. You'll learn there are a lot of places where even descriptions of proprietary testing regimes would give away some secret sauce.

Nathan is at Ai2 which is all about open sourcing the process, experience, and learnings along the way

replies(1): >>46241985 #

10. aydyn ◴[12 Dec 25 08:18 UTC] No.46241985{6}[source]▶

>>46238542 #

Thanks for the reference, I'll check it out. But it doesn't really take away from the point I am making. If a level of description would give away proprietary information, then go one level up to a vaguer description. How to describe things at the proper level is more of a social problem than a technical one.


GPT-5.2