(openai.com)

1019 points atgctg | 2 comments | 11 Dec 25 18:04 UTC

https://platform.openai.com/docs/guides/latest-model

System card: https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944...

josalhor ◴[11 Dec 25 18:24 UTC] No.46235005[source]▶

>>46234788 (OP) #

Compared with GPT-5.1 Thinking:

ARC-AGI-2: 17.6% -> 52.9%

SWE-bench Verified: 76.3% -> 80%

That's pretty good!

replies(7): >>46235062 #>>46235070 #>>46235153 #>>46235160 #>>46235180 #>>46235421 #>>46236242 #

verdverm ◴[11 Dec 25 18:28 UTC] No.46235062[source]▶

>>46235005 #

We're also in benchmark-saturation territory. I've heard speculation that Anthropic emphasizes benchmarks less in its publications because, internally, they don't care about them nearly as much as about making a model that works well day to day.

replies(5): >>46235126 #>>46235266 #>>46235466 #>>46235492 #>>46235583 #

Mistletoe ◴[11 Dec 25 18:43 UTC] No.46235266[source]▶

>>46235062 #

How do you measure whether it works better day to day without benchmarks?

replies(3): >>46235305 #>>46235348 #>>46235398 #

1. standardUser ◴[11 Dec 25 18:46 UTC] No.46235305{3}[source]▶

>>46235266 #

Subscriptions.

replies(1): >>46236136 #

2. mrguyorama ◴[11 Dec 25 19:39 UTC] No.46236136[source]▶

>>46235305 (TP) #

Ah yes, humans are famously empirical in their behavior and we definitely do not have direct evidence of the "best" sports players being much more likely than the average to be superstitious or do things like wear "lucky underwear" or buy right into scam bracelets that "give you more balance" using a holographic sticker.

GPT-5.2