(openai.com)

1019 points atgctg | 1 comments | 11 Dec 25 18:04 UTC

https://platform.openai.com/docs/guides/latest-model

System card: https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944...

josalhor ◴[11 Dec 25 18:24 UTC] No.46235005[source]▶

>>46234788 (OP) #

From GPT 5.1 Thinking:

ARC AGI v2: 17.6% -> 52.9%

SWE Verified: 76.3% -> 80%

That's pretty good!

replies(7): >>46235062 #>>46235070 #>>46235153 #>>46235160 #>>46235180 #>>46235421 #>>46236242 #

causal ◴[11 Dec 25 18:37 UTC] No.46235180[source]▶

>>46235005 #

That ARC AGI score is a little suspicious. That's a really tough benchmark for AI. Curious if there were improvements to the test harness, because that's a wild jump in general problem-solving ability for an incremental update.

replies(2): >>46235387 #>>46238284 #

1. woeirua ◴[11 Dec 25 22:37 UTC] No.46238284[source]▶

>>46235180 #

They're clearly building better training datasets and doing extensive RL on these benchmarks over time. The out-of-distribution performance is still awful.

GPT-5.2