From GPT 5.1 Thinking:
ARC-AGI-2: 17.6% -> 52.9%
SWE-bench Verified: 76.3% -> 80%
That's pretty good!
replies(7):
That's still benchmarking, of course, but it doesn't use any of the well-known public ones.
To think that Anthropic is not being intentional and quantitative in their model building, just because they care less about saturated benchmaxxing, is to miss the forest for the trees.
Nathan is at Ai2, which is all about open-sourcing the process, experience, and learnings along the way.