From GPT 5.1 Thinking:
ARC AGI v2: 17.6% -> 52.9%
SWE Verified: 76.3% -> 80%
That's pretty good!
replies(7):
ARC AGI v2: 17.6% -> 52.9%
SWE Verified: 76.3% -> 80%
That's pretty good!
Thus far they all fail. Code outputs don’t run, or variables aren’t captured correctly, or hallucinations are stated as factual rather than suspect or “I don’t know.”
It’s 2000’s PC gaming all over again (“gotta game the benchmark!”).
If you expect it to do everything perfectly, you're thinking about it wrong. If you can't get it to do anything perfectly, you're using it wrong.
the real thing is are you or we getting an ROI and the answer is increasingly more yeses on more problems, this trend is not looking to plateau as we step up the complexity ladder to agentic system