
GPT-5.2

(openai.com)
1084 points by atgctg | source
josalhor ◴[] No.46235005[source]
Compared to GPT-5.1 Thinking:

ARC AGI v2: 17.6% -> 52.9%

SWE-bench Verified: 76.3% -> 80%

That's pretty good!
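(A quick sanity check on those numbers, as a sketch: the helper below is hypothetical, not from the comment, and just converts the reported before/after scores into absolute percentage-point gains and relative gains over the baseline.)

```python
# Hypothetical helper: express each reported benchmark jump as an
# absolute gain (percentage points) and a relative gain over the baseline.
def gains(before: float, after: float) -> tuple[float, float]:
    absolute = after - before              # percentage points
    relative = (after - before) / before   # fraction of the baseline score
    return absolute, relative

arc_abs, arc_rel = gains(17.6, 52.9)   # ARC AGI v2 scores from the comment
swe_abs, swe_rel = gains(76.3, 80.0)   # SWE-bench Verified scores

print(f"ARC AGI v2: +{arc_abs:.1f} pts ({arc_rel:.0%} relative)")
print(f"SWE Verified: +{swe_abs:.1f} pts ({swe_rel:.1%} relative)")
```

The contrast is why the ARC number draws attention: roughly a tripling of the baseline score, versus a few points on an already-high SWE-bench score.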

replies(7): >>46235062 #>>46235070 #>>46235153 #>>46235160 #>>46235180 #>>46235421 #>>46236242 #
verdverm ◴[] No.46235062[source]
We're also in benchmark-saturation territory. I've heard it speculated that Anthropic emphasizes benchmarks less in its publications because internally it cares far less about them than about making a model that works well day-to-day.
replies(5): >>46235126 #>>46235266 #>>46235466 #>>46235492 #>>46235583 #
stego-tech ◴[] No.46235583[source]
These models still consistently fail the only benchmark that matters: if I give you a task, can you complete it successfully without making shit up?

Thus far they all fail. Code outputs don't run, variables aren't captured correctly, or hallucinations are stated as fact rather than flagged as suspect or "I don't know."

It’s 2000’s PC gaming all over again (“gotta game the benchmark!”).

replies(2): >>46236156 #>>46236484 #
snet0 ◴[] No.46236484[source]
To say that a model won't solve a problem is unfair. Claude Code, with Opus 4.5, has solved plenty of problems for me.

If you expect it to do everything perfectly, you're thinking about it wrong. If you can't get it to do anything perfectly, you're using it wrong.

replies(1): >>46236543 #
jacquesm ◴[] No.46236543[source]
That means you're probably asking it to do very simple things.
replies(4): >>46236778 #>>46236779 #>>46236916 #>>46243575 #
djeastm ◴[] No.46243575[source]
Possibly, but a lot of value comes from doing very simple things faster.