
GPT-5.2 (openai.com)

1019 points | atgctg | 3 comments
josalhor ◴[] No.46235005[source]
From GPT-5.1 Thinking:

ARC-AGI-2: 17.6% -> 52.9%

SWE-bench Verified: 76.3% -> 80%

That's pretty good!

replies(7): >>46235062 #>>46235070 #>>46235153 #>>46235160 #>>46235180 #>>46235421 #>>46236242 #
verdverm ◴[] No.46235062[source]
We're also in benchmark-saturation territory. I've heard it speculated that Anthropic emphasizes benchmarks less in its publications because, internally, it cares far less about them than about making a model that works well day-to-day.
replies(5): >>46235126 #>>46235266 #>>46235466 #>>46235492 #>>46235583 #
stego-tech ◴[] No.46235583[source]
These models still consistently fail the only benchmark that matters: if I give you a task, can you complete it successfully without making shit up?

Thus far they all fail. Code outputs don't run, or variables aren't captured correctly, or hallucinations are stated as fact rather than flagged as suspect or “I don't know.”

It’s 2000s PC gaming all over again (“gotta game the benchmark!”).

replies(2): >>46236156 #>>46236484 #
snet0 ◴[] No.46236484{3}[source]
To say that these models won't solve problems is unfair. Claude Code, with Opus 4.5, has solved plenty of problems for me.

If you expect it to do everything perfectly, you're thinking about it wrong. If you can't get it to do anything perfectly, you're using it wrong.

replies(1): >>46236543 #
jacquesm ◴[] No.46236543{4}[source]
That means you're probably asking it to do very simple things.
replies(3): >>46236778 #>>46236779 #>>46236916 #
1. camdenreslink ◴[] No.46236778{5}[source]
Sometimes you do need to break a complex thing down (as a human) into smaller, simple things, and then ask the LLM to do each of those simple things. I find it still saves time.
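
A minimal sketch of that workflow, assuming a hypothetical ask_llm() helper standing in for whatever chat client you actually use; the subtasks are illustrative, not from the thread:

    # Hypothetical helper: wrap your provider's chat-completion client here.
    def ask_llm(prompt: str) -> str:
        raise NotImplementedError("plug in a real client")

    # Human-authored decomposition: one complex task split into simple subtasks.
    subtasks = [
        "Write a function that parses an ISO-8601 date string into a datetime.",
        "Write a function that groups a list of records by calendar week.",
        "Write unit tests covering empty input for both functions.",
    ]

    # Ask each simple subtask on its own, carrying prior answers as context.
    context = ""
    for task in subtasks:
        answer = ask_llm(f"{context}\n\nTask: {task}")
        context += f"\n\nTask: {task}\nAnswer: {answer}"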
replies(1): >>46237448 #
2. ragequittah ◴[] No.46237448[source]
Or, what often works is having the LLM itself break the task into simpler steps and then running them one by one. Models decompose problems fairly well; they just don't reliably do it unless you explicitly prompt them to.
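
A sketch of that two-phase prompting, again with a hypothetical ask_llm() stand-in for a real client; the goal string and prompts are illustrative:

    def ask_llm(prompt: str) -> str:
        """Hypothetical wrapper around whatever chat API you use."""
        raise NotImplementedError

    goal = "Add retry-with-backoff to our HTTP client module."

    # Phase 1: explicitly prompt the model to decompose the problem first.
    plan = ask_llm(
        "Break this task into a numbered list of small steps. "
        "Output only the list.\n\nTask: " + goal
    )
    # Parse lines like "1. Do X" into bare step descriptions.
    steps = [line.split(". ", 1)[1] for line in plan.splitlines() if ". " in line]

    # Phase 2: run the steps one by one, feeding back what is already done.
    done: list[str] = []
    for step in steps:
        result = ask_llm(
            "Steps completed so far:\n" + "\n".join(done) + "\n\nNow do: " + step
        )
        done.append(f"{step}\n{result}")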
replies(1): >>46237610 #
3. jacquesm ◴[] No.46237610[source]
Yes, but for that you have to know that the output it gave you is wrong in the first place, and if you already know that, you arguably didn't need the AI to begin with...