
GPT-5.2 (openai.com)

1019 points | atgctg | 3 comments
josalhor ◴[] No.46235005[source]
From GPT-5.1 Thinking:

ARC-AGI-2: 17.6% -> 52.9%

SWE-bench Verified: 76.3% -> 80%

That's pretty good!

replies(7): >>46235062 #>>46235070 #>>46235153 #>>46235160 #>>46235180 #>>46235421 #>>46236242 #
verdverm ◴[] No.46235062[source]
We're also in benchmark-saturation territory. I've heard it speculated that Anthropic emphasizes benchmarks less in its publications because, internally, it cares far less about them than about making a model that works well day-to-day.
replies(5): >>46235126 #>>46235266 #>>46235466 #>>46235492 #>>46235583 #
stego-tech ◴[] No.46235583[source]
These models still consistently fail the only benchmark that matters: if I give you a task, can you complete it successfully without making shit up?

Thus far they all fail. Code outputs don't run, or variables aren't captured correctly, or hallucinations are stated as fact rather than flagged as suspect or “I don't know.”

It’s 2000s PC gaming all over again (“gotta game the benchmark!”).

replies(2): >>46236156 #>>46236484 #
snet0 ◴[] No.46236484{3}[source]
To say that these models won't solve problems is unfair. Claude Code, with Opus 4.5, has solved plenty of problems for me.

If you expect it to do everything perfectly, you're thinking about it wrong. If you can't get it to do anything perfectly, you're using it wrong.

replies(1): >>46236543 #
jacquesm ◴[] No.46236543{4}[source]
That means you're probably asking it to do very simple things.
replies(3): >>46236778 #>>46236779 #>>46236916 #
1. camdenreslink ◴[] No.46236778{5}[source]
Sometimes you do need to break a complex thing down (as a human) into smaller, simple things, and then ask the LLM to do each of those simple things. I find it still saves time.
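
A minimal sketch of that workflow, assuming a hypothetical ask_llm() helper standing in for whatever chat client you actually use; the subtasks are illustrative, not from the thread:

    # Hypothetical helper: wrap your provider's chat-completion client here.
    def ask_llm(prompt: str) -> str:
        raise NotImplementedError("plug in a real client")

    # Human-authored decomposition: one complex task split into simple subtasks.
    subtasks = [
        "Write a function that parses an ISO-8601 date string into a datetime.",
        "Write a function that groups a list of records by calendar week.",
        "Write unit tests covering empty input for both functions.",
    ]

    # Ask each simple subtask on its own, carrying prior answers as context.
    context = ""
    for task in subtasks:
        answer = ask_llm(f"{context}\n\nTask: {task}")
        context += f"\n\nTask: {task}\nAnswer: {answer}"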
replies(1): >>46237448 #
2. ragequittah ◴[] No.46237448[source]
Or, what often works is having the LLM itself break the task into simpler steps and then running them one by one. Models decompose problems fairly well; they just don't reliably do it unless you explicitly prompt them to.
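
A sketch of that two-phase prompting, again with a hypothetical ask_llm() stand-in for a real client; the goal string and prompts are illustrative:

    def ask_llm(prompt: str) -> str:
        """Hypothetical wrapper around whatever chat API you use."""
        raise NotImplementedError

    goal = "Add retry-with-backoff to our HTTP client module."

    # Phase 1: explicitly prompt the model to decompose the problem first.
    plan = ask_llm(
        "Break this task into a numbered list of small steps. "
        "Output only the list.\n\nTask: " + goal
    )
    # Parse lines like "1. Do X" into bare step descriptions.
    steps = [line.split(". ", 1)[1] for line in plan.splitlines() if ". " in line]

    # Phase 2: run the steps one by one, feeding back what is already done.
    done: list[str] = []
    for step in steps:
        result = ask_llm(
            "Steps completed so far:\n" + "\n".join(done) + "\n\nNow do: " + step
        )
        done.append(f"{step}\n{result}")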
replies(1): >>46237610 #
3. jacquesm ◴[] No.46237610[source]
Yes, but for that you have to know that the output it gave you is wrong in the first place, and if you already know that, you arguably didn't need the AI to begin with...