
GPT-5.2

(openai.com)
1053 points by atgctg | 2 comments
josalhor No.46235005
Compared with GPT 5.1 Thinking:

ARC AGI v2: 17.6% -> 52.9%

SWE Verified: 76.3% -> 80%

That's pretty good!
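As a quick sanity check on those deltas, here is a minimal sketch (benchmark names and scores taken from the comment above; the dict names are just illustrative) that computes the absolute and relative gains:

```python
# Scores quoted in the comment: GPT 5.1 Thinking -> GPT-5.2.
gpt51 = {"ARC-AGI v2": 17.6, "SWE-bench Verified": 76.3}
gpt52 = {"ARC-AGI v2": 52.9, "SWE-bench Verified": 80.0}

for bench, old in gpt51.items():
    new = gpt52[bench]
    abs_gain = new - old        # percentage points gained
    rel_gain = new / old        # multiplicative improvement
    print(f"{bench}: +{abs_gain:.1f} pts ({rel_gain:.2f}x)")
```

The ARC-AGI v2 jump is roughly a 3x relative improvement, while the SWE-bench Verified gain is under 5% relative, which is one reason people read the former as the headline number.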

replies(7): >>46235062 #>>46235070 #>>46235153 #>>46235160 #>>46235180 #>>46235421 #>>46236242 #
verdverm No.46235062
We're also in benchmark-saturation territory. I've heard it speculated that Anthropic emphasizes benchmarks less in its publications because internally it cares far less about them than about making a model that works well day-to-day.
replies(5): >>46235126 #>>46235266 #>>46235466 #>>46235492 #>>46235583 #
stego-tech No.46235583
These models still consistently fail the only benchmark that matters: if I give you a task, can you complete it successfully without making shit up?

Thus far they all fail. Code outputs don’t run, variables aren’t captured correctly, or hallucinations are stated as fact rather than flagged as suspect or “I don’t know.”

It’s 2000s PC gaming all over again (“gotta game the benchmark!”).

replies(2): >>46236156 #>>46236484 #
snet0 No.46236484
Saying these models can't solve a problem is unfair. Claude Code, with Opus 4.5, has solved plenty of problems for me.

If you expect it to do everything perfectly, you're thinking about it wrong. If you can't get it to do anything perfectly, you're using it wrong.

replies(1): >>46236543 #
jacquesm No.46236543
That means you're probably asking it to do very simple things.
replies(4): >>46236778 #>>46236779 #>>46236916 #>>46243575 #
snet0 No.46236916
If you define "simple thing" as "thing an AI can't do", then yes. Everyone just shifts the goalposts in these conversations, it's infuriating.
replies(1): >>46237055 #
ACCount37 No.46237055
Come on. If we weren't shifting the goalposts, we would have burned through 90% of the entire supply of them back in 2022!
replies(1): >>46237748 #
baq No.46237748
It’s less a matter of shifting goalposts and more a very jagged frontier of capabilities.