
GPT-5.2

(openai.com)
1053 points by atgctg | 4 comments
josalhor No.46235005
Compared to GPT 5.1 Thinking:

ARC AGI v2: 17.6% -> 52.9%

SWE-bench Verified: 76.3% -> 80%

That's pretty good!

replies(7): >>46235062 #>>46235070 #>>46235153 #>>46235160 #>>46235180 #>>46235421 #>>46236242 #
verdverm No.46235062
We're also in benchmark saturation territory. I've heard it speculated that Anthropic emphasizes benchmarks less in their publications because internally they don't care about them nearly as much as making a model that works well day-to-day.
replies(5): >>46235126 #>>46235266 #>>46235466 #>>46235492 #>>46235583 #
1. quantumHazer No.46235126
Seems pretty false if you look at the model card and website of Opus 4.5, which is… (checks notes) their latest model.
replies(1): >>46235373 #
2. verdverm No.46235373
Building a good model generally means it will do well on benchmarks too. The point of the speculation is that Anthropic is not focused on benchmaxxing, which is why they have models people like to use day-to-day.

I use Gemini. Anthropic stole $50 from me (they expired and kept my prepaid credits) and I haven't forgiven them for it yet, but people rave about Claude for coding, so I may try the model again through Vertex AI...

The person who made the speculation was, I believe, talking more about blog posts and media statements than model cards. Most AI announcements come with benchmark touting; Anthropic supposedly does little of this in its announcements. I haven't seen or gathered the data to know what is true.

replies(1): >>46236265 #
3. elcritch No.46236265
You could try the Codex CLI. I prefer it over Claude Code now, but only slightly.
replies(1): >>46236368 #
4. verdverm No.46236368
No thanks, not touching anything Oligarchy Altman is behind