
GPT-5.2 (openai.com)
1019 points by atgctg | 7 comments
minadotcom ◴[] No.46235074[source]
They used to compare to competing models from Anthropic, Google DeepMind, DeepSeek, etc. Seems that now they only compare to their own models. Does this mean that the GPT-series is performing worse than its competitors (given the "code red" at OpenAI)?
replies(4): >>46235094 #>>46235110 #>>46235145 #>>46236816 #
1. Tiberium ◴[] No.46235145[source]
They did compare it to other models: https://x.com/OpenAI/status/1999182104362668275

https://i.imgur.com/e0iB8KC.png

replies(3): >>46235919 #>>46238146 #>>46241683 #
2. enlyth ◴[] No.46235919[source]
This looks cherry-picked: for example, Claude Opus had a higher score on SWE-Bench Verified, so they conveniently left it out. Also, GDPval is literally a benchmark made by OpenAI.
replies(2): >>46237361 #>>46239897 #
3. minadotcom ◴[] No.46237361[source]
agreed.
4. whimsicalism ◴[] No.46238146[source]
uh oh, where did SWE bench go :D
replies(1): >>46240109 #
5. tobias2014 ◴[] No.46239897[source]
And who believes that the difference between 91.9% and 92.4% is significant in these benchmarks? Clearly these have margins of error that are swept under the rug.
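A quick back-of-the-envelope check (just a sketch: it assumes n ≈ 500 independent problems, roughly SWE-Bench Verified's size, since the post doesn't state the actual n per benchmark):

    # Is 92.4% vs 91.9% distinguishable from sampling noise on n items?
    # n = 500 is an assumption (about the size of SWE-Bench Verified).
    import math

    def score_diff_z(p1, p2, n):
        # Two-proportion z-statistic. Treating the two runs as independent,
        # even though they score the same items, if anything overstates the
        # noise floor here.
        se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
        return (p1 - p2) / se

    print(score_diff_z(0.924, 0.919, 500))  # ~0.29, far below the ~1.96
                                            # needed for 95% significance

A single score of ~92% on 500 items already carries a standard error of about ±1.2 percentage points, so a 0.5-point gap is well inside the noise.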
6. whimsicalism ◴[] No.46240109[source]
maybe they will release it with gpt-5.2-codex
7. sergdigon ◴[] No.46241683[source]
It's quite nasty that the post compares their reasoning model against Gemini 3 Pro (the "non-reasoning" model) rather than Gemini 3 Pro Deep Think (the reasoning one). If you compare GPT-5.2 Thinking to Gemini 3 Pro Deep Think, the scores are quite similar (sometimes one is better, sometimes the other).