    GPT-5.2 (openai.com)
    1019 points by atgctg | 13 comments
    1. minadotcom ◴[] No.46235074[source]
    They used to compare to competing models from Anthropic, Google DeepMind, DeepSeek, etc. Seems that now they only compare to their own models. Does this mean that the GPT-series is performing worse than its competitors (given the "code red" at OpenAI)?
    replies(4): >>46235094 #>>46235110 #>>46235145 #>>46236816 #
    2. poormathskills ◴[] No.46235094[source]
    OpenAI has never compared their models to models from other labs in their blog posts. Open literally any past model launch post to see that.
    replies(1): >>46235745 #
    3. tabletcorry ◴[] No.46235110[source]
    The matrix required for a fair comparison is getting too complicated, since you have to compare chat/thinking/pro against an array of Anthropic and Google models.

    But they publish all the same numbers, so you can make the full comparison yourself, if you want to.

    4. Tiberium ◴[] No.46235145[source]
    They did compare it to other models: https://x.com/OpenAI/status/1999182104362668275

    https://i.imgur.com/e0iB8KC.png

    replies(3): >>46235919 #>>46238146 #>>46241683 #
    5. boole1854 ◴[] No.46235745[source]
    https://openai.com/index/hello-gpt-4o/

    I see evaluations compared against Claude, Gemini, and Llama on that GPT-4o post.

    replies(1): >>46236028 #
    6. enlyth ◴[] No.46235919[source]
    This looks cherry-picked. For example, Claude Opus had a higher score on SWE-bench Verified, so they conveniently left it out. Also, GDPval is literally a benchmark made by OpenAI.
    replies(2): >>46237361 #>>46239897 #
    7. kgwgk ◴[] No.46236028{3}[source]
    “You are absolutely right, and I apologize for the confusion.”
    8. Workaccount2 ◴[] No.46236816[source]
    They are taking a page out of Apple's book.

    Apple only compares to themselves. They don't even acknowledge the existence of others.

    9. minadotcom ◴[] No.46237361{3}[source]
    agreed.
    10. whimsicalism ◴[] No.46238146[source]
    uh oh, where did SWE bench go :D
    replies(1): >>46240109 #
    11. tobias2014 ◴[] No.46239897{3}[source]
    And who believes that the difference between 91.9% and 92.4% is significant in these benchmarks? Clearly these have margins of error that are swept under the rug.
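
    For a rough sense of scale, here is a back-of-the-envelope sketch (assuming a benchmark of roughly 500 items, e.g. SWE-bench Verified's 500 tasks, and treating the score as a simple binomial proportion):

        import math

        def margin_of_error(score: float, n_items: int, z: float = 1.96) -> float:
            # Approximate 95% margin of error for a pass rate measured on n_items
            # tasks, treating each task as an independent Bernoulli trial.
            se = math.sqrt(score * (1.0 - score) / n_items)
            return z * se

        n = 500  # assumed benchmark size
        for score in (0.919, 0.924):
            print(f"{score:.1%} +/- {margin_of_error(score, n):.1%}")

    Under those assumptions each score carries roughly a +/-2.3 to 2.4 percentage-point interval, so a 0.5-point gap is well within the noise.
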
    12. whimsicalism ◴[] No.46240109{3}[source]
    maybe they will release with gpt-5.2-codex
    13. sergdigon ◴[] No.46241683[source]
    The fact that the post compares their reasoning model against Gemini 3 Pro (the "non-reasoning" model) and not Gemini 3 Pro Deep Think (the reasoning one) is quite nasty. If you compare GPT-5.2 Thinking to Gemini 3 Pro Deep Think, the scores are quite similar (sometimes one is better, sometimes the other).