    GPT-5.2 (openai.com)
    1019 points by atgctg | 13 comments
    1. minadotcom ◴[] No.46235074[source]
    They used to compare to competing models from Anthropic, Google DeepMind, DeepSeek, etc. Seems that now they only compare to their own models. Does this mean that the GPT-series is performing worse than its competitors (given the "code red" at OpenAI)?
    replies(4): >>46235094 #>>46235110 #>>46235145 #>>46236816 #
    2. poormathskills ◴[] No.46235094[source]
    OpenAI has never compared their models to models from other labs in their blog posts. Open literally any past model launch post to see that.
    replies(1): >>46235745 #
    3. tabletcorry ◴[] No.46235110[source]
    The matrix required for a fair comparison is getting too complicated, since you have to compare chat/thinking/pro against an array of Anthropic and Google models.

    But they publish all the same numbers, so you can make the full comparison yourself, if you want to.

    4. Tiberium ◴[] No.46235145[source]
    They did compare it to other models: https://x.com/OpenAI/status/1999182104362668275

    https://i.imgur.com/e0iB8KC.png

    replies(3): >>46235919 #>>46238146 #>>46241683 #
    5. boole1854 ◴[] No.46235745[source]
    https://openai.com/index/hello-gpt-4o/

    I see evaluations compared against Claude, Gemini, and Llama on that GPT-4o post.

    replies(1): >>46236028 #
    6. enlyth ◴[] No.46235919[source]
    This looks cherry-picked. For example, Claude Opus had a higher score on SWE-bench Verified, so they conveniently left it out. Also, GDPval is literally a benchmark made by OpenAI.
    replies(2): >>46237361 #>>46239897 #
    7. kgwgk ◴[] No.46236028{3}[source]
    “You are absolutely right, and I apologize for the confusion.”
    8. Workaccount2 ◴[] No.46236816[source]
    They are taking a page out of Apple's book.

    Apple only compares to themselves. They don't even acknowledge the existence of others.

    9. minadotcom ◴[] No.46237361{3}[source]
    agreed.
    10. whimsicalism ◴[] No.46238146[source]
    uh oh, where did SWE bench go :D
    replies(1): >>46240109 #
    11. tobias2014 ◴[] No.46239897{3}[source]
    And who believes that the difference between 91.9% and 92.4% is significant in these benchmarks? Clearly these have margins of error that are swept under the rug.
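
    For a rough sense of scale, here is a back-of-the-envelope sketch (assuming a benchmark of roughly 500 items, e.g. SWE-bench Verified's 500 tasks, and treating the score as a simple binomial proportion):

        import math

        def margin_of_error(score: float, n_items: int, z: float = 1.96) -> float:
            # Approximate 95% margin of error for a pass rate measured on n_items
            # tasks, treating each task as an independent Bernoulli trial.
            se = math.sqrt(score * (1.0 - score) / n_items)
            return z * se

        n = 500  # assumed benchmark size
        for score in (0.919, 0.924):
            print(f"{score:.1%} +/- {margin_of_error(score, n):.1%}")

    Under those assumptions each score carries roughly a +/-2.3 to 2.4 percentage-point interval, so a 0.5-point gap is well within the noise.
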
    12. whimsicalism ◴[] No.46240109{3}[source]
    maybe they will release with gpt-5.2-codex
    13. sergdigon ◴[] No.46241683[source]
    The fact that the post compares their reasoning model against Gemini 3 Pro (the "non-reasoning" model) and not Gemini 3 Pro Deep Think (the reasoning one) is quite nasty. If you compare GPT-5.2 Thinking to Gemini 3 Pro Deep Think, the scores are quite similar (sometimes one is better, sometimes the other).