    555 points maheshrijal | 12 comments
    1. brap ◴[] No.43707838[source]
    Where's the comparison with Gemini 2.5 Pro?
    replies(3): >>43707846 #>>43707897 #>>43708606 #
    2. kridsdale1 ◴[] No.43707846[source]
    Exactly.
    3. gallerdude ◴[] No.43707897[source]
    For coding, I like the Aider polyglot benchmark, since it covers multiple programming languages.

    Gemini 2.5 Pro got 72.9%

    o3 high gets 81.3%, o4-mini high gets 68.9%

    replies(4): >>43708090 #>>43708632 #>>43709557 #>>43709763 #
    4. asadm ◴[] No.43708090[source]
    thanks
    5. SweetSoftPillow ◴[] No.43708606[source]
    Some sources mention that o3 scores 63.8 on SWE-bench, while Gemini 2.5 Pro scores 69.1.

    On most other benchmarks, they seem to perform about the same, which is bad news for o3 because it's much more expensive and slower than Gemini 2.5 Pro, and it also hides its reasoning while Gemini shows everything.

    We can probably just stick with Gemini 2.5 Pro, since it offers the best combination of price, quality, and speed. No need to worry about finding a replacement (for now).

    replies(1): >>43712286 #
    6. vessenes ◴[] No.43708632[source]
    Where did you find those o3 high numbers? https://aider.chat/docs/leaderboards/ currently has Gemini 2.5 Pro as the leader at, as you say, 72.9%.
    replies(1): >>43708984 #
    7. re-thc ◴[] No.43708984{3}[source]
    It's in the OpenAI article (the OP), i.e. OpenAI ran Aider themselves.
    replies(1): >>43730783 #
    8. croemer ◴[] No.43709557[source]
    Isn't it easy to train on the specific Exercism exercises that this benchmark uses?
    9. jumpCastle ◴[] No.43709763[source]
    It was a good benchmark until it entered the training set.
    10. usaar333 ◴[] No.43712286[source]
    > Some sources mention that o3 scores 63.8 on SWE-bench, while Gemini 2.5 Pro scores 69.1.

    It's the opposite. o3 scores higher.

    replies(1): >>43714684 #
    11. SweetSoftPillow ◴[] No.43714684{3}[source]
    On SWE-bench? Show your source.
    12. vessenes ◴[] No.43730783{4}[source]
    Update: the leaderboard now has o3 high + 4o at the top of the charts with 82.7%. This is a) amazing and b) 20x more expensive than Gemini.