OpenAI o3 and o4-mini

(openai.com)

555 points maheshrijal | 2 comments | 16 Apr 25 17:01 UTC

georgewsinger ◴[16 Apr 25 17:20 UTC] No.43707951[source]▶

Very impressive! But under arguably the most important benchmark -- SWE-bench Verified for real-world coding tasks -- Claude 3.7 still remains the champion.[1]

Incredible how resilient Claude models have been for best-in-coding class.

[1] But by only about 1%, and inclusive of Claude's "custom scaffold" augmentation (which in practice I assume almost no one uses?). The new OpenAI models might still be effectively best in class now (or likely beating Claude with similar augmentation?).

replies(7): >>43708008 #>>43708068 #>>43708249 #>>43708545 #>>43709203 #>>43713202 #>>43716307 #

jjani ◴[16 Apr 25 17:27 UTC] No.43708068[source]▶

>>43707951 #

Gemini 2.5 Pro is now widely considered superior to 3.7 Sonnet by heavy users, but it doesn't have an SWE-bench score. That shows that looking at one such benchmark isn't very telling. Its main advantage over Sonnet is that it's better at using a large amount of context, which is enormously helpful during coding tasks.

Sonnet is still an incredibly impressive model as it held the crown for 6 months, which may as well be a decade with the current pace of LLM improvement.

replies(6): >>43708198 #>>43709336 #>>43710444 #>>43712513 #>>43714843 #>>43720979 #

unsupp0rted ◴[16 Apr 25 17:37 UTC] No.43708198[source]▶

>>43708068 #

Main advantage over Sonnet is Gemini 2.5 doesn't try to make a bunch of unrelated changes like it's rewriting my project from scratch.

replies(6): >>43708296 #>>43708338 #>>43708390 #>>43708580 #>>43708811 #>>43709225 #

itsmevictor ◴[16 Apr 25 17:45 UTC] No.43708296[source]▶

>>43708198 #

I find Gemini 2.5 truly remarkable and overall better than Claude, which I was a big fan of.

replies(1): >>43708611 #

enraged_camel ◴[16 Apr 25 18:14 UTC] No.43708611[source]▶

>>43708296 #

Still doesn't work well in Cursor unfortunately.

replies(3): >>43709559 #>>43710997 #>>43712870 #

1. plantain ◴[16 Apr 25 22:20 UTC] No.43710997[source]▶

>>43708611 #

Working fine here. What problems do you see?

replies(1): >>43711498 #

2. michaelbarton ◴[16 Apr 25 23:38 UTC] No.43711498[source]▶

>>43710997 (TP) #

Not the OP, but I believe they could be referring to the fact that it’s not supported in edit mode yet, only agent mode.

So far that hasn’t been too much of a roadblock for me, though I still find that Gemini struggles overall with more obscure issues, such as SQL errors in dbt.
