
555 points | maheshrijal | 2 comments
georgewsinger No.43707951
Very impressive! But on arguably the most important benchmark -- SWE-bench Verified, for real-world coding tasks -- Claude 3.7 remains the champion.[1]

Incredible how resilient the Claude models have been at staying best-in-class for coding.

[1] But only by about 1%, and that includes Claude's "custom scaffold" augmentation (which I assume almost no one uses in practice). The new OpenAI models might still be effectively best in class now (and would likely beat Claude with similar augmentation).

replies(7): >>43708008 #>>43708068 #>>43708249 #>>43708545 #>>43709203 #>>43713202 #>>43716307 #
jjani No.43708068
Gemini 2.5 Pro is now widely considered superior to 3.7 Sonnet by heavy users, but it doesn't have an SWE-bench score, which shows that looking at a single benchmark isn't very telling. Its main advantage over Sonnet is that it's better at using a large amount of context, which is enormously helpful during coding tasks.

Sonnet is still an incredibly impressive model: it held the crown for 6 months, which might as well be a decade at the current pace of LLM improvement.

replies(6): >>43708198 #>>43709336 #>>43710444 #>>43712513 #>>43714843 #>>43720979 #
unsupp0rted No.43708198
The main advantage over Sonnet is that Gemini 2.5 doesn't try to make a bunch of unrelated changes, like it's rewriting my project from scratch.
replies(6): >>43708296 #>>43708338 #>>43708390 #>>43708580 #>>43708811 #>>43709225 #
jdgoesmarching No.43708338
Also, Gemini 2.5 still doesn't support prompt caching, which is huge for tools like Cline.
replies(1): >>43708480 #
scrlk No.43708480
2.5 Pro supports prompt caching now: https://cloud.google.com/vertex-ai/generative-ai/docs/models...
replies(1): >>43708565 #
jdgoesmarching No.43708565
Oh, that must've landed in the last few days. Weird that it's only in the 2.5 Pro preview, but at least they're headed in the right direction.

Now they just need a decent usage dashboard that doesn’t take a day to populate or require additional GCP monitoring services to break out the model usage.