OpenAI o3 and o4-mini

(openai.com)

555 points maheshrijal | 3 comments | 16 Apr 25 17:01 UTC

georgewsinger ◴[16 Apr 25 17:20 UTC] No.43707951[source]▶

Very impressive! But on arguably the most important benchmark -- SWE-bench Verified for real-world coding tasks -- Claude 3.7 still remains the champion.[1]

Incredible how resilient Claude models have been for best-in-coding class.

[1] But by only about 1%, and inclusive of Claude's "custom scaffold" augmentation (which in practice I assume almost no one uses?). The new OpenAI models might still be effectively best in class now (or likely beating Claude with similar augmentation?).

replies(7): >>43708008 #>>43708068 #>>43708249 #>>43708545 #>>43709203 #>>43713202 #>>43716307 #

jjani ◴[16 Apr 25 17:27 UTC] No.43708068[source]▶

>>43707951 #

Gemini 2.5 Pro is widely considered superior to 3.7 Sonnet now by heavy users, but it doesn't have an SWE-bench score. Shows that looking at one such benchmark isn't very telling. Its main advantage over Sonnet is that it's better at using a large amount of context, which is enormously helpful during coding tasks.

Sonnet is still an incredibly impressive model as it held the crown for 6 months, which may as well be a decade with the current pace of LLM improvement.

replies(6): >>43708198 #>>43709336 #>>43710444 #>>43712513 #>>43714843 #>>43720979 #

unsupp0rted ◴[16 Apr 25 17:37 UTC] No.43708198[source]▶

>>43708068 #

Main advantage over Sonnet is Gemini 2.5 doesn't try to make a bunch of unrelated changes like it's rewriting my project from scratch.

replies(6): >>43708296 #>>43708338 #>>43708390 #>>43708580 #>>43708811 #>>43709225 #

bitbuilder ◴[16 Apr 25 18:34 UTC] No.43708811[source]▶

>>43708198 #

This was incredibly irritating at first, though over time I've learned to appreciate this "extra credit" work. It can be fun to see what Claude thinks I can do better, or should add in addition to whatever feature I just asked for. Especially when it comes to UI work, Claude actually has some pretty cool ideas.

If I'm using Claude through Copilot where it's "free" I'll let it do its thing and just roll back to the last commit if it gets too ambitious. If I really want it to stay on track I'll explicitly tell it in the prompt to focus only on what I've asked, and that seems to work.

And just today, I found myself leaving a comment like this: //Note to Claude: Do not refactor the below. It's ugly, but it's supposed to be that way.

Never thought I'd see the day I was leaving comments for my AI agent coworker.

replies(1): >>43709077 #

1. TuxSH ◴[16 Apr 25 18:58 UTC] No.43709077[source]▶

>>43708811 #

> If I'm using Claude through Copilot where it's "free"

Too bad Microsoft is widely limiting this -- have you seen their pricing changes?

I also feel like they nerfed their models, or reduced the context window again.

replies(1): >>43711593 #

2. Aeolun ◴[16 Apr 25 23:51 UTC] No.43711593[source]▶

>>43709077 #

Claude is almost comically good outside of Copilot. When used through Copilot, it’s like working with a lobotomized idiot (that complains it generated public code about half the time).

replies(1): >>43735445 #

3. TuxSH ◴[19 Apr 25 10:20 UTC] No.43735445[source]▶

>>43711593 #

It used to be good, or at least quite decent in GH Copilot, but it all turned into poop (the completions, the models, everything) ever since they announced the pricing changes.

Considering that M$ obviously trains over GitHub data, I'm a bit pissed, honestly, even if I get GH Copilot Pro for free.
