OpenAI o3 and o4-mini

(openai.com)

555 points maheshrijal | 4 comments | 16 Apr 25 17:01 UTC


georgewsinger ◴[16 Apr 25 17:20 UTC] No.43707951[source]▶

Very impressive! But on arguably the most important benchmark -- SWE-bench Verified, for real-world coding tasks -- Claude 3.7 remains the champion.[1]

Incredible how resilient Claude models have been as best-in-class for coding.

[1] But by only about 1%, and inclusive of Claude's "custom scaffold" augmentation (which in practice I assume almost no one uses?). The new OpenAI models might still be effectively best in class now (or likely beating Claude with similar augmentation?).

replies(7): >>43708008 #>>43708068 #>>43708249 #>>43708545 #>>43709203 #>>43713202 #>>43716307 #

oofbaroomf ◴[16 Apr 25 17:23 UTC] No.43708008[source]▶

>>43707951 #

Claude got 63.2% according to the swebench.com leaderboard (listed as "Tools + Claude 3.7 Sonnet (2025-02-24)").[0] OpenAI said they got 69.1% in their blog post.

[0] swebench.com/#verified

replies(3): >>43708246 #>>43708263 #>>43708363 #

1. georgewsinger ◴[16 Apr 25 17:42 UTC] No.43708263[source]▶

>>43708008 #

Yes; however, Claude advertised 70.3%[1] on SWE-bench Verified when using the following scaffolding:

> For Claude 3.7 Sonnet and Claude 3.5 Sonnet (new), we use a much simpler approach with minimal scaffolding, where the model decides which commands to run and files to edit in a single session. Our main “no extended thinking” pass@1 result simply equips the model with the two tools described here—a bash tool, and a file editing tool that operates via string replacements—as well as the “planning tool” mentioned above in our TAU-bench results.

Arguably this shouldn't be counted though?

[1] https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-...

replies(1): >>43708567 #

2. tedsanders ◴[16 Apr 25 18:08 UTC] No.43708567[source]▶

>>43708263 (TP) #

I think you may have misread the footnote. That simpler setup results in the 62.3%/63.7% score. The 70.3% score results from a high-compute parallel setup with rejection sampling and ranking:

> For our “high compute” number we adopt additional complexity and parallel test-time compute as follows:

> We sample multiple parallel attempts with the scaffold above

> We discard patches that break the visible regression tests in the repository, similar to the rejection sampling approach adopted by Agentless; note no hidden test information is used.

> We then rank the remaining attempts with a scoring model similar to our results on GPQA and AIME described in our research post and choose the best one for the submission.

> This results in a score of 70.3% on the subset of n=489 verified tasks which work on our infrastructure. Without this scaffold, Claude 3.7 Sonnet achieves 63.7% on SWE-bench Verified using this same subset.
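A rough sketch of what that high-compute selection step amounts to, going only by the quoted description. The function name `select_patch` and the two callables are hypothetical stand-ins of mine (Anthropic has not published this code): one represents running the repository's visible regression tests, the other represents the scoring model.

```python
from typing import Callable, Optional, Sequence


def select_patch(
    attempts: Sequence[str],
    passes_regression_tests: Callable[[str], bool],
    score: Callable[[str], float],
) -> Optional[str]:
    """Pick one patch from several sampled attempts.

    Hypothetical helper: filter by visible regression tests
    (rejection sampling), then keep the highest-scored survivor.
    """
    # Step 1: discard patches that break the visible regression tests.
    survivors = [p for p in attempts if passes_regression_tests(p)]
    if not survivors:
        return None  # every attempt was rejected
    # Step 2: rank the remaining attempts and submit the best one.
    return max(survivors, key=score)


# Toy stand-ins: real use would run a test suite and a learned scorer.
attempts = ["patch_a", "patch_b_breaks_tests", "patch_c"]
toy_scores = {"patch_a": 0.4, "patch_b_breaks_tests": 0.9, "patch_c": 0.7}

best = select_patch(
    attempts,
    passes_regression_tests=lambda p: "breaks" not in p,
    score=toy_scores.__getitem__,
)
print(best)  # patch_c
```

Note that the highest-scored attempt in the toy data is rejected before ranking, so the filter and the ranker are doing separate jobs. None of this machinery is present in the simpler single-session setup.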

replies(1): >>43709569 #

3. georgewsinger ◴[16 Apr 25 19:46 UTC] No.43709569[source]▶

>>43708567 #

Somehow completely missed that, thanks!

I think reading this makes it even clearer that the 70.3% score should just be discarded from the benchmarks. "I got a 7-8 point higher SWE-bench score by doing a bunch of extra work and sampling a ton of answers" is not something a typical user is going to have already set up when logging onto Claude and asking it a SWE-style question.

Personally, it seems like an illegitimate way to juice the numbers (though Claude was transparent about what they did, so it's all good, and it's not uninteresting to know you can boost your score by roughly 7 to 8 percentage points with the right tooling).

replies(1): >>43712240 #

4. ianbutler ◴[17 Apr 25 01:39 UTC] No.43712240{3}[source]▶

>>43709569 #

It isn't on the benchmark leaderboard: https://www.swebench.com/#verified

The one on the official leaderboard is the 63% score, presumably because of all the extra work they had to do for the 70% score.
