
555 points | maheshrijal | 3 comments
georgewsinger No.43707951
Very impressive! But on arguably the most important benchmark -- SWE-bench Verified, for real-world coding tasks -- Claude 3.7 still remains the champion.[1]

Incredible how resilient the Claude models have been at staying best-in-class for coding.

[1] But only by about 1%, and that's inclusive of Claude's "custom scaffold" augmentation (which in practice I assume almost no one uses?). The new OpenAI models might still be effectively best in class now (and would likely beat Claude with similar augmentation?).

replies(7): >>43708008 #>>43708068 #>>43708249 #>>43708545 #>>43709203 #>>43713202 #>>43716307 #
1. lattalayta No.43708249
I haven't been following them that closely, but are people finding these benchmarks relevant? It seems like these companies could just tune their models to do well on particular benchmarks.
replies(2): >>43708433 #>>43712302 #
2. emp17344 No.43708433
That’s exactly what’s happening. I’m not convinced there’s any real progress occurring here.
3. mickael-kerjean No.43712302
A benchmark is something you can optimize for, which doesn't mean the model generalizes well. Yesterday I spent two hours trying to get Claude to create a program that would extract data from a weird Adobe file. $10 later, the best I had was a program that did something like:

  switch (testFile) {
    case "test1.ase":
      // hard-coded handling, because it's a particular case
      break;
    case "test2.ase":
      // hard-coded handling, because it's a particular case
      break;
    default:
      // generic path that doesn't actually work, but that's "ok" because the
      // cases above already give the right output for all the test files ...
  }
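
For contrast, here's a minimal sketch of the kind of generalizing program I was after: one whose control flow comes from the file's own structure rather than from the names of my test files. The layout assumed here (an "ASEF" signature, a version, then typed, length-prefixed blocks) is loosely based on Adobe's swatch format and is my own assumption for illustration, as are the parseBlocks helper and Block type; it's not something the model produced or a verified spec.

  import { readFileSync } from "node:fs";

  interface Block { type: number; payload: Buffer; }

  // Walk a block-structured binary file: fixed header, then typed,
  // length-prefixed blocks. Note: no branching on file names anywhere.
  function parseBlocks(path: string): Block[] {
    const buf = readFileSync(path);
    // Assumed header: 4-byte "ASEF" signature, two uint16 version fields, uint32 block count.
    if (buf.toString("ascii", 0, 4) !== "ASEF") throw new Error("not an ASE file");
    const blockCount = buf.readUInt32BE(8);
    const blocks: Block[] = [];
    let offset = 12;
    for (let i = 0; i < blockCount; i++) {
      const type = buf.readUInt16BE(offset);        // block type tag
      const length = buf.readUInt32BE(offset + 2);  // payload length
      blocks.push({ type, payload: buf.subarray(offset + 6, offset + 6 + length) });
      offset += 6 + length;                         // same code path for every input file
    }
    return blocks;
  }

Whether the real file follows exactly this layout is beside the point; the contrast is between code driven by the data and code driven by the list of test files it was shown.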