(openai.com)

511 points meetpateltech | 2 comments | 16 May 25 15:02 UTC

ofirpress ◴[16 May 25 17:51 UTC] No.44008115[source]▶

[I'm one of the co-creators of SWE-bench] The team managed to improve on the already very strong o3 results on SWE-bench, but it's interesting that we're just seeing an improvement of a few percentage points. I wonder if getting to 85% from 75% on Verified is going to take as long as it took to get from 20% to 75%.

replies(2): >>44008209 #>>44009418 #

Snuggly73 ◴[16 May 25 18:00 UTC] No.44008209[source]▶

>>44008115 #

I could be completely off base, but it feels to me like benchmaxxing is going on with SWE-bench.

Look at the results from Multi-SWE-bench - https://multi-swe-bench.github.io/#/

SWE-PolyBench - https://amazon-science.github.io/SWE-PolyBench/

Kotlin bench - https://firebender.com/leaderboard

replies(3): >>44010749 #>>44011138 #>>44017378 #

1. ofirpress ◴[17 May 25 00:46 UTC] No.44011138[source]▶

>>44008209 #

Not sure what you mean by benchmaxxing, but we think there's still a lot of useful signal you can infer from SWE-bench-style benchmarking.

We also have SWE-bench Multimodal which adds a twist I haven't seen elsewhere: https://www.swebench.com/multimodal.html

replies(1): >>44011612 #

2. Snuggly73 ◴[17 May 25 02:35 UTC] No.44011612[source]▶

>>44011138 (TP) #

I mean that SWE-bench may be specifically targeted in training, so the results may not reflect real-world performance.

A Research Preview of Codex