511 points | meetpateltech | 1 comment
1. ZeroCool2u No.44012085
"23 SWE-Bench Verified samples that were not runnable on our internal infrastructure were excluded."

What does that mean? Surely this deserves more elaboration. If you're excluding a double-digit number of benchmark tasks because they couldn't be run, that should be reflected in the scores.
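To put rough numbers on why this matters, here is a minimal sketch of how much 23 excluded samples could move a score. It assumes the full SWE-Bench Verified set has 500 samples and that the reported score is computed only over the remaining 477; the 70% pass rate and the `score_bounds` helper are hypothetical, chosen for illustration and not taken from the announcement.

```python
def score_bounds(reported_rate, total, excluded):
    """Bound the full-benchmark pass rate given a rate measured on a subset.

    reported_rate: pass rate on the samples that were actually run (0..1)
    total:         number of samples in the full benchmark
    excluded:      number of samples that were not run
    """
    evaluated = total - excluded
    solved = reported_rate * evaluated
    lower = solved / total               # every excluded sample counted as a failure
    upper = (solved + excluded) / total  # every excluded sample counted as a pass
    return lower, upper


TOTAL = 500      # assumed size of SWE-Bench Verified
EXCLUDED = 23    # from the quoted footnote
REPORTED = 0.70  # hypothetical pass rate on the 477 runnable samples

lower, upper = score_bounds(REPORTED, TOTAL, EXCLUDED)
print(f"full-set score is between {lower:.1%} and {upper:.1%}")
```

Under those assumptions the full-set score lands somewhere between about 66.8% and 71.4%, a spread of 4.6 points (23/500) that the headline number alone doesn't show.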