"23 SWE-Bench Verified samples that were not runnable on our internal infrastructure were excluded."
What does that mean? Surely this deserves a bit more elaboration. If you're excluding a double-digit number of the benchmark's tasks because they couldn't be run, that should be reflected in the scores.
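To put a rough number on it, here is a minimal sketch of how much the exclusion could matter. It assumes the reported score was computed only over the tasks that ran, and it uses the 500-task size of SWE-Bench Verified. The 50% reported rate is a made-up placeholder, not a figure from the report, and `rescored_pass_rate` is a hypothetical helper name.

```python
def rescored_pass_rate(reported_rate: float, total: int, excluded: int) -> float:
    """Pass rate over the full benchmark if every excluded task counts as a failure."""
    evaluated = total - excluded          # tasks that were actually run
    solved = reported_rate * evaluated    # implied number of solved tasks
    return solved / total


TOTAL_TASKS = 500             # size of SWE-Bench Verified
EXCLUDED = 23                 # from the quoted sentence
HYPOTHETICAL_REPORTED = 0.50  # placeholder only, NOT the actual reported score

adjusted = rescored_pass_rate(HYPOTHETICAL_REPORTED, TOTAL_TASKS, EXCLUDED)
print(f"{adjusted:.3f}")  # prints 0.477
```

Under those assumptions the gap is about 2.3 percentage points, and it grows with the reported rate, since the adjusted figure is always the reported one scaled by 477/500.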