
"23 SWE-Bench Verified samples that were not runnable on our internal infrastructure were excluded."

What does that mean? Surely this should have a bit more elaboration. If you're just excluding a double-digit number of tasks in the benchmark as uncompleted, that should be reflected in the scores.


