SWE-bench is a great eval, but it's very narrow. Two models can have the same SWE-bench score and still give very different user experiences.
Here's a nice thread on X about the things that SWE-bench doesn't measure:
https://x.com/brhydon/status/1953648884309536958
https://nitter.net/brhydon/status/1953648884309536958