The tasks they are measuring are extremely well-defined, extremely well-known problems. They don't really represent general programming as it is practiced day to day. "Find a fact on the web", "Train a classifier" — these are trivial given that the answers are all over GitHub, Stack Overflow, etc.
So the models are getting exponentially better at doing some easy fraction of programming work. But that's like self-driving cars getting exponentially better at driving on very safe, easy roads, with no measurement at all of performance on chaotic city streets, rural back-roads, or edge cases like a semi swerving or weird reflections.
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...