To be fair, I'll take an unbiased 16-person study over "internal measures" from a MAANG company that has burned hundreds of billions on AI with no ROI and is now forcing its employees to use AI.
I don't speak for bopbopbop7, but I will say this: my experience of using Claude Code has been that it can do much longer tasks than the METR benchmark implies are possible.
The flip side is that, if those tasks are representative of software engineering as a whole, I would expect there to be a lot of other tasks where it absolutely sucks.
This expectation is further supported by how often people pop up in conversations like this to say that any given LLM falls flat on its face even on something the poster thinks is simple, and that it cost more time than it saved.
As with supposedly "full" self-driving on Teslas, the anecdotes about the failure modes are much more interesting than the successes: one person whose commute or coding problem happens to be easy may mistake their own circumstances for normal. Until it works everywhere, it doesn't work everywhere.
When I experiment with vibe coding (as in, properly unsupervised), it can break down large tasks into small ones and churn through each sub-task well enough that it can complete, by itself, a task I'd expect to take most of a sprint. That said, it seems to do these things at a level of "that'll do" rather than "amazing!", but it does do them.
But I am very much aware this is like all the people posting "well, my Tesla commute doesn't need any interventions!" in response to those pointing out that it's been a decade since Musk said, "I think that within two years, you'll be able to summon your car from across the country. It will meet you wherever your phone is … and it will just automatically charge itself along the entire journey."
It works on my [use case], but we can't always ship my [use case].
I could have guessed you would say that :) But the METR study is not unbiased either. Maybe you mean that METR is less likely to intentionally inflate their numbers?
If you insist on believing in a conspiracy, I don't think there's anything I or others could say or show you that would assuage you; all I can say is that I've seen the raw data. It's a mess, and again we're stuck with proxies (which are bad, since you start conflating a change in the proxy-latent relationship with the treatment effect). It's also hard, and arguably irresponsible, to run RCTs.
All I will say is: there are flaws everywhere. The METR results are far from conclusive, and it's totally understandable if there is a mismatch between perception and performance. But also consider: even if a task takes the same or slightly more time, one big advantage for me is that it substantially reduces cognitive load, so I can work in parallel sessions on two completely different issues.
I bet it does reduce your cognitive load, considering you, in your own words, "Give up when Claude is hopelessly lost". No better way to reduce cognitive load.