I'm finding the latest models are pretty good at debugging, if you give them the tools to debug properly
If they can run a tool from the terminal, see all the output in text format, and have a clear 'success' criterion, then they're usually able to figure out the issue and fix it (often with spaghetti-code patching, but it does at least fix the bug)
I think the testing/verification part is going to keep getting better as we figure out better tools the AI can use here (e.g., parsing the accessibility tree of a web UI so it can click around in it and verify)
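The "run a tool, read the text output, check a success criterion" loop above can be sketched in a few lines. This is a minimal illustration, not any particular agent framework's API; `run_and_check` and the predicate are hypothetical names:

```python
import subprocess

def run_and_check(cmd, success_check):
    """Run a terminal command, capture all output as text, and apply a
    caller-supplied success predicate.

    Mirrors the three ingredients above: a tool runnable from the
    terminal, full output in text form, and a clear success criterion.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    output = result.stdout + result.stderr
    return success_check(result.returncode, output), output

# Example criterion: exit code 0 and no "FAILED" marker in the output.
ok, log = run_and_check(
    ["python", "-c", "print('all tests passed')"],
    lambda code, out: code == 0 and "FAILED" not in out,
)
```

The point of the predicate is that "success" is explicit and machine-checkable, so the model can loop on patch-run-check without a human judging each attempt.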