Really, this. You still need to check its work, but it is also pretty good at checking its work if told to look at specific things.
Make it stop. Tell it to review whether the code is cohesive. Tell it to review it for security issues. Tell it to review it for common problems you've seen in your own codebase.
Tell it to write a todo list for everything it finds, and tell it to fix each item.
And only review the code yourself once it's worked through a checklist of its own reviews.
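Roughly what I mean, sketched in Python with the Anthropic SDK (the prompts, model id, and list of review passes are placeholders; in practice you might just prompt Claude Code directly):

    # Run the model's own review passes before a human ever looks at the diff.
    # Assumes the Anthropic Python SDK; prompts and model id are illustrative.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    REVIEW_PASSES = [
        "Review this diff for cohesion: do the pieces fit the existing design?",
        "Review this diff for security issues.",
        "Review this diff for problems we've repeatedly hit in this codebase.",
    ]
    MODEL = "claude-sonnet-4-20250514"  # placeholder model id

    def ask(prompt: str) -> str:
        msg = client.messages.create(model=MODEL, max_tokens=4000,
                                     messages=[{"role": "user", "content": prompt}])
        return msg.content[0].text

    def self_review(diff: str) -> str:
        findings = [ask(f"{p}\n\n{diff}") for p in REVIEW_PASSES]
        # Turn everything it found into a todo list, have it fix each item;
        # only after this pass does a human look at the result.
        return ask("Write a todo list covering these findings, then produce a revised diff "
                   "that addresses every item:\n\n" + "\n\n".join(findings) + "\n\n" + diff)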
We wouldn't waste time reviewing a first draft from another developer if they hadn't bothered looking it over and testing it properly, so why would we do that for an AI agent that's far cheaper?
I wouldn't mind seeing a collection of objectives and the emitted output. My experience with LLM output is that it's very often over-engineered for no good reason, which is taxing on me to review.
I want to see this code written to some objective, to compare with what I would have written to the same objective. What I've seen so far are specs so detailed that very little is left to the discretion of the LLM.
What I want to see are cases where the LLM is asked for something and provides it, because I'm curious to compare the result to my proposed solution.
(This sounds like a great idea for a site that shows users a user-submitted task, and only after they submit their own attempt does it show them the LLM's attempt. Someone please vibe code this up, TIA)
It absolutely can; I'm building things to do this for me. Claude Code has hooks that are supposed to trigger on certain states, but so far they don't trigger reliably enough to be useful. What we need are the primitives to build code-based development cycles where each step is executed by a model but the flow is dictated by code. Everything today relies too heavily on prompt engineering, and with long context windows instruction-following goes lax. I ask my model "What did you do wrong?" and it comes back with "I didn't follow instructions" and then gives clear, detailed, correct reasons for how it didn't follow instructions... but that's not supremely helpful, because it still doesn't follow instructions afterwards.
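Something like this is what I mean by the flow being dictated by code: the loop and the exit condition live in ordinary Python, and the model is only invoked to do each step. I'm assuming Claude Code's non-interactive print mode (claude -p) and a pytest suite here; swap in whatever agent entry point and checks you actually use.

    # The cycle is plain code; the model only executes individual steps.
    # The `claude -p` invocation and the pytest command are assumptions.
    import subprocess

    MAX_ROUNDS = 5

    def run_tests() -> subprocess.CompletedProcess:
        return subprocess.run(["pytest", "-q"], capture_output=True, text=True)

    def ask_model(prompt: str) -> None:
        # One model step; when and whether it runs is decided by the loop below.
        subprocess.run(["claude", "-p", prompt], check=True)

    for round_no in range(MAX_ROUNDS):
        result = run_tests()
        if result.returncode == 0:
            print("Tests pass, stopping.")
            break
        ask_model("The test suite is failing. Fix these failures, then stop:\n\n"
                  + result.stdout + result.stderr)
    else:
        print(f"Still failing after {MAX_ROUNDS} rounds; time for a human.")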
It increasingly is. E.g. if you use Claude Code, you'll notice it "likes" to produce todo lists that are rendered specially via the built-in TodoWrite tool.
But it's also a balance of avoiding being over-prescriptive in tools that need to support very different workflows, and it's easy to add more specific checks via plugins.
We're bound to see more packaged-up workflows over time, but the tooling here is still at a very early stage.
Tell it to grade its work in various categories and that you'll only accept B+ or greater work. Focusing on how well it's doing is an important distinction.
Oh, I'm not at all joking. It's better at evaluating quality than producing it blindly. Tell it to grade its work and it can tell you most of the stuff it did wrong. Tell it to grade its work again. Keep going through the cycle and you'll get significantly better code.
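The cycle in code, for anyone who wants to try it; the rubric, the B+ cutoff, the grade parsing, and the model id are all placeholders, and I'm using the Anthropic Python SDK rather than Claude Code itself:

    # Grade-then-revise loop: keep revising until the model grades its own work B+ or better.
    import anthropic

    client = anthropic.Anthropic()
    MODEL = "claude-sonnet-4-20250514"  # placeholder model id
    PASSING = {"A+", "A", "A-", "B+"}

    def ask(prompt: str) -> str:
        msg = client.messages.create(model=MODEL, max_tokens=4000,
                                     messages=[{"role": "user", "content": prompt}])
        return msg.content[0].text

    def grade_cycle(code: str, max_rounds: int = 4) -> str:
        for _ in range(max_rounds):
            report = ask("Grade this code in the categories correctness, clarity, and test "
                         "coverage. Put a single overall letter grade on the first line, "
                         "then list everything that kept it from an A:\n\n" + code)
            if report.splitlines()[0].strip() in PASSING:
                break  # B+ or better: good enough to hand to a human
            code = ask("Revise the code to address every issue in this report:\n\n"
                       + report + "\n\n" + code)
        return code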
The thinking should probably include this kind of introspection (give me a million dollars for training and I'll write a paper), but if it doesn't, you can just prompt it to.
Think of it as a "you should, and are allowed to, spend more time on this" command, because that is pretty much what it is. The model only gets so much "thinking time" to produce the initial output. By asking it to iterate, you're giving it more time to think and refine.