OK, but if you wrote some massive corpus of code with no testing, it probably wouldn't compile either.
I think if you want to make this a useful experiment, you should use one of the coding assistants that can test and iterate on their own code, not a chatbot optimized to impress nontechnical people while being as cheap as possible to run.
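The "test and iterate" part is basically this loop (a rough sketch; ask_model is a hypothetical placeholder for whatever completion API you're calling, and it assumes there's a pytest suite under tests/):

    import subprocess

    def ask_model(prompt: str) -> str:
        """Hypothetical stand-in for a call to the model's completion API."""
        raise NotImplementedError

    def generate_until_tests_pass(task: str, max_iterations: int = 5) -> str:
        code = ask_model(task)
        for _ in range(max_iterations):
            with open("solution.py", "w") as f:
                f.write(code)
            # Run the test suite and capture the output.
            result = subprocess.run(
                ["pytest", "tests/"], capture_output=True, text=True
            )
            if result.returncode == 0:
                return code  # tests pass, we're done
            # Feed the failure output back to the model and try again.
            code = ask_model(
                f"{task}\n\nYour previous attempt failed these tests:\n"
                f"{result.stdout}\n{result.stderr}\nFix the code."
            )
        raise RuntimeError("no passing solution within the iteration budget")

A chat interface gives you one shot through that loop, with a human copy-pasting the error messages by hand, if at all.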
That depends a lot on the system prompt and the tooling available to the model. Are you trying this in Claude Code or Factory.ai, or are you using a chat interface? The difference in outcome can be large.
The name of the model is not the end of the story. There is a Pareto frontier of performance vs. computational cost, and the companies have various knobs and dials they can tune to trade off performance for cost. This is why OpenAI reports costs of $1k/problem when they test their models on the math/coding benchmarks, yet charges you only $20/month for a subscription to their web interface.
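To make "Pareto frontier" concrete: a serving configuration is on the frontier if no other configuration is both cheaper and better. A quick sketch, with made-up (cost, score) numbers for hypothetical configurations of the "same" model:

    # Hypothetical ($ per task, benchmark score) pairs for different
    # serving configurations; the numbers are invented for illustration.
    configs = {
        "chat-tier":      (0.01, 42.0),
        "throttled-api":  (0.50, 50.0),
        "api-default":    (0.10, 55.0),
        "high-reasoning": (3.00, 71.0),
        "benchmark-run":  (1000.0, 88.0),
    }

    def pareto_frontier(points: dict[str, tuple[float, float]]) -> list[str]:
        """Keep configs that no other config beats while costing no more."""
        frontier = []
        for name, (cost, score) in points.items():
            dominated = any(
                c <= cost and s > score
                for other, (c, s) in points.items() if other != name
            )
            if not dominated:
                frontier.append(name)
        return frontier

    print(pareto_frontier(configs))
    # throttled-api drops out: api-default is both cheaper and better.
    # The rest all survive, because each one buys more score with more compute.

The point is that "GPT-whatever scored X" is meaningless without knowing where on that frontier the vendor parked the configuration you're actually using.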