Eh, I don’t know. I spent some time over 3 days trying to get Claude Code to write a pagination plugin for ProseMirror. I had a few different branches with different implementations; none of them worked well at all, and one or two were really over-engineered. I asked GPT-5, via the Chat UI (I don’t pay for OpenAI products), and it basically one-shot a working plugin. Not only that, the code was small and comprehensible, too.
GPT-5 at its strongest is as good as any model we've seen from any provider. However, while these models aren't parrots, they are most definitely stochastic. All it takes is for a few influencers and journalists to experience a few conspicuous failures, and you get articles like this one and a growing (if unjustified) perception that the GPT-5 launch is a "flop."
I think their principal mistake was conflating the introduction of GPT-5 with the model-selection heuristics they started using at the same time. Whatever empirical hacks they came up with to determine how much thinking should be applied to a given prompt are not working well. Then there's the immediate-but-not-really deprecation of the other models. It should have been very clear that the image-based tests the CNN reporter referred to were not running on GPT-5 at all. But it wasn't, and that's a big marketing-communications failure on OpenAI's part.
One of several, for anyone who sat through their presentation.
I wonder if GPT-5 benefited from what you learned about how to prompt for this problem while prompting Claude for 3 days?
I’ve done a couple of experiments now, and with effort I can get an LLM to produce not-horrible and mostly functional code. (I’ve been trying to implement algorithms from CS papers that don’t link to code.) I’ve observed that once you discover the magic words the LLM wants and give sufficient background in the prompt history, it can do ok.
But, for me anyway, the process of uncovering the magic words is slower than just writing the code myself. Although that could be because I’m targeting toy examples that aren’t very large code bases and aren’t what’s in the typical internet coding demo.