Yes, my experience has been the same as yours. I find that the performance of open models is quite acceptable, even good, at one-off questions or small tasks. But they are quite unreliable at long-horizon goals.
>> I'm probably in the minority, but I do not want a "connection" with a business. I want transactional interactions that actually work.
I do want a connection. Because connection is what ensures that the transactional interactions continue to work outside of the "happy path". Connection is what ensures that you can return those expensive headphones you bought because extended use makes your neck hurt, even though the return window has passed.
Anthropic has alleged that this model is much more dangerous than other currently available models. Their CEO has said so publicly multiple times. It's like asking why cesium isn't banned if nuclear missiles are banned.
(whether Mythos is actually that dangerous is beside the point; considering that Anthropic claims that it is, it makes sense to regulate it)
Complete strategic defeat and capitulation by the United States. This all but ensures Iran will become the dominant regional power in about a decade, maybe less.
People always say stuff like this, but it is misleading, because that remaining 5% makes a huge difference and is where most of the value of using AI agents lies.
I'm not interested in using AI to write code that would have taken me 5-10 minutes to write myself. I use AI to debug complex bugs and develop large features that span multiple domains - stuff that normally takes hours, if not days/weeks. A model that is "enough for 95%" does not cut it for that, because the failures compound during long-horizon tasks and the thing becomes a mess.
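To put rough numbers on the compounding point, here is a minimal sketch. It assumes something the comment above does not claim: that a long task is a chain of independent steps, each of which succeeds with the same probability. The `chain_success` helper and the step counts are purely illustrative.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumption (illustrative only): steps are independent and share
    the same success probability.
    """
    return per_step ** steps


if __name__ == "__main__":
    # Compare a "95% good" model across short and long tasks.
    for steps in (1, 10, 50, 100):
        print(f"{steps:>3} steps: {chain_success(0.95, steps):.3f}")
```

Under those assumptions, a model that gets each step right 95% of the time finishes a 10-step task cleanly only about 60% of the time, and the odds keep dropping as the task gets longer. Real tasks are not this tidy, but it shows why "good enough for 95%" can feel very different on long-horizon work.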
I get what you mean. But for many people, AI coding is not about solving complex problems; they mostly do that themselves. For many, AI coding is a productivity tool that helps with mundane but laborious tasks.
In my setup, I use a daily workhorse model for such things. It should be fast, cheap, and work reasonably well. I don’t expect it to be smart, but I need it to follow instructions perfectly and handle tool calling well.
For architectural work or debugging help, I use the top models instead.
That works reasonably well for me at a low cost.
I'm not sure about that. Claude has some bugs, but Codex is not as polished and doesn't have as many features. For example, you need to add MCP servers manually. There's no Plugin/Skill/Connector marketplace that is accessible from within the app, like there is with Claude Desktop. The Cowork equivalent is nowhere near as powerful. And so on.
I still use Codex, but mostly when I need to check Opus 4.8's work. Pretty sure I will stop doing that soon, because during the short time Fable was available, Codex was not able to find any important issues with the code Fable wrote.
But how many plugins are people actually using? I can think of one MCP server I find valuable (context7) and one plugin that I've installed but keep thinking about uninstalling (obra/superpowers).
It's a good thing. I hate MCPs from the bottom of my heart because they always stay there and bloat the context window. Also, usually developers who develop them don't know what they're doing, so the MCP responses also bloat your context even further.
That's the first time I've seen someone preferring GPT-styled output over Claude ;) It's the complete opposite for me: GPT is way too verbose (even after telling it to STFU), overwhelms the user with thousands of options and doesn't just answer a question without shitting out thousands of paragraphs. Also the overall tone is way too enthusiastic.
I strongly prefer Codex. Claude is annoying. Codex provides descriptions where I want them and more touchpoints to audit the quality of work. Claude Code on experimental seems to not even show diffs when asked anymore, and it's much less clear what is being shipped.
Dunno, I prefer GPT 5.5 too for the same reasons as the parent. Extremely subjective but had better results with it too. Maybe I just got unlucky with Claude a few times, but even the latest Opus was dumb.
Fascinating how people have such diametrically opposed experiences. I guess both models have it in them to behave very differently in different circumstances and we have very little idea what pushes them in this or that direction. I guess it does boil down to luck!
Personally, Claude Opus (and in the few interactions I had with it, Fable) has been by far the superior experience. GPT-5.5 seems dumber and more certain when presenting me with bullshit. Opus has better humor, and is less pretentious in its presentation. But this may all boil down to how the models react to my prompting.
What is beyond doubt is that I wish they both were more intelligent – or maybe it is their wisdom I find lacking!
It might be the harness or prompting style. Personally I use opencode, and my prompting style is very plain and terse, with very small tasks. Opus and Sonnet are too often verbose and go off on tangents, whereas GPT-5.5 is much stricter.
PG got into an argument with AOC about it on Twitter. It sounded like he was personally offended by what she was saying. Which makes sense because, as someone who has helped startup founders become famously wealthy, he probably took her statement as an attack on his identity.