I think the difference is that with LLMs, in a lot of cases you do see some diminishing returns.
I won't deny that the latest Claude models are fantastic at just one shotting loads of problems. But we have an internal proxy to a load of models running on Vertex AI and I accidentally started using Opus/Sonnet 4 instead of 4.6. I genuinely didn't know until I checked my configuration.
AI models will get to the point where, for 99% of problems, something like Gemma is gonna work great for people. Pair it up with an agentic harness on the device that lets it open apps and click buttons and we're done.
I still can't fathom that we're in 2026 in the AI boom and I still can't ask Gemini to turn shuffle mode on in Spotify. I don't think model intelligence is as much of an issue as people think it is.
100% agree here. The actual practical bottleneck is harness and agentic abilities for most tasks.
It's the biggest thing that stuck out to me using local AI with open source projects vs Claude's client. The model itself is good enough I think - Gemma 4 would be fine if it could be used with something as capable as Claude.
And that's gonna stay locked down unfortunately, especially on mobile and cars - it needs access to APIs to do that stuff, and not just regular APIs that were built for traditional invocation.
The same way that websites are getting llm.txts I think APIs will also evolve.
GPT 3.5 was intelligent enough to understand that command and turn it into a correctly shaped JSON object; the platforms just don't have tight enough integration to take advantage of the intelligence.
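To make that concrete, here's a sketch of the kind of tool schema a platform could expose and the JSON call a model would emit for "turn shuffle mode on in Spotify". The `set_shuffle` tool and its parameters are invented for illustration; no real Spotify or assistant API is implied.

```python
import json

# Hypothetical tool schema a platform could expose to a model.
# "set_shuffle" and its parameters are made up for this sketch.
tool = {
    "name": "set_shuffle",
    "description": "Enable or disable shuffle mode in the music player",
    "parameters": {
        "type": "object",
        "properties": {"enabled": {"type": "boolean"}},
        "required": ["enabled"],
    },
}

# The correctly shaped call even a 2023-era model could emit for
# "turn shuffle mode on in Spotify":
call = {"name": "set_shuffle", "arguments": {"enabled": True}}
print(json.dumps(call))
```

The hard part was never generating this JSON; it's that Spotify, the OS, and the assistant aren't wired together so the call has somewhere to go.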
I think security is the issue; AI is good at circumventing it. For example, AI can read paywalled articles you cannot. Do you really want AI to have free rein?
I mean, to me even the difference between Opus and Sonnet is as clear as night and day, and likewise between Opus and the best GPT model. Opus 4.6 just seems much more reliable: I ask it to do something, and it actually happens.
It depends what you're asking it to do, though. Sure, in a software development environment the difference between those two models is noticeable.
But think about the general user. They're using the free Gemini or ChatGPT. They're not using the latest and greatest. And they're happy using it.
And I am willing to bet that a lot of paying users would be served perfectly fine by the free models.
If a capable model is able to live on device and solve 99% of people's problems, then why would the average person ever need to pay for ChatGPT or Gemini?
But consider other tasks, like research, where dates matter, little details and connections matter, and reasoning matters - background research and tool use outside of software development. That's where I'm finding LLMs most useful in my life.
Even Opus makes mistakes with dates, or misreads news in its chronological context, and smaller, less capable models would be even worse.
My experience is very different from yours. Codex and CC yield very different results, both because of harness differences and model differences, but neither is noticeably better than the other.
Personally, I like Codex better just because I don't have to mess with any sort of planning mode. If I imply that it shouldn't change code yet, it doesn't. CC is too impatient to get started.
I guess yes, that's a harness difference, and you can also configure CC as a harness to behave very differently. But even with the same harness and guidance, "to me" there's still a difference between Opus 4.6 and e.g. GPT 5.4 - or which GPT model do you use? I've been using Claude Code, Codex, and OpenCode as harnesses presently, but for serious long-running implementation I feel like I can only really rely on CC + Opus 4.6.
I came from Cursor before adopting the TUI tools. Opus was nothing short of pathetic in their environment compared to the -codex models. I would only use it for investigations and planning, because it was faster.
Like you've said, though, that could just be a harness issue.
I have the opposite experience. Codex gets to work much faster than Claude Code. Also I've never seen the need to use planning mode for Claude. If it thinks it needs a plan it will make one automatically.
"XYZ Corp" won't allow their developers to write their desktop app in Rust because they want to consume only 16MB RAM, then another implementation for mobile with Swift and/or Kotlin, when they can release good enough solution with React + Electron consuming 4GB RAM and reuse components with React Native.
Strangely enough, AI could turn this on its head. You can have your cake and eat it too, because you can tell Claude/Codex/whatever to build you a full-featured Swift version for iOS and Kotlin for Android and whatever you want on Windows and Mac. There's still QA for the different builds, but you already have to QA each platform separately anyway if you really care that they all work, so in theory that doesn't change.
Of course, it's never that simple in reality; you need developers who know each platform for that to work, because you must run the builds and tell the AI what it's doing wrong and iterate. Currently, you can probably get away with churning out Electron slop and waiting for users to complain about problems instead of QAing every platform. Sad!
> The simple fact is that a 16 GB RAM stick costs much less than the development time to make the app run on less.
The costs are borne by different people: development by the company, RAM sticks by the customer.
A company is potentially (silently?) adding to the cost of the product/service that the customer has to bear, by making them need more RAM (or have the same amount, but be able to do less with it).
Yep, and since companies care about TCO, they reward the software with the lower TCO, which happens to be the one that uses more RAM but is cheaper to produce.
Some software has millions or even billions of users. The cost of 16 GB multiplied by millions or billions would pay for a lot of refactoring.
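A rough illustration of that multiplication, with made-up numbers (the RAM price, user count, and developer-year cost are all assumptions for the sketch, not market data):

```python
# All figures are assumptions invented for this back-of-the-envelope sketch.
ram_cost_16gb = 30        # USD, assumed price of a 16 GB stick
users = 10_000_000        # assumed user base
dev_year_cost = 200_000   # USD, assumed fully loaded developer-year

total_ram_cost = ram_cost_16gb * users               # borne by customers
dev_years_equivalent = total_ram_cost / dev_year_cost  # what that buys in dev time
print(f"${total_ram_cost:,} of RAM ~= {dev_years_equivalent:,.0f} developer-years")
```

Even with conservative numbers, the customer-side RAM bill dwarfs what a refactoring effort would cost the vendor.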
That said, I think it’s more of a collective action problem. The person who could pay for the refactor to operate in 640 K is not the same person who has to pay for the 16 GB. And yes, the 16 GB is cheap enough in comparison to other costs that the latter group doesn’t necessarily notice that they are subsidizing inefficient development.
I think stavros means amortization on an individual level - if all software is bloated and requires 16GB to run then my expense for a 16GB stick is not caused by a single piece of software, but everything I use.
Not that I agree of course :) I’m talking more of the net negative of everyone needing to buy 16gb sticks so developers can YOLO vibe-coded unoptimized garbage. But at least I think the former explanation is what stavros meant :)
People get hung up on bad optimization. If you are working at a sufficiently large scale, yes, thinking about bytes might be a good use of your time.
But most likely, it's not. At a system level we don't want people to do that. It's a waste of resources. Making a virtue out of it is bad, unless you care more about bytes than humans.
These bytes are human lives. The bytes and the CPU cycles translate into software that takes longer to run, that is more frustrating, that makes people accomplish less in more time than they could, or should. Take too much, and you prevent them from using other software in parallel, compounding the problem. Or you force them to upgrade hardware early, taking away money they could better spend on other areas of their lives. All this scales with the number of users, so for most software with any user base, not caring about bytes and cycles wastes far more people-hours than it saves in dev time.
Training people to do these optimizations also costs human lifetime, which is then not spent on other things, like building the unoptimized version of another product.
We're not talking about writing assembly by hand here. If your software has a million daily users and wastes a minute of their day, that's about 9 work-years of labour wasted every single day.
In a 5-year lifecycle that's about 10,000 years of human labour wasted. Yes, I had to quadruple-check this myself.
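The arithmetic above can be checked directly. The work-year length and working-days count are my assumptions for the sketch, not the commenter's exact figures, but they land in the same ballpark as the "about 9 work-years per day" and "about 10,000 over 5 years" claims:

```python
# Back-of-the-envelope check of the figures above. Assumptions:
# a 2,000-hour work-year and 250 working days per year.
users = 1_000_000            # daily users
wasted_minutes_per_day = 1   # minutes each user loses per day
work_year_hours = 2_000
working_days_per_year = 250
years = 5

hours_wasted_per_day = users * wasted_minutes_per_day / 60
work_years_per_day = hours_wasted_per_day / work_year_hours
total_work_years = work_years_per_day * working_days_per_year * years

print(f"{work_years_per_day:.1f} work-years wasted per day")
print(f"{total_work_years:,.0f} work-years over {years} years")
```

Nudge the assumptions (an 1,800-hour work-year, waste accruing on weekends too) and the totals move, but the order of magnitude doesn't.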
Does it take 10,000 work-years of effort, per project, to train its developers to write reasonably performant code?
Of course not all of this would translate into actual productivity gains but it doesn't have to.
The one we're in where "software" doesn't just mean an app that someone downloads from a website or an app store. Software includes lots of server side components, etc, etc.
I once noticed my name in the Chromium OS credits due to a patch I had submitted to a library that's on every Chromebook. 1 million would be a small number for Chromebooks alone.
I'm not talking about the median piece of software with 2 users and 0.1 developers (I made that up).
The ones that stick out are actively maintained, widely used, and well funded. It doesn't have to be a million active users, but they should be the first to get their act together.
Look at the whole history of computing. How many times has the pendulum swung from thin to fat clients and back?
I don't think it's even mildly controversial to say that there will be an inflection point where local models get Good Enough and this iteration of the pendulum shall swing to fat clients again.
Assuming improvements in LLMs follow a sigmoid curve, even if the cloud models are always slightly ahead in terms of raw performance it won't make much of a difference to most people, most of the time.
The local models have their own advantages (privacy, no as-a-service dependency) that, for many people and orgs, will offset a small performance advantage. And, of course, you can always fall back on the cloud models should you hit something particularly chewy.
(All IMO - we're all just guessing. For example, good marketing or an as-yet-undiscovered network effect of cloud LLMs might distort this landscape).
My thinkpad is nearly 10 years old, I upgraded it to 32GB of ram and have replaced the battery a couple of times, but it's absolutely fine apart from that.
If AI that was leading edge in 2023 can run on a 2026 laptop, then presumably AI that is leading edge in 2026 will run on a 2029 laptop. And given that 2023's models were world-changing, that capacity is already on today's laptops.
Either AI grows exponentially, in which case it doesn't matter, as all work will be done by AI by 2035; or it plateaus in, say, 2032, in which case by 2035 those models will run on a typical laptop.