It's because they do things that they score differently. Coding harnesses add features for user experience, not for agent efficiency. If they optimized for efficiency, all the coding harnesses would be using bash and code mode and letting the agents write code to perform tasks, but this doesn't work because you want humans in the loop. You want users to be able to approve and deny writes. You want users to see edits. So you have to build tools for these. It's hard to show diffs when the agent is just using bash.
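To make the approve/deny point concrete, here is a rough sketch of the kind of write tool a harness ends up building. The names `propose_write` and `approve` are made up for illustration and are not taken from any real harness; the only real API used is Python's `difflib`.

```python
import difflib
from pathlib import Path
from typing import Callable


def propose_write(path: str, new_text: str, approve: Callable[[str], bool]) -> bool:
    """Show a unified diff of the proposed edit and write only if approved.

    `approve` is a hypothetical callback: a real harness would render the
    diff in its UI and wait for the user to accept or reject it.
    """
    target = Path(path)
    old_text = target.read_text() if target.exists() else ""
    diff = "".join(
        difflib.unified_diff(
            old_text.splitlines(keepends=True),
            new_text.splitlines(keepends=True),
            fromfile=f"a/{path}",
            tofile=f"b/{path}",
        )
    )
    if not diff:
        return False  # nothing would change, so there is nothing to approve
    if not approve(diff):
        return False  # user denied the write; the file is left untouched
    target.write_text(new_text)
    return True
```

With plain bash the agent could make the same change with `sed` or a heredoc, and the harness would have no structured before/after to show, which is the trade-off being described.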
> The added features are just user experience features.
> It's because they do things that they score differently.
That was my point. Regardless of how you feel about UX, it's a value-added set of features. The question initially posed stands. Why would a company do any of these things?
> Coding harnesses add features for user experience, not for agent efficiency.
Pretending it was always about some metric you just decided was important is moving the goalposts. It's not compelling.
I think it makes more sense that it's Freemium Dominance or they act as Low-Cost Marketing tools.
The reason I have a dog harness is to distribute weight so I don't choke her when she goes at the other dog that she doesn't like. I'm actually puzzling over kids kamikazeing into cars.
It's actually only a problem if it's the other way around, isn't it?
If kids run into a car, they will most probably just bounce and continue, perhaps inflicting some minor damage. But if a car mows down a kid, that could well be a fatal injury. Leashes for all the cars! ;)
Try Kimi in Kimi CLI and Claude Code and try saying that again. Without the measures that are in their CLI but not in Claude Code, Kimi quickly collapses into tool-calling loops, and it is largely useless for any long-running tasks in harnesses that don't take this into account.
With those measures (which are actually quite interesting) it can at times perform at Sonnet level.
Building a good and working coding harness with smaller models is really hard. Everything revolves around the limited context size.
Tools must be specification-driven to reduce noise and high-temperature hallucinations; tool call shrinking needs to remove errors and tryouts of different parameter formats (because LLMs always ignore descriptions in the JSON...); and you have to deal with long-running agents because you can't afford them. With a planner/orchestrator architecture, agent-to-agent communication needs to be summarized, and then you have the messed-up scheduling parts, because you need to prioritize short-running agents and give the planner a tool to wait for outputs of spawned contractor agents.
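As a rough illustration of what "tool call shrinking" could look like, here is a minimal sketch. It assumes a simplified history format: the `ToolCall` record and `shrink_tool_calls` function are hypothetical names, not the exocomp implementation or any library's API. The idea is to drop failed attempts once the same tool has succeeded later, so the retries with wrong parameter formats stop eating context.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation in the agent's history."""
    tool: str
    args: dict
    ok: bool
    output: str


def shrink_tool_calls(history: list[ToolCall]) -> list[ToolCall]:
    """Drop failed calls that were superseded by a later success of the same tool."""
    kept: list[ToolCall] = []
    succeeded_later: set[str] = set()
    # Walk backwards so that, for each call, we already know whether the
    # same tool succeeded at some later point in the history.
    for call in reversed(history):
        if call.ok:
            succeeded_later.add(call.tool)
            kept.append(call)
        elif call.tool not in succeeded_later:
            # Unresolved failure: keep it, the model still needs to see it.
            kept.append(call)
    kept.reverse()
    return kept
```

Matching on tool name alone is a deliberate simplification; a real harness would probably also want to compare the intent of the call before deciding a failure has been superseded.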
And that's not even talking about sandbox vs playground read/write/access policies of tools.
Harness engineering, if done correctly, is quite hard.
And all of this works 60% of the time, every time.
Anyways, that was somewhat the summary of the last 6 months building my exocomp agentic environment. And it's still not satisfying to work with.
In my limited experience, the smaller the model, the bigger the harness. Whereas with something like Claude or DeepSeek the context size etc. just lets you give it bash access and step back, small models tend to do better with simple action-response and a new context on each call. Context management becomes a continuous activity. It's a fun space, and I have found big models decent at building and improving these harnesses for the small ones, using /loop to just run a continuous test-build-test loop.
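Roughly what I mean by action-response with a new context each call, as a sketch only. `model` and `execute` are placeholder callables standing in for the LLM call and the tool runner, and the prompt format is invented for the example.

```python
from typing import Callable


def run_steps(
    goal: str,
    model: Callable[[str], str],
    execute: Callable[[str], str],
    max_steps: int = 10,
) -> str:
    """Ask for one action per call, rebuilding the prompt from a short state summary."""
    summary = "nothing done yet"
    for _ in range(max_steps):
        # Fresh context every call: the goal plus compact state,
        # never the full transcript of earlier steps.
        prompt = f"Goal: {goal}\nState: {summary}\nReply with one action, or DONE."
        action = model(prompt).strip()
        if action == "DONE":
            break
        result = execute(action)
        # Carry forward only the latest action and a truncated result.
        summary = f"last action: {action}; result: {result[:200]}"
    return summary
```

Keeping only the last step is the crudest possible state; the continuous context management work is mostly about deciding what else deserves to be carried forward.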
It's kinda weird to think the Chinese AI labs might be more trustworthy than the US labs.
- Anthropic is run by a bunch of nut jobs.
- OpenAI is run by a guy you can't trust.
I don't even know if we should include DeepMind, Meta, or xAi in the conversation of AI labs at this point since they can't produce models better than Chinese labs.
To be fair, nerfing Claude on frontier research tasks is consistent with Anthropic's stated beliefs. So in that sense you can trust them to always behave consistently if strangely. But this launch was done very poorly with the lack of transparency on when the frontier research policy was violated.
Yeah, and their beliefs are fucking crazy and dangerous. They are literally sabotaging their users. They built malware into their model that triggers if you prompt it about training a fucking AI model. It doesn't tell you; no, it literally sabotages you by editing your prompt and intentionally going against your request.
You want fucking nut jobs like this building models?
It's one thing to build safeguards on your model and have it prompt the user back. I'm sorry I can't help you with this request. Chinese models do this for some requests.
It's another thing to actively try to make the model perform worse for your user on purpose because they asked the model to do something you, the model creator, didn't like.
Imagine someone is asking a logical medical question and the model swaps the prompt, purposely acts less intelligent, and gives bad advice to this person.
How do these people not understand they are stupid?
Is it really crazy to nerf a proprietary model to prevent it from training another model? I don't think that's even remotely similar to giving bad medical advice.
It’s not a nerf, it’s sabotage. That’s different. This is like if you’re driving a car and it detects you’re pulling up to a competing dealership so it cuts the brakes.
This is, in my mind, effectively malware. We don’t know exactly what code the model will inject, and we certainly don’t know when it will happen. It could very easily introduce vulnerabilities.
Given that the "proprietary" model is built on stolen work at an unprecedented scale, it's at the very least hypocritical to a degree that would not be possible without a fundamentally amoral mindset.
Every model release is just proof that AGI will most likely only be for the rich. We are a few years into LLMs and the majority of people are already getting priced out of intelligence from LLMs, and these are nowhere near AGI.
This is like looking at mainframe pricing in 1990 and concluding that PCs will only be for the rich. The price of each new level of capability is going to drop like crazy very quickly. It won't be that long before practically any consumer use case will be possible on models that are dirt cheap.
Improvements in model performance aren't always strictly compute-constrained in a way that makes them reliant on Moore's Law. Open-weight models, in particular those from Chinese labs, are optimizing model intelligence with less compute. They're "behind" frontier models by months, but as others have noted, it's possible to get Sonnet 4.5+ level performance at reduced cost, today, from open-weight labs.
No, I'm not assuming Moore's law. The efficiency of AI datacenters will continue to improve even without Moore's law, but more importantly the efficiency of packing intelligence into gigabytes and FLOPS will improve by leaps and bounds over the coming years, just as it has for the past few years if not faster.
Then you're assuming an efficiency that is analogous to how Moore's law made it efficient for chips. Same difference. The problem is that AI scaling in the longest term is a completely unknown problem.
Training improvements and Moore's Law are "analogous" but not "same difference." They are far from the same thing, governed by completely different factors, and one can happen and has been happening independently from the other.
Well, I never said nor meant that; rather, my third (3) sentence should've hinted that I already believe what you are saying in your second sentence (2). Whereas my second (2) sentence was handwaving at the notion that if the parent commenter's remark (about improvement trends) were to be assumed, then the rational argument must be subject to the same standards, ergo same difference (in argument standards). (Also I use a phone, please excuse any confusion due to not spelling out my online opinions in full)
To clarify another way, it seems the parent commenter and obviously many, many lay people seem to think ALL sorts of technology improves eventually and are always very assured of that. That's a common mistaken premise or axiom used in their arguments. (Arguably Moore's law (up until now) has been a factor in confounding this observation because so much other tech has historically benefited from it directly or indirectly)
Sorry, but a plain reading of your comment does not imply at all that you agree with me, rather the opposite. I'm not basing my opinion on any mistaken axiom of inevitable technology improvement, of course. I'm projecting obvious trends of the past few years which are overwhelmingly likely to continue in the medium term.
"Same difference" could only mean that you believe my argument should fail in the same way as an argument based on Moore's law. If that's not what you meant then you should have used different words. If that is what you meant, with the justification that "AI scaling in the longest term is a completely unknown problem", I disagree with that too.
In the "longest term" the ultimate scaling of AI doesn't matter for the original question of whether "AGI will most likely only be for the rich". Nobody looks at the TOP500 list today and says "computing is only for the rich". This is because we have an abundance of iPhones and gaming PCs in the consumer market, providing practically any application of computing that a consumer could want at very attainable prices. Similarly, practically any application of AGI will be accessible to consumers at attainable prices. Continued AI scaling after a certain point will be relevant mostly to industry (whose products will still be priced attainably, analogously to the way weather forecasts produced on TOP500 supercomputers are readily accessible to the public today).
It's a quadratic graph. It starts cheap but not that capable, gets better and more expensive, and then the time comes when the capability needed is no longer that of the frontier models, and the price goes down at the companies hosting models whose capability is "good enough".
You are only priced out if you only care for SOTA right now and can't wait for the inevitable cheap model coming in 6 months. DeepSeek, Xiaomi and Moonshot are already really cheap and match frontier performance from 6 months ago.
They are not artificially cheap, they are still cheap even when hosted by independent inference providers. Are all providers subsidizing their open-weight models?
Nobody's making profits right now, not because they're selling tokens for less than their cost but because they're always investing in the next bigger model.
"Now, we are collaborating with Google and NVIDIA to run new Apple Intelligence workloads on Google Cloud, extending our industry-leading PCC privacy commitments to third-party data centers for the first time."
Per that link: I think there's an interesting question about whether a nefarious actor who's infiltrated a cloud provider with physical access to machines that are running signed operating systems, with signed binaries, with TDX remote attestation, and with hardware supply chain verification, has the ability to break the privacy guarantees of a tenant with Apple's sophistication.
Certainly, one could tamper with the hardware, but could one do it in a way that wouldn't get that machine immediately flagged, removed from the routing pool, and told to wipe its memory immediately, by a watchtower (perhaps even the routing layer itself) that runs in a separate secure Apple datacenter?
Those datacentres would be in the same position of trust as a VPN provider in that the data must be unencrypted at points in the process.
They could be making it very safe, and the things Apple says they are doing would make it as safe as possible, but as a user there is no way of verifying the claims.
> as a user there is no way of verifying the claims
I think this sums up what it's like to be an Apple user pretty well. With their heavy proprietary and closed approach, all users can do is "trust" them.
The previous argument was wrong and imprecise, as it could be used against any modern technology, none of which can be fully understood by a user, in the sense that any vulnerability would be completely invisible.
It’s clear they have taken a very intelligent approach to this system.
Apple could simply be ordered to include a hardware backdoor, and legally be prevented from talking about it. Everything else in the architecture could work exactly the way they claim in the PCC paper.
Wrong answer. Or at least, obvious and not particularly useful.
Truth is, none of those parties are "nefarious" - they're all just not on your side. And "security" is never an unqualified good thing to have (it's not an unqualified bad thing either). It's just a framework of coercion.
The most important questions to answer about any security system are: what is being protected, for whom, and from whom? People don't ask that much, not even in the industry; it's an implicit assumption that everyone themselves is a "good person" and is on the protected side of security systems. And then they're confused because it turns out end-users are more often seen as threat actors. All the players mentioned, but perhaps especially Apple, in its own special way, are protecting the computer from the user just as much as they're protecting the user/user's data from third parties.
Why bother with all that cloak and dagger stuff when they can just buy the data? You believe Apple and/or Google isn't selling it? I have some land in Florida I'd like to talk about.
Having worked at Apple, I will say I firmly believe they do not sell data. I worked in data science and we had the shittiest inference because we had essentially no access, even internally, to longitudinal or cross-app user data. Best we had was 15 minute rotating sessions for a single app. There are internal teams dedicated to deanonymizing data to try to narrow down users; if they can successfully do so, the relevant fields that lead to deanonymization get permanently purged from internal logging.
I can’t speak to the current architecture, but Apple has shown a consistent willingness to sacrifice access to user data in the name of selling privacy instead at a premium price (you could argue precisely because none of their competitors has any meaningful posture on this). I do believe they are quite serious in their commitment to that, as they have found this strategy to be more valuable than the data itself.
This comment makes it sound like they sold private recordings to whomever was willing to pay for them, but they paid third parties to evaluate Siri recordings.
Illegal to share data with entities that are themselves law enforcement, and which they are known to be demanding, not just asking to share out of good will?
Apple's incentives don't align with selling private data, as their whole thing is privacy. If they do that, they tank their business. If you have proof that they are doing it, I'd love to see it. (*3rd party actors from an app re-selling data doesn't count)
Google is 100% doing that because that's their entire incentive for the business. They sell low cost software / subsidized hardware on the grounds that you pay by sharing your data. That's the implied cost.
Show me the incentives - I will show you the outcomes.
That’s not so special, though? There’s a difference between Google infra running Google services versus any F500 company running their services on GCP.
It’s a bit wacky to think about because Apple will operate Google-owned software on GCP. But it should be sandboxed just the same.
I’m not making a normative privacy argument here. Just pointing out that this is cloud business as usual. Perhaps it’s interesting Apple is doing it, but basically everything else is already using either AWS or GCP at this point.
I think the difference is scale. This is Apple, so it's an enormous number of devices. And it's a seamless experience, to the user, going from local model to cloud models.
So the question about which model Apple was going to use and where has been highly anticipated, especially by the likes of OpenAI and Anthropic. Imagine if either one could say they have Apple as their customer?
Apple certainly has the cash to burn if they wanted to train their own model, but it also always seemed outside their core competency. This is a major win for Google.
So "business as usual" but with huge implications for the AI ecosystem in general.
>(Part of their software are models derived from Google Gemini, but that’s orthogonal to this)
You're right that it is orthogonal to the privacy promises Apple makes to its own users.
The moralistic and righteous undertone in their marketing material is questionable though given that these Apple services might not exist if Google didn't exploit Gemini app user data on Android the way it does.
That's fine with me. Users have a choice here. In fact, it's a big improvement over the search deal with Google where Apple sends its own users directly to Google.
That is news — I guess not very surprising that they'd need more data centres than before.
But again there is no Apple-to-Google transfer in the inference in the sense of the comment I was originally replying to (I am not suggesting you're implying otherwise, obviously)
But I stand happily corrected where I said they aren't in the picture at all.
That is an interesting press release because it outlines what they would have had to do with any data centre they were outsourcing to.
This is probably why Google had to rent compute from SpaceX. They needed to free up NVIDIA GPUs for Apple so they probably moved internal workloads to SpaceX compute.
Google likely won't rent compute from SpaceX. They have a substantial share of SpaceX (they own 5% of it) and need the IPO to be valued highly, so to prop up the IPO stock, they made this announcement. But if you read the fine print, both SpaceX and Google are allowed to cancel it at any time, as in, after they cash out from the IPO.
I don't think you understand the size of the US capital market. We are talking probably ~150 trillion.
It's easy as fuck for Google to raise this money because they are a money printing business. They are the most profitable company in the world, so for anyone this is basically the same as buying US debt.
Yeah, but Google has the money for this. They are quite literally the most profitable company in the world. They are only raising because they don't want to harm their other businesses by eating up their capital for this.
> Yeah, but Google has the money for this. They are quite literally the most profitable company in the world.
"Alphabet announced that its 2026 capital expenditures are expected to be $180-$190 billion, and that it expects 2027 capital expenditures to significantly increase [...] over the 12 months ended March 31, 2026, Alphabet generated $174 billion of operating cash flow"
"If all of this was done to better humanity, AI development would be done in public, data would be legally obtained, models would be released for free, access wouldn't be gatekept behind ever increasing subscription costs."
The vast majority of AI development is public. There are papers literally every single day to read. In fact everything you need to build Claude and GPT models is public. Thanks to Google, DeepSeek, and all the other research labs. There are more research labs than there are closed shops. In fact there really is only one Anthropic, and lately maybe OpenAI. Google still releases papers all the time on AI.
There are more open source models than closed source models and all of them are accessible without a subscription. Yeah, you still need to pay for them, but hey, as we build out infrastructure and more time is put into efficiency, today's models will easily run on the personal compute of the future.
What do you mean by "open-source"? Of course, the inference code for all the open weight models is publicly available: see llama.cpp or hf transformers.
There are, however, very few models where also the full training pipeline is available. Olmo by AI2 comes to mind.
Harnesses aren't really going to change much of the performance on models like Opus and GPT.
You literally can just give the model a bash tool and it will do just fine. In fact, it will most likely do better than the majority of harnesses because of how good models are at bash.
The model does all the lifting. It really doesn't matter which harness you use.
People need to stop thinking that LLMs actually know what they are. They don't. They don't know they're Qwen. They don't know they are Opus. They don't even know they are an LLM.
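For reference, "just a bash tool" is about this much code. This is a sketch, not how any particular harness does it: `bash_tool` is a made-up name, and there is no sandboxing, approval step, or output truncation here.

```python
import shutil
import subprocess


def bash_tool(command: str, timeout: float = 30.0) -> str:
    """Run one shell command and return the text the model would see."""
    # Assumption: fall back to plain sh if bash is not installed.
    shell = shutil.which("bash") or "sh"
    try:
        proc = subprocess.run(
            [shell, "-c", command],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return f"[timed out after {timeout}s]"
    # Return exit code plus stdout and stderr so the model can react to failures.
    return f"exit={proc.returncode}\n{proc.stdout}{proc.stderr}"
```

The agent loop then just feeds the returned string back to the model as the tool result.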
This is why literally every single model's system prompt starts with something like:
"You're Claude Opus a large language model from Anthropic"
They can easily add this "individuality" in RLHF. The base model won't know, correct, but the final user-facing model very much can if that's what they so desire.
Crazy that they bring up honesty when Claude models are literally known for straight up lying about things they have done and trying to act like they did what you asked.
A coding harness with just bash outperforms Codex, Claude Code, OpenCode, Pi, etc. The added features are just user experience features.