OpenAPI is primarily for machine-to-machine communication, which needs determinism and is optimized for certain cases (e.g. time in unix format with ms accuracy). MCP is optimized for a different use case, where the LLM has many limitations but a good "understanding" of text. Instead of sending `{ user: {id: 123123123123, first_name: "XYZYZYZ", "last_name": "SDFSDF", "gender": "..."..... } }` you could return "Mr XYZYZYZ" or "Mrs XYZYZYZ".
The LLM doesn't need all of that and can't parse it anyway without additional tools (e.g. why should it spend tokens even trying to convert a unix timestamp just to understand the time?).
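A minimal sketch of the idea, as it might look inside an MCP tool handler: collapse the machine-oriented record into a short human-readable string before it reaches the model. The field names (`gender`, `last_seen_ms`) and the formatting rules are assumptions for illustration, not any real API.

```python
from datetime import datetime, timezone

def format_user_for_llm(user: dict) -> str:
    """Turn a machine-oriented user record into a short, LLM-friendly
    string. Hypothetical example: the 'gender' and 'last_seen_ms'
    fields are assumed, not part of any real schema."""
    title = {"male": "Mr", "female": "Mrs"}.get(user.get("gender"), "")
    name = f"{title} {user['last_name']}".strip()
    # Convert the unix timestamp (milliseconds) to readable UTC time,
    # so the model never has to burn tokens on epoch arithmetic.
    last_seen = datetime.fromtimestamp(user["last_seen_ms"] / 1000,
                                       tz=timezone.utc)
    return f"{name}, last seen {last_seen:%Y-%m-%d %H:%M} UTC"

print(format_user_for_llm({
    "id": 123123123123,
    "first_name": "XYZYZYZ",
    "last_name": "SDFSDF",
    "gender": "male",
    "last_seen_ms": 1700000000000,
}))
```

The point is that the conversion happens deterministically on the tool side, where it's cheap and exact, instead of inside the model, where it's expensive and error-prone.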
> Writing the code hasn’t been the bottleneck to developing software for a long time
It was!
Pre-2022, people needed developers to build software for them; now, with platforms like Replit and Lovable, people are creating their own tiny software projects, which wasn't easily accessible in the past.
If you say coding wasn't the bottleneck, then indirectly you're also saying you don't need developers. If you do need developers, the outcome of their other kinds of work (thinking, designing based on existing tools, and so on) is still CODE.
"XYZ Corp" won't allow their developers to write their desktop app in Rust because they want to consume only 16MB RAM, then another implementation for mobile with Swift and/or Kotlin, when they can release good enough solution with React + Electron consuming 4GB RAM and reuse components with React Native.
Strangely enough, AI could turn this on its head. You can have your cake and eat it too, because you can tell Claude/Codex/whatever to build you a full-featured Swift version for iOS and Kotlin for Android and whatever you want on Windows and Mac. There's still QA for the different builds, but you already have to QA each platform separately anyway if you really care that they all work, so in theory that doesn't change.
Of course, it's never that simple in reality; you need developers who know each platform for that to work, because you must run the builds and tell the AI what it's doing wrong and iterate. Currently, you can probably get away with churning out Electron slop and waiting for users to complain about problems instead of QAing every platform. Sad!
I am interested as well in what the future will look like. So far, what I am seeing is:
(1) specialized AI agent -> (2) we should add 1790 agents to be competitive -> (3) pivot to agentic workforce platform
Now we have lots and lots of agentic workforce platforms, and sandbox providers to run them. All have similar capabilities: create an agent for HR, create an agent for Sales, ...
I hope to see something interesting pop up; at least that was happening in the SaaS era, where people were inventing new ways of solving old problems: DocuSign, Salesforce, Zoho, ...
I think both product and engineering are lacking. The only things that work great today are the LLM models themselves.
Everything depends on "agents", but there is either barely any scaffolding around them or it's full spaghetti; at least it's hard to find one that's well constructed.
For instance, humans zoom around in cars; these cars don't spontaneously combust (most of the time), have seatbelts and airbags, and don't need an engine oil change every mile. Humans are amazing, and the cars are also relatively solidly engineered (at least the ones we drive around today).
The agent products that we have today are decidedly NOT that. Maybe for a single week openclaw was it, and then it decided to add a trawler and a fishhook to the car along with 1000 other additions, because why not? And that has been true for almost every LLM/AI product I have seen.
I think the winners here, such as they are, will be the companies that have a specialized service that actually does something, where any "agentic" functionality sits on top of that.
How is it that Meta spent so much money on talent and hardware, but the model barely matches Opus 4.6?
Especially looking at these numbers after Claude Mythos, it feels like either Anthropic has some secret sauce, or everyone else is dumber than the talent Anthropic has.
Meta made a bunch of mistakes, and it looks like Zuckerberg spent a lot of money on talent and made big swings to change that (which happened about a year ago).
I think it’s unrealistic to expect them to come back from that pit to the top in one year, but I wouldn’t rule them out getting there with more time. That’s a possible future. They have the money and Zuckerberg’s drive at the helm. It can go a long way.
If they actually matched Opus 4.6 on such a short timeline, it would have been mighty impressive. (Keep in mind this is a new lab and they are prohibited from doing distills.)
Friends at Meta with access to the model + personal experience at Meta.
Meta's performance process is essentially "show good numbers or you're out." So guess what people do when they don't have good numbers? They fudge them. Happens all across the company.
Re: changes, there's been enormous turnover in AI organizations, and in theory this one was developed by a "new" org. Whether that means less or more benchmaxxing is anyone's guess.
More, I'd guess, since the new org needs to prove itself long enough for their stock to vest. Fudging the benchmarks gives them a longer horizon before they're all fired anyway.
Anthropic has mostly just been focused on coding/terminal work longer, and their pro-tier model is coding focused, unlike the GPT and Gemini pro-tier models, which have been optimized for science.
Their whole "training the LLM to be a person" technique probably contributes to its pleasant conversational behavior, and making its refusals less annoying (GPT 5.2+ got obnoxiously aligned), and also a bit to its greater autonomy.
Overall they don't have any real moat, but they are more focused than their competition (and their marketing team is slaying).
Autonomy for agentic workflows has nothing to do with "replying more like a person", you have to refine the model for it quite specifically. All the large players are trying to do that, it's not really specific to Anthropic. It may be true however that their higher focus on a "Constitutional AI"/RLAIF approach makes it a bit easier to align the model to desirable outcomes when acting agentically.
You think it has nothing to do with it. Even they have only a loose understanding of the final results of trying to treat Claude like a real being in terms of how the model acts.
For example, Claude has a "turn evil in response to reinforced reward hacking" behavior which is a fairly uniquely Claude thing (as far as I've seen anyhow), and very likely the result of that attempt to imbue personhood.
Yup, it's called test-time compute. Mythos is described as plenty slower than Opus, enough to seriously annoy users trying to use it for quick-feedback-loop agentic work. It is most properly compared with GPT Pro, Gemini DeepThink or this latest model's "Contemplating" mode. Otherwise you're just not comparing like for like.
I have not delved into the theory yet, but it seems the smaller open-source models already do this to an extent. They have fewer parameters but spend much more time/tokens reasoning, as a way to close the performance gap. If you look at "tokens per problem" on https://swe-rebench.com/ this seems to be the case, at least.
No massacre is justified, but can you remind us how and where Hamas got helicopters and tanks, and how all of a sudden all the cars were smashed? Maybe the Hannibal directive handed them their tanks.
It was actually upvoted by so many people because of reason and evidence.
Also, please stop playing the race card; no one is blaming a race. People are pointing to the country that is carrying out these cruelties, with the majority of its government supporting it and the majority of its army executing the commands.
Better for them to make billions directly from corporations than to give it to average people who might get a chance out of poverty (but also to bad actors using it to do even more bad things).
Anthropic's definition of "safe AI" precludes open-source AI. This is clear if you listen to what he says in interviews; I think he might even prefer OpenAI's closed-source models winning to having open-source AI (because at least the former isn't a free-for-all).