> However, I'm not convinced this shows GPT 5.5 being that much better than Opus 4.7. It could very well be the harness around it, the system prompts used in the harness and tools available.
A model that can make more effective use of the tools presented to it is going to do better. You're not wrong about the system prompt, either; it can have quite a pronounced effect, especially when what the agent is bridging to is not just a case of bash + read/write. You need the prompt (and tool descriptions) to steer and reinforce what the model should actually do, because most models are heavily over-trained on executing bash lines.
When it comes to more basic agent usage, where the agent just runs in a terminal and executes bash, most models are ultimately going to do just fine as long as you provide the very basics.
Regarding the case in your post, it could be any number of issues: the provider being oversubscribed and leaving less compute for your request, the model just not being particularly great, your previous context (in your original session) subtly nudging the model away from doing the correct thing, and so on.
The truth is that you can't really know the exact cause of the behavior you experienced, but I think you're also working hard to cope on behalf of Anthropic.
All in all, I think you're placing a bit too much faith in agents and their effect. If you slim down and use something like Pi instead, you'll likely get a more accurate sense of what agents do and don't do, and how they affect results. You can then add your own tools and experiment with how those change things as well.
I've written an agent that only allows models to send commands to Kakoune (a niche text editor that I use) and can say that building an agent that just executes bash + read/write in 2026 is probably the easiest proposition ever. I say this because a lot of my work has gone into steering models away from constantly trying to write bash lines; they all seem to tend toward it, so if that's all you wanted to do anyway, most of your work is already done. The vast majority of the work in those types of agents is better spent fixing model quirks and bad provider behavior on input/output.
> A model that can more effectively make use of the tools presented to it is going to be better.
Of course. What I was getting at is that if harness A doesn't expose certain useful tools that harness B does, it doesn't matter whether the model could use those tools.
> I think you're also working hard to cope on behalf of Anthropic
How on earth did you get that out of my post? I was just reporting on a recent experience I had, to make the point that harness+model is a very different thing from just the model when it comes to evaluating effectiveness and quality of output.
Erlang isn't a popular BEAM language outside of the companies that already use it, and by this point Elixir has probably taken a good chunk of that share just by capturing people outside of that context as well.
With that said, there is almost no reason to use Elixir: it doesn't provide enough on top of Erlang to be meaningful, and Erlang as a language is considerably easier to learn, so if you onboard people you'll have a better time. Erlang also has fewer features you probably shouldn't use, so in the long term you'll end up with simpler, more straightforward code than with Elixir.
For people who like a bit of order in their programming, Gleam is a better alternative to both of these; it just has a regrettable view of OTP. The libraries around processes do not seem very well thought out with regard to providing an experience similar to Erlang/Elixir when writing `gen_server`s, for example.
I haven't used whatever they've shipped in terms of a type system; it's possible that would be a positive differentiator for Elixir now.
With regard to `mix`, you can use literally only that if you like it: just set `erlc_paths: ["erl_src"]` in your project config (a sketch follows below) and write your actual code in Erlang in that path.
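For anyone who hasn't seen it, this is the whole setup; a minimal sketch, where the app name and version are placeholders (`erlc_paths` itself is a real Mix project option, defaulting to `["src"]`):

```elixir
defmodule MyApp.MixProject do
  use Mix.Project

  def project do
    [
      app: :my_app,
      version: "0.1.0",
      # Compile the .erl files you actually write from erl_src/
      # instead of (or in addition to) the default src/ directory.
      erlc_paths: ["erl_src"],
      deps: []
    ]
  end

  def application do
    [extra_applications: [:logger]]
  end
end
```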
It's also fine to just use `rebar3`, to be honest; `mix` as a tool is not such a large difference that it should make you more interested in Elixir, IMO.
Counter-point: I built an agent that can only interface with Kakoune, a much less common and more challenging situation for an LLM to find itself in, and an 8-bit-quantized Gemma4-A4B does remarkably better than Qwen3.6-35B-A3B (a model in a similar class) at actually figuring out how to get text into buffers.
Now, is this the usual use case? No; it's a benchmark I created specifically to put LLMs in situations where they can't just blast out bash commands, and instead have to interface with something else and adapt.
I'm just messing around with building agents, that's all. I'm not super interested in making ones that just sit in a terminal executing shell scripts because, truth be told, they're absolutely trivial to make and don't show any interesting parts of LLMs. Telling an agent that it is sitting in Kakoune is a whole lot more interesting: it really shows a lot of what LLMs aren't great at, and they have to fight their urge to spit out overwrought bash invocations, or at the very least find a way to fit those into something new.
So far the only tools the agent has access to are `evaluate_commands(commands=["...", "..."])` and `get_buffer_contents()`, which really makes the models work to get things done. I could make it super easy for them, but then it wouldn't be an interesting experiment.
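If anyone is curious, here's roughly what declaring those two tools looks like; a sketch in Elixir against a generic JSON-schema-style tool-calling API, where only the tool names come from the actual agent and the descriptions and schema shape are my guesses:

```elixir
# Sketch: tool declarations for a JSON-schema-style tool-calling API.
# Only the two tool names are real; everything else is assumed.
tools = [
  %{
    name: "evaluate_commands",
    description: "Send a list of Kakoune commands to the running session.",
    input_schema: %{
      type: "object",
      properties: %{
        commands: %{type: "array", items: %{type: "string"}}
      },
      required: ["commands"]
    }
  },
  %{
    name: "get_buffer_contents",
    description: "Return the contents of the current buffer.",
    input_schema: %{type: "object", properties: %{}}
  }
]
```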
If I were to try to make something more useful out of this, I'd probably add the ability for LLMs to list buffers, give them an easier path for executing shell scripts the way they prefer, make it easier to look up docs, and a few other things like that.
The tools and the interaction with Kakoune are really trivial to write; the agent writes to the session FIFO (a very simple binary format) and I extract information via my own FIFO that Kakoune writes to (currently used only for the buffer data).
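If you'd rather not touch the FIFO protocol at all, the stock `kak -p <session>` route (it reads commands on stdin) gets you the send half; a sketch, with the session name as a placeholder and no error handling:

```elixir
defmodule KakPipe do
  @session "agent"  # placeholder session name

  # Pipe a string of Kakoune commands into the session via `kak -p`,
  # then close the port so kak sees EOF and flushes.
  def send_commands(commands) when is_binary(commands) do
    exe = System.find_executable("kak")
    port = Port.open({:spawn_executable, exe}, args: ["-p", @session])
    Port.command(port, commands <> "\n")
    Port.close(port)
    :ok
  end
end

# e.g. KakPipe.send_commands("echo -debug hello from the agent")
```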
I think once you started using it more as a tool, and not as a pseudo-benchmark like I do, you'd probably think of even more things to add, but a lot of it comes down to making Kakoune's state visible and making shell spam (which the LLMs love) easier.
The staff sporadically saying "AI!" is a very nice touch. I found myself trying to run as many servers and as much load as possible while keeping the temperature stable. This is a neat little game, in a way (though it seems to auto-stabilize too easily).
Edit: Oh, there was a containment breach. My bad, humanity...
I cannot imagine how the people who live close to this type of nonsense feel, though, it must be horrible.
I'm basing this mostly on the segments that were done on how people in rural areas (generally quieter) were experiencing all kinds of issues from living too close to data centers.
I really liked this write-up; this is the type of LLM content I actually want to read from these people: a window into their world of putting together this odd artifact, something we can empathize with.
> a misunderstanding in the user's mind about how the agent works
On top of that, the agent is just doing what the LLM says to do, yet somehow Opus is not brought up except as a parenthetical in this post. Sure, Cursor markets safety it can't provide, but the model was the one that issued the tool call. If people like this think their data will be safe as long as they just use the right agent with access to the same things, they're in for a rude awakening.
From the article, apparently an instruction:
> "NEVER FUCKING GUESS!"
Guessing is literally the entire point: guess tokens in sequence and something resembling coherent thought comes out.
Just had one today: GPT-5.4, instead of adding the 10 lines I asked for (an addition that could be done pretty mechanically by looking at some previous code and adding a similar thing with different/new variable names), proceeded to rewrite 50 lines instead, because it was "cleaner". It was not. It also didn't add the thing I originally asked for, which was perplexing.
Over-editing is definitely not some long-gone problem. This was on xhigh thinking, because I forgot to set it lower.
Cursor seemingly went out of their way not to mention that they were actually running Kimi K2.5, and by that omission essentially made it seem like they had made their own model. They added a note about it to a blog post at some point, and then conveniently left it out again when they wrote a new one.
> Claude can do the expert stuff with a non-expert (at least in their mind)
Opus is far better at most surface-level tasks than at tasks that require deep knowledge and understanding of a domain; a complete generalist (who thus has only surface-level knowledge in many, many things) is far more replaceable with LLMs than someone who has deep knowledge in one.
Consider how LLMs are actually created: not from billions of repos with deep knowledge behind them. The majority of their knowledge comes from the massive amount of surface-level work that exists to be sampled from: React starter templates, starter templates plus whatever little customization someone needed, blog-tutorial-level stuff.
> > But for the other 95% of people, being able to just say "ok can you make it look more modern" and have 4 variants in 5 mins, (like me) Figma will lose users like me.
This does not describe thoughtful, good work. At best, this is a one-armed-bandit deal where you're gambling on getting something good in those 5 minutes. It sure sounds like a scenario where, if you end up with something good, it's mostly by accident.