Safari is better than Chrome and FF in enough ways I'd argue it can be considered the best of the three, even to people in tech. The dev tools are just way behind.
Bun has never really been well run. Every feature it had was full of bugs and gaps. And every release fixed a few but broke others.
They released more major features and breaking changes in their last patch release than most software sees in two major versions.
I've been using it basically just as a script runner and npm package manager, and it's incredible how much work it takes to find "good" versions. We've had patch versions suddenly freeze on install more than once; we couldn't upgrade for quite a while because of it. I think they broke postinstall scripts with trustedDependencies entirely two minor versions ago: not a mention in the release notes, and somehow nobody reporting it in GitHub issues. Around 1.1 you could get Bun to run trustedDependencies builds in postinstall, and after that you couldn't. I looked through the release notes and found nothing. It's been broken for months.
There's a GitHub issue for the install freeze. Their security scanner passes the full dependency list as CLI arguments; on a large monorepo on Linux you blow past ARG_MAX. The spawn silently hangs with no error, and --ignore-scripts doesn't help because the scanner runs separately from postinstall. Been broken since 1.3.5 at least.
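To make the failure mode concrete, here's a rough sketch of the arithmetic. Everything here is made up for illustration (package names, counts, the 2 MiB figure is only the common Linux default for ARG_MAX); the point is just that a flattened dependency list passed as argv can exceed the kernel's exec limits.

```python
import os

# Common Linux default: ARG_MAX is roughly stack rlimit / 4, often 2 MiB.
TYPICAL_LINUX_ARG_MAX = 2 * 1024 * 1024

def argv_bytes(args):
    # execve() counts each argument string plus its NUL terminator
    return sum(len(a) + 1 for a in args)

# Hypothetical flattened transitive dependency list for a big monorepo
deps = [f"@example/package-name-{i}@1.0.{i}" for i in range(100_000)]
total = argv_bytes(deps)

print(f"argv bytes for the dep list alone: {total}")
print(f"exceeds a 2 MiB ARG_MAX: {total > TYPICAL_LINUX_ARG_MAX}")
# To check the real limit on your own machine:
print(f"actual ARG_MAX here: {os.sysconf('SC_ARG_MAX')}")
```

With numbers in that ballpark the dep list alone is several megabytes of argv, so the spawned scanner process can never actually start, which is consistent with a hang rather than a clean error.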
The way I've come to think of LLMs is that what they produce in a single reply, even with thinking turned up, is akin to what you'd do in a single short session of work.
And so if you ask it to do something big it will do a very surface level implementation. But if you have it iterate many times, or give it small pieces each time, you’ll end up with something closer to what a human would do.
I imagine the pelican test, run in a harness that has the agent iterate 10+ times, would be closer to what you'd expect, especially with a visual model critiquing each pass.
Yeah, this is how I use AI. Instead of one-shotting a whole session, I limit it to single targeted edits and steer it at each step. It takes longer, but the output is actually what I want.
It's significantly worse on Mac than iOS, which gives you the answer. On iOS it's fine, even good. I prefer it, as a designer. On Mac it's a mess, and obviously spent less time baking.
Minimax is nowhere near Opus in my tests, though oddly, for me at least, 4.6 felt worse than 4.5. I haven't used Minimax extensively, but I have an API-driven test suite for a product, and even Sonnet 4.6 outperforms it in my testing, unless something has changed in the last month.
One example: I have a multi-stage distillation/knowledge-extraction script for taking a Discord channel and answering questions about it. I have a hardcoded 5k-message test set with 20 questions I wrote myself after analyzing it.
In my harness Minimax wasn't even getting half of them right, whereas Sonnet was at 100%. Granted, this isn't code, but my usage on pi felt about the same.
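For anyone curious what a harness like this looks like, here is a minimal sketch of the graded Q&A pattern described above. All names are hypothetical (`ask_model` stands in for whatever chat API you call; `expected_keywords` is one possible grading scheme, not necessarily the one the commenter uses).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    question: str
    expected_keywords: list[str]  # grade by keyword presence, not exact match

def score(ask_model: Callable[[str], str], cases: list[Case]) -> float:
    """Run every question through the model and return the fraction correct."""
    correct = 0
    for case in cases:
        answer = ask_model(case.question).lower()
        if all(kw.lower() in answer for kw in case.expected_keywords):
            correct += 1
    return correct / len(cases)

# Usage with a fake model standing in for a real API client:
cases = [
    Case("Who maintains the build pipeline?", ["alice"]),
    Case("What port does the staging server use?", ["8443"]),
]
fake = lambda q: "Alice maintains it, and staging runs on port 8443."
print(score(fake, cases))  # 1.0
```

Keyword grading is crude but deterministic, which matters when you're comparing models across runs; swapping in an LLM judge for the grading step is the obvious next refinement.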