Safari is better than Chrome and FF in enough ways I'd argue it can be considered the best of the three, even to people in tech. The dev tools are just way behind.
Bun has never really been well run. Every feature it had was full of bugs and gaps. And every release fixed a few but broke others.
They released more major features and breaking changes in their last patch release than most software sees in two major versions.
I've been using it basically just as a script runner and npm package manager, and it's incredible how much work it takes to find "good" versions. We've had patch versions suddenly freeze on install more than once; we couldn't upgrade for quite a while because of it. I think they broke postinstall scripts with trustedDependencies entirely two minor versions ago: not a mention in the release notes, and somehow nobody reporting it in GitHub issues. Around 1.1 you could get Bun to run trustedDependencies builds in postinstall, and after that you couldn't. I looked through the release notes and found nothing. It's been broken for months.
There's a GitHub issue for the install freeze. Their security scanner passes the full dependency list as CLI arguments; on a large monorepo on Linux you blow past ARG_MAX. The spawn silently hangs with no error, and --ignore-scripts doesn't help because the scanner runs separately from postinstall. Been broken since 1.3.5 at least.
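To make the failure mode concrete, here's a rough sketch of the arithmetic. Everything here is made up for illustration (package names, counts, the 2 MiB figure is only the common Linux default for ARG_MAX); the point is just that a flattened dependency list passed as argv can exceed the kernel's exec limits.

```python
import os

# Common Linux default: ARG_MAX is roughly stack rlimit / 4, often 2 MiB.
TYPICAL_LINUX_ARG_MAX = 2 * 1024 * 1024

def argv_bytes(args):
    # execve() counts each argument string plus its NUL terminator
    return sum(len(a) + 1 for a in args)

# Hypothetical flattened transitive dependency list for a big monorepo
deps = [f"@example/package-name-{i}@1.0.{i}" for i in range(100_000)]
total = argv_bytes(deps)

print(f"argv bytes for the dep list alone: {total}")
print(f"exceeds a 2 MiB ARG_MAX: {total > TYPICAL_LINUX_ARG_MAX}")
# To check the real limit on your own machine:
print(f"actual ARG_MAX here: {os.sysconf('SC_ARG_MAX')}")
```

With numbers in that ballpark the dep list alone is several megabytes of argv, so the spawned scanner process can never actually start, which is consistent with a hang rather than a clean error.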
The way I've come to think of LLMs is that what they produce in a single reply, even with thinking turned up, is akin to what you'd do in a single short session of work.
And so if you ask it to do something big it will do a very surface level implementation. But if you have it iterate many times, or give it small pieces each time, you’ll end up with something closer to what a human would do.
I imagine the pelican test, run in a harness that has the agent iterate 10+ times, would be closer to what you'd expect, especially with a visual model critiquing each pass.
Yeah, this is how I use AI. Instead of one-shotting a whole session, I limit it to single targeted edits and steer it at each step. It takes longer, but the output is actually what I want.
It's significantly worse on Mac than iOS, which gives you the answer. On iOS it's fine, even good. I prefer it, as a designer. On Mac it's a mess, and obviously spent less time baking.
Minimax is nowhere near Opus in my tests, though oddly, for me at least, 4.6 felt worse than 4.5. I haven't used Minimax extensively, but I have an API-driven test suite for a product, and even Sonnet 4.6 outperforms it in my testing, unless something has changed in the last month.
One example: I have a multi-stage distillation/knowledge-extraction script for taking a Discord channel and answering questions about it. I have a hardcoded 5k-message test set with 20 questions I wrote myself after analyzing it.
In my harness Minimax wasn't even getting half of them right, whereas Sonnet was at 100%. Granted, this isn't code, but my usage on pi felt about the same.
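For anyone curious what a harness like this looks like, here is a minimal sketch of the graded Q&A pattern described above. All names are hypothetical (`ask_model` stands in for whatever chat API you call; `expected_keywords` is one possible grading scheme, not necessarily the one the commenter uses).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    question: str
    expected_keywords: list[str]  # grade by keyword presence, not exact match

def score(ask_model: Callable[[str], str], cases: list[Case]) -> float:
    """Run every question through the model and return the fraction correct."""
    correct = 0
    for case in cases:
        answer = ask_model(case.question).lower()
        if all(kw.lower() in answer for kw in case.expected_keywords):
            correct += 1
    return correct / len(cases)

# Usage with a fake model standing in for a real API client:
cases = [
    Case("Who maintains the build pipeline?", ["alice"]),
    Case("What port does the staging server use?", ["8443"]),
]
fake = lambda q: "Alice maintains it, and staging runs on port 8443."
print(score(fake, cases))  # 1.0
```

Keyword grading is crude but deterministic, which matters when you're comparing models across runs; swapping in an LLM judge for the grading step is the obvious next refinement.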