irthomasthomas · 2026-05-27T09:54:37 1779875677

Like the lion in the zoo that gets food thrown to him.

wwalexander · 2026-05-27T10:33:47 1779878027

Who do you think is going to throw food at you?

irthomasthomas · 2026-05-26T20:53:07 1779828787

Insane. 3 points behind opus on the artificialanalysis index.

Mimo cost ~$400 at the old price, so about $40 today. Opus cost ~$5000.

That's over 100x cheaper, and just 3 points behind.
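As a quick check of that arithmetic, here is a minimal sketch using only the rough figures quoted above. What exactly those dollar amounts measure is not stated in the comment, so the variable names are just illustrative labels.

```python
# Approximate figures quoted in the comment (not independently verified).
mimo_old_cost = 400.0   # "~$400 at the old price"
mimo_new_cost = 40.0    # "about $40 today"
opus_cost = 5000.0      # "~$5000"


def cost_ratio(expensive: float, cheap: float) -> float:
    """Return how many times more the expensive option costs."""
    return expensive / cheap


ratio_at_old_price = cost_ratio(opus_cost, mimo_old_cost)
ratio_at_new_price = cost_ratio(opus_cost, mimo_new_cost)

print(f"Opus vs Mimo at old price: {ratio_at_old_price:.1f}x")
print(f"Opus vs Mimo at new price: {ratio_at_new_price:.1f}x")
```

With these inputs the ratio at the new price is 125x, consistent with "over 100x".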

I can't wait to experiment with an llm consortium of 100 deepseek and mimo models. Crazy times.

Shut up and take my m̶o̶n̶e̶y̶ data!

Edit: Gemini on google search told me I could write strikethrough text on hn using <s>. Mimo told me it was unsupported and then went on to list some tags that are supported, like <b>bold</b>. I tried copy pasting the word in strikethrough from a word processor but it lost the format. I ended up using mimo in an agent shell wrapper to produce it, and copy pasting from the terminal worked for some reason.
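For anyone wanting to reproduce the struck-through text without an agent, one common way is to follow every character with a Unicode combining overlay. The sketch below assumes the overlay is U+0336 (COMBINING LONG STROKE OVERLAY); the comment does not say which code point was actually used, and `strike` is a hypothetical helper name.

```python
# Assumption: U+0336 is the combining character used for the strikethrough effect.
COMBINING_LONG_STROKE_OVERLAY = "\u0336"


def strike(text: str) -> str:
    """Append the combining overlay after each character of text."""
    return "".join(ch + COMBINING_LONG_STROKE_OVERLAY for ch in text)


print(strike("money"))
```

Because the result is plain Unicode text rather than markup, it survives copy-pasting into fields that strip formatting, though how it renders depends on the font.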

_davide_ · 2026-05-27T07:24:36 1779866676

I had a subscription before the price was cut; the model kept randomly looping with the same character (burning 30% of the budget in one shot), and the overall performance for agentic purposes is, simply put, terrible. It finds nonexistent bugs and randomly removes chunks of code to fix them, then even presents it as an "extra fix". Maybe it's a good generalist model; I haven't tested it in that regard.

MiniMax (currently 2.7), which is a ~270B model tuned exclusively for agentic purposes, performs so MUCH better; it's more reliable and cheaper. Both are still far away from Opus 4.7, which I'm using at work. IMO benchmarks are just a very rough estimate; everyone cheats as much as they can get away with. Test the model yourself; do not make any assumptions based on the benchmarks.

I would love to see specialized, cheaper, bleeding-edge models like MiniMax for other non-agentic purposes as well. Why pay $1 for a general model when, for example, you can pay $0.1 for a content-moderator model that you actually need?

zarify · 2026-05-27T09:53:13 1779875593

Funny, I had the opposite experience with MiniMax and Mimo when using OpenCode. MiniMax got stuck with looping through broken tool calls all the time and MiMo just powered through things and for the most part just worked.

maxdo · 2026-05-27T03:07:34 1779851254

benchmarks we deserve: google search quick ai answers vs full llm model :)

irthomasthomas · 2026-05-27T09:44:24 1779875064

search answers use Flash 3.5

maxdo · 2026-05-27T14:31:02 1779892262

they use a "low" flavor of it to scale it on billions of users

megous · 2026-05-27T08:47:47 1779871667

So I tried the $16/mo token plan. Burned through 31% of monthly budget in one 1-2h session of a small C project refactoring, saw some not great behavior (hey subagent, read me back these 6 files exactly - which probably burned a lot of output tokens) and will cancel, obviously.

This is waaaaay more constrained than even Claude Pro plan, let alone Deepseek V4 or Kimi K2.6 pricing.

noman-land · 2026-05-26T21:22:46 1779830566

What did MiMo say?

irthomasthomas · 2026-05-26T21:26:36 1779830796

Says it's not supported and lists a few tags that are, like <b>bold</b>

Does this work: s̶t̶r̶i̶k̶e̶t̶h̶r̶o̶u̶g̶h̶

noman-land · 2026-05-27T00:56:17 1779843377

Well done. Unicode wins again. 𓂺

irthomasthomas · 2026-05-26T17:00:35 1779814835

I think they are illegal already in most places under the insurable interest doctrine.

It's a small step from betting on ships sinking to making sure they go down.

irthomasthomas · 2026-05-26T00:39:51 1779755991

Most CFCs are from industrial freon use, not domestic.

China continued using freon until 2019. They used it to make insulation. The gases will continue leaking from these buildings for a long time.

irthomasthomas · 2026-05-24T18:19:09 1779646749

A lot of words to say that Sam Altman bought up the world's total supply of RAM chips for the next few years.

Auracle · 2026-05-24T19:23:50 1779650630

A dick move or just really prescient?

regularfry · 2026-05-24T19:49:16 1779652156

It's only prescient if it works out. But it's a dick move either way.

irthomasthomas · 2026-05-19T19:48:56 1779220136

This is a perfect illustration of something I noticed with LLM progress. Ask them to improve an SVG like this, and it never fixes the missing crossbar or disconnected limbs, it just adds more stuff. In this example they have obviously improved greatly, and it contains a ridiculous amount of detail, but they still get the basic shape of the frame wrong. It's weird. And the pattern shows up everywhere: try it with a webpage and it will add more buttons and stuff. I've even experimented with feeding the broken pelican SVGs to an image model to look for flaws, and they still fail to spot the broken elements.

edit: fixed human hallucination

derefr · 2026-05-19T20:29:57 1779222597

When you say "improve an svg like this", how are you imagining setting that workflow up? Are you just feeding them the SVG to iterate on; or are you giving them access to a browser to look at the rendering of the SVG?

I ask because:

Insofar as the original pelican test is zero-shot, it effectively serves as a way to test for the presence of a kind of "visual imagination" component within the layers of the model, that the model would internally "paint" an SVG [or PostScript, etc] encoding of an image onto, to then extract effective features from, analyze for fitness as a solution to a stated request, etc.

But if you're trying to do a multi-shot pelican, then just feeding back in the SVG produced in the previous attempt, really doesn't correspond to any interesting human capability. Humans can't take an SVG of a pelican and iteratively improve upon it just based on our imagined version of how that SVG renders, either! Rather, a human, given the pelican, would simply load the pelican SVG in a browser; look at the browser's rendering of the pelican; note the things wrong with that rendering; and then edit the SVG to hopefully fix those flaws (and repeat.)

I imagine current (multi-modal and/or computer-use) LLMs would actually be very good at such an "iterative rendered pelican" test.
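The render, inspect, and edit loop described above can be sketched roughly as follows. This is only an outline under stated assumptions: `render`, `find_flaws`, and `apply_fixes` are hypothetical callables standing in for a browser or rasterizer, a vision-capable model, and an editing step; none of them is a real library API.

```python
from typing import Callable, List, Tuple


def iterate_on_rendering(
    svg: str,
    render: Callable[[str], bytes],                # hypothetical: SVG text -> image bytes
    find_flaws: Callable[[bytes], List[str]],      # hypothetical: image -> list of flaws
    apply_fixes: Callable[[str, List[str]], str],  # hypothetical: edit SVG for given flaws
    max_rounds: int = 5,
) -> Tuple[str, int]:
    """Repeat render -> inspect -> edit until no flaws are reported.

    Returns the final SVG and the number of edit rounds performed.
    """
    for round_number in range(max_rounds):
        image = render(svg)
        flaws = find_flaws(image)
        if not flaws:
            return svg, round_number
        svg = apply_fixes(svg, flaws)
    return svg, max_rounds


# Toy stand-ins so the loop can run end to end; they do not render or inspect anything real.
toy_render = lambda s: s.encode("utf-8")
toy_find_flaws = lambda img: [] if b"crossbar" in img else ["missing crossbar"]
toy_apply_fixes = lambda s, flaws: s + "<!-- crossbar -->"

final_svg, rounds = iterate_on_rendering(
    "<svg/>", toy_render, toy_find_flaws, toy_apply_fixes
)
print(rounds, final_svg)
```

The point of the structure is that the feedback comes from the rendered image rather than from re-reading the SVG text, which is the distinction being drawn between the zero-shot test and an iterative one.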

irthomasthomas · 2026-05-19T20:39:23 1779223163

I'm talking about two types of improvement: model improvement and prompt-based improvement. I am noticing that the baseline output has a lot more going on, the model has improved, yet it still makes those obvious-looking mistakes with the shape of the frame or disconnected limbs etc.

And I am saying that if you take one of these SVGs and ask an LLM to look for flaws, it rarely spots those obvious flaws and instead suggests adding a sunset and fish in the bird's mouth.

tskj · 2026-05-20T13:20:29 1779283229

This is also my gripe with a lot of this stuff, always evaluating models on what they can literally oneshot is completely pointless; it's not how anything works, neither for humans nor for scaffolded AIs. I guess it's neat if you want to argue that a certain level of intelligence can "never be achieved" in a single forward pass, but like, so what. No one cares about that, except people who have already decided to be anti AI.

(not that I am in any sense pro AI, but it's just a weird lack of intellectual rigor)

irthomasthomas · 2026-05-20T14:40:23 1779288023

Asking a model to improve its output is not one-shotting tho? My observation was that asking an LLM to iterate and improve a response causes it to add more stuff, rather than repair the broken stuff. And that model progress in general has the same pattern. This new model adds more details to its responses but continues to make mistakes at about the same rate.

losvedir · 2026-05-20T16:15:34 1779293734

The question was whether you were giving it the rendered image and using the model's visual modal capability, or feeding back in the textual SVG.

It's hard to "imagine" what the rendered SVG looks like, for both humans and LLMs, so just iterating on text won't really be as useful of a test. But if you show it what it rendered, it might observe the bad-looking bicycle and be able to fix the text that way.

irthomasthomas · 2026-05-20T17:31:41 1779298301

"I've even experimented with feeding the broken pelican svgs to an image model to look for flaws, and they still fail to spot the broken elements."

stared · 2026-05-19T23:01:46 1779231706

To a certain extent, it feels like a Sonnet 3.7 moment. Slightly overeager - you ask for a button color change, you see layout changes, new package dependencies, and the README rewritten from scratch - and not necessarily correctly.

When I ask for a pelican on a bike, I want the Platonic ideal of a pelican on a bike, not a vision of an alternative reality in which pelicans created bikes. Though, thinking about it again, maybe I should.

p1esk · 2026-05-20T00:30:35 1779237035

What is “Sonnet 3.7 moment”?

stirfish · 2026-05-20T01:35:14 1779240914

Slightly overeager - you ask for a button color change, you see layout changes, new package dependencies, and the README rewritten from scratch - and not necessarily correctly.

dormento · 2026-05-20T18:58:08 1779303488

Sonnet 3.7 tried its damnedest but it was just kinda "off".

gowld · 2026-05-20T00:22:54 1779236574

It's because LLMs are fundamentally generative (creative), not truth-seeking or logic-seeking. Simple logic has always been incredibly expensive to impossible for LLMs.

Araopa · 2026-05-20T02:05:55 1779242755

So we have to train llms on debugging too, not just how to make things (which you easily train by feeding the outputs).

sosborn · 2026-05-20T01:00:09 1779238809

This matches my experience with humans too FWIW.

emp17344 · 2026-05-20T01:12:14 1779239534

Why is there always an identical reply like this when anyone criticizes LLMs?

girvo · 2026-05-19T22:30:58 1779229858

Their ability is best described as "spiky". To steal from aphyr: think kiki, more than bouba. What's interesting is that a lot of the models seem to have similar spikes and "troughs", though there are differences.

irthomasthomas · 2026-05-19T19:39:25 1779219565

And they are using this to power search answers?

CooCooCaCha · 2026-05-19T20:14:45 1779221685

I bet the API pricing helps pay for search users

irthomasthomas · 2026-05-16T15:52:31 1778946751

I think it's less misleading this way because every other reader would have to pay $1.3M to emulate his workflow for a similar size project. His discounted internal costs are relevant only to OpenAI.

Tiberium · 2026-05-16T17:12:02 1778951522

I did mention that you could use ~60 $200 Codex accounts to emulate his workflow without /fast, or 2.5x that if you used /fast. Not $1.3M
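To make the gap between the two figures concrete, here is a rough sketch of the account-based estimate using only the numbers in this exchange. It assumes the $200 is a flat price per account for whatever period the project takes; the comment does not state the billing period, so treat the totals as illustrative.

```python
# Figures taken from the two comments above; the billing period is an assumption.
accounts_without_fast = 60       # "~60 $200 Codex accounts"
price_per_account = 200          # dollars per account
fast_multiplier = 2.5            # "2.5x that if you used /fast"
api_price_estimate = 1_300_000   # the "$1.3M" figure

cost_without_fast = accounts_without_fast * price_per_account
cost_with_fast = accounts_without_fast * fast_multiplier * price_per_account

print(f"Without /fast: ${cost_without_fast:,.0f}")
print(f"With /fast:    ${cost_with_fast:,.0f}")
print(f"$1.3M is about {api_price_estimate / cost_without_fast:.0f}x the no-/fast estimate")
```

Under that assumption the subscription route comes to $12,000 without /fast and $30,000 with it, versus $1.3M.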

irthomasthomas · 2026-05-15T14:29:04 1778855344

I don't believe anything out of these startups anymore unless it's backed by evidence.

Too expensive? Why would anthropic train a model too expensive to run? I doubt they would. Let's look at the evidence: Opus 4.5 came in at double the speed and half the price of old opus. Its speed matched older sonnet models. Higher Speed + Lower price = smaller model. So they rebranded sonnet sized models to opus. Where is the og opus sized model?

irthomasthomas · 2026-05-15T09:30:23 1778837423

Arc has no predictive power whatsoever. I always use the best models available. So far I haven't found a task that Chinese models cannot solve very quickly and reasonably. Do you have any examples where they failed for you?