Hacker News | 44za12's comments

Specialised models easily beat SOTA; case in point: https://nehmeailabs.com/flashcheck


All of us use more or less the same keyboards; maybe the numbers we type "randomly" aren't as random as we'd like to think. Just like "asdf" and "xcvb" are common strings because those keys sit next to each other, there has to be some pattern here as well.


Especially for those very large numbers in the top ten (like 166884362531608099236779 with 6779 searches), and the relatively small number of total "votes" (probably less than a million), I think the only likely explanation for their rank is ballot-stuffing.


That means there is less entropy than purely random strings, not that this specific number would be so far outside the distribution. My money would be on someone hammering it.
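A quick back-of-envelope supports that. Using a birthday-problem estimate (the ~1M vote count and the uniformity assumption are both guesses, not figures from the site):

```python
# If ~1e6 "random" 24-digit numbers were typed uniformly at random,
# how likely is it that ANY value repeats even once?
votes = 1_000_000
space = 9 * 10**23  # count of 24-digit numbers

# Expected number of colliding pairs (birthday-problem estimate).
expected_pairs = votes * (votes - 1) / 2 / space
# ~5.6e-13: even a single duplicate is essentially impossible,
# let alone 6779 hits on one specific value.
```

So under any remotely random model, one number getting thousands of hits has to come from repeated deliberate entry, not keyboard bias.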


This is the way. I actually mapped out the decision tree for this exact process and more here:

https://github.com/NehmeAILabs/llm-sanity-checks


That's interesting. Is there any kind of mapping to these respective models somewhere?


Yes, I included a 'Model Selection Cheat Sheet' in the README (scroll down a bit).

I map them by task type:

- Tiny (<3B): Gemma 3 1B (could try 4B as well), Phi-4-mini (good for classification).
- Small (8B-17B): Qwen 3 8B, Llama 4 Scout (good for RAG/extraction).
- Frontier: GPT-5, Llama 4 Maverick, GLM, Kimi.
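A toy sketch of dispatching by task type (the model names, tier mapping, and keyword heuristic here are all placeholders, not anything from the repo; in practice the classifier would itself be a tiny model):

```python
# Hypothetical task router: a cheap classifier picks a tier,
# then the task goes to that tier's model.
TIERS = {
    "classification": "gemma-3-1b",   # tiny
    "extraction": "qwen-3-8b",        # small
    "open_ended": "frontier-model",   # frontier
}

def classify_task(prompt: str) -> str:
    """Stand-in for a tiny-LLM classifier: a crude keyword heuristic."""
    text = prompt.lower()
    if "extract" in text:
        return "extraction"
    if "label" in text or "classify" in text:
        return "classification"
    return "open_ended"

def route(prompt: str) -> str:
    """Return the model assigned to this prompt's tier."""
    return TIERS[classify_task(prompt)]
```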

Is that what you meant?


At the risk of stating the obvious: do you have a tiny LLM gating this decision, classifying the task and directing it to the appropriate solution?


>Before you reach for a frontier model, ask yourself: does this actually need a trillion-parameter model?

>Most tasks don't. This repo helps you figure out which ones.

About a year ago I was testing Gemini 2.5 Pro and Gemini 2.5 Flash for agentic coding. I found they could both do the same task, but Gemini Pro was way slower and more expensive.

This blew my mind because I'd previously been obsessed with "best/smartest model", and suddenly realized what I actually wanted was "fastest/dumbest/cheapest model that can handle my task!"


For simple extraction tasks, a delimiter-separated string uses 11 tokens vs 35 for JSON. Output tokens are the latency bottleneck.
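A minimal sketch of the two output shapes (the field names and order here are made up for illustration; the delimited form only works if both sides agree on a fixed field order):

```python
import json

# The same extracted record as JSON vs. a pipe-delimited string.
# The delimited form is much shorter, and for LLMs the number of
# output tokens is what dominates latency.
record = {"name": "Ada Lovelace", "year": "1815", "field": "mathematics"}

json_form = json.dumps(record)        # {"name": "Ada Lovelace", ...}
delim_form = "|".join(record.values())  # Ada Lovelace|1815|mathematics

# Parsing the delimited form back is a single split,
# given the agreed field order.
name, year, field = delim_form.split("|")
```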


Shameless plug.

I’ve been using a CLI tool I created over two years ago, and it just works. I had more ideas but never got around to incorporating them.

https://github.com/44za12/horcrux


6 years for me if we're counting :)

https://github.com/edify42/otp-codegen


Love the minimalism.


Have been using remove.bg for this for years now.


Yes, I’ve built a free tool that delivers the same background removal results as remove.bg


Like a semaphore?


A semaphore limits concurrency; this one automatically groups (batches) input.
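A minimal sketch of the difference, assuming an asyncio-style queue (illustrative, not any particular library's API):

```python
import asyncio

# Automatic batching: callers enqueue items; a worker flushes them in
# groups. A batch closes when it is full or when no new item arrives
# within max_wait seconds.
async def batch_worker(queue, handle_batch, max_size=8, max_wait=0.05):
    while True:
        batch = [await queue.get()]
        try:
            while len(batch) < max_size:
                batch.append(await asyncio.wait_for(queue.get(), timeout=max_wait))
        except asyncio.TimeoutError:
            pass
        handle_batch(batch)

# Contrast with a semaphore, which would only cap how many calls run
# at once; it never merges inputs into one call.
batches = []

async def demo():
    q = asyncio.Queue()
    for i in range(5):
        q.put_nowait(i)
    worker = asyncio.create_task(
        batch_worker(q, batches.append, max_size=3, max_wait=0.01))
    await asyncio.sleep(0.2)
    worker.cancel()

asyncio.run(demo())
# batches is now [[0, 1, 2], [3, 4]]
```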


I’ve had great luck with all Gemma 3 variants; on certain tasks the quantized 27B version has worked as well as 2.5 Flash. Can’t wait to get my hands dirty with this one.


Can you benchmark Kimi K2 and GLM 4.5 as well? Would be interesting to see where they land.


That was quick, vibe coded, I presume?


The CSS animations are very revealing on that front from a performance perspective.


I tend to blame performance issues on the developer writing the code on a top-of-the-line computer. There are too many WebGL effects on startup websites that were built to run on an M4 Max.


> There are too many WebGL effects on startup websites that were built to run on an M4 Max.

Tale as old as time. When the Retina display Macs first came out, we saw web design suddenly stop optimizing for 1080p-or-less displays (and at the time, 1366x768 was the default resolution for Windows laptops).

As much suffering as it'd be, I swear we'd end up with better software if we stopped giving devs top-of-the-line machines and just issued whatever budget laptop is on sale at the local Best Buy on any given day.


At my work every dev had two machines, which was great. The test machine is cattle: you don't install GCC on it, you reflash it whenever you need to, and you test on it routinely. It's also the cheapest model a customer might have. Your dev machine, meanwhile, is a pet: a beast with all your favorite packages installed on it.


Develop on a supercomputer, test on a $200 laptop - not really any suffering that way.


To keep a fast feedback loop, build on the fast machine, deploy, test on the slow one.


I wouldn't go that far, but maybe split the difference at a modern i3 or the lowest spec Mac from last year.

It would be awesome if Apple or someone else could have an in-OS slider to drop the specs down to that of other chips. It'd probably be a lot of work to make it seamless, but being able to click a button and make an M4 Max look like an M4 would be awesome for testing.


Tbh even the absolute lowest-spec M-series Macs are insanely powerful; probably best to test on a low-end x86 laptop.


No no no... go one better for the Mac. It should be whichever devices are next to be made legacy under Apple’s 7-year support window. That way you’re actually catering to the lowest common denominator.


Yeah, this is somewhat stuttery on an M2 Mac.


It's less than 200 lines of CSS. Easily doable by a human in 30 minutes.


I love how this has to be defended now, as if that was somehow unthinkable from a domain expert.

