Hacker News | 44za12's comments

Specialised models easily beat SOTA; case in point: https://nehmeailabs.com/flashcheck


All of us use more or less the same keyboards; maybe the numbers we type "randomly" aren't as random as we'd like to think. Just like "asdf" and "xcvb" are common strings because those keys sit next to each other, there has to be some pattern here as well.


Especially for those very large numbers in the top ten (like 166884362531608099236779 with 6779 searches), and the relatively small number of total "votes" (probably less than a million), I think the only likely explanation for their rank is ballot-stuffing.


That means there is less entropy than purely random strings, not that this specific number would be so far outside the distribution. My money would be on someone hammering it.
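A quick back-of-envelope supports that. Using a birthday-problem estimate (the ~1M vote count and the uniformity assumption are both guesses, not figures from the site):

```python
# If ~1e6 "random" 24-digit numbers were typed uniformly at random,
# how likely is it that ANY value repeats even once?
votes = 1_000_000
space = 9 * 10**23  # count of 24-digit numbers

# Expected number of colliding pairs (birthday-problem estimate).
expected_pairs = votes * (votes - 1) / 2 / space
# ~5.6e-13: even a single duplicate is essentially impossible,
# let alone 6779 hits on one specific value.
```

So under any remotely random model, one number getting thousands of hits has to come from repeated deliberate entry, not keyboard bias.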


This is the way. I actually mapped out the decision tree for this exact process and more here:

https://github.com/NehmeAILabs/llm-sanity-checks


That's interesting. Is there any kind of mapping to these respective models somewhere?


Yes, I included a 'Model Selection Cheat Sheet' in the README (scroll down a bit).

I map them by task type:

- Tiny (<3B): Gemma 3 1B (could try 4B as well), Phi-4-mini (good for classification).
- Small (8B-17B): Qwen 3 8B, Llama 4 Scout (good for RAG/extraction).
- Frontier: GPT-5, Llama 4 Maverick, GLM, Kimi.
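A toy sketch of dispatching by task type (the model names, tier mapping, and keyword heuristic here are all placeholders, not anything from the repo; in practice the classifier would itself be a tiny model):

```python
# Hypothetical task router: a cheap classifier picks a tier,
# then the task goes to that tier's model.
TIERS = {
    "classification": "gemma-3-1b",   # tiny
    "extraction": "qwen-3-8b",        # small
    "open_ended": "frontier-model",   # frontier
}

def classify_task(prompt: str) -> str:
    """Stand-in for a tiny-LLM classifier: a crude keyword heuristic."""
    text = prompt.lower()
    if "extract" in text:
        return "extraction"
    if "label" in text or "classify" in text:
        return "classification"
    return "open_ended"

def route(prompt: str) -> str:
    """Return the model assigned to this prompt's tier."""
    return TIERS[classify_task(prompt)]
```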

Is that what you meant?


At the risk of stating the obvious: do you have a tiny LLM gating this decision, classifying the task and directing it to the appropriate solution?


>Before you reach for a frontier model, ask yourself: does this actually need a trillion-parameter model?

>Most tasks don't. This repo helps you figure out which ones.

About a year ago I was testing Gemini 2.5 Pro and Gemini 2.5 Flash for agentic coding. I found they could both do the same task, but Gemini Pro was way slower and more expensive.

This blew my mind because I'd previously been obsessed with "best/smartest model", and suddenly realized what I actually wanted was "fastest/dumbest/cheapest model that can handle my task!"


For simple extraction tasks, a delimiter-separated string uses 11 tokens vs 35 for JSON. Output tokens are the latency bottleneck.
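A minimal sketch of the two output shapes (the field names and order here are made up for illustration; the delimited form only works if both sides agree on a fixed field order):

```python
import json

# The same extracted record as JSON vs. a pipe-delimited string.
# The delimited form is much shorter, and for LLMs the number of
# output tokens is what dominates latency.
record = {"name": "Ada Lovelace", "year": "1815", "field": "mathematics"}

json_form = json.dumps(record)        # {"name": "Ada Lovelace", ...}
delim_form = "|".join(record.values())  # Ada Lovelace|1815|mathematics

# Parsing the delimited form back is a single split,
# given the agreed field order.
name, year, field = delim_form.split("|")
```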


Shameless plug.

I’ve been using a CLI tool I created over two years ago, and it just works. I had more ideas but never got around to incorporating them.

https://github.com/44za12/horcrux


6 years for me if we're counting :)

https://github.com/edify42/otp-codegen


Love the minimalism.


Have been using remove.bg for this for years now.


Yes, I’ve built a free tool that delivers the same background removal results as remove.bg


Like a semaphore?


A semaphore limits concurrency; this one automatically groups (batches) input.
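A minimal sketch of the difference, assuming an asyncio-style queue (illustrative, not any particular library's API):

```python
import asyncio

# Automatic batching: callers enqueue items; a worker flushes them in
# groups. A batch closes when it is full or when no new item arrives
# within max_wait seconds.
async def batch_worker(queue, handle_batch, max_size=8, max_wait=0.05):
    while True:
        batch = [await queue.get()]
        try:
            while len(batch) < max_size:
                batch.append(await asyncio.wait_for(queue.get(), timeout=max_wait))
        except asyncio.TimeoutError:
            pass
        handle_batch(batch)

# Contrast with a semaphore, which would only cap how many calls run
# at once; it never merges inputs into one call.
batches = []

async def demo():
    q = asyncio.Queue()
    for i in range(5):
        q.put_nowait(i)
    worker = asyncio.create_task(
        batch_worker(q, batches.append, max_size=3, max_wait=0.01))
    await asyncio.sleep(0.2)
    worker.cancel()

asyncio.run(demo())
# batches is now [[0, 1, 2], [3, 4]]
```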


I’ve had great luck with all Gemma 3 variants; on certain tasks the quantized 27B version has worked as well as 2.5 Flash. Can’t wait to get my hands dirty with this one.


Can you benchmark Kimi K2 and GLM 4.5 as well? Would be interesting to see where they land.


That was quick, vibe coded, I presume?


The CSS animations are very revealing on that front from a performance perspective.


I tend to blame performance issues on the developer writing the code on a top-of-the-line computer. There are too many WebGL effects on startup websites that were built to run on an M4 Max.


> There are too many WebGL effects on startup websites that were built to run on an M4 Max.

Tale as old as time. When the Retina display Macs first came out, we saw web design suddenly stop optimizing for 1080p-or-less displays (and at the time, 1366x768 was the default resolution for Windows laptops).

As much suffering as it'd be, I swear we'd end up with better software if we stopped giving devs top-of-the-line machines and just issued whatever budget laptop is on sale at the local Best Buy on any given day.


At my work every dev had two machines, which was great. The test machine is cattle: you don't install GCC on it, you reflash it whenever you need to, and you test on it routinely. It's also the cheapest model a customer might have. Your dev machine, meanwhile, is a pet: a beast with all your favorite packages installed on it.


Develop on a supercomputer, test on a $200 laptop - not really any suffering that way.


To keep a fast feedback loop, build on the fast machine, deploy, test on the slow one.


I wouldn't go that far, but maybe split the difference at a modern i3 or the lowest spec Mac from last year.

It would be awesome if Apple or someone else could have an in-OS slider to drop the specs down to that of other chips. It'd probably be a lot of work to make it seamless, but being able to click a button and make an M4 Max look like an M4 would be awesome for testing.


Tbh even the absolute lowest-spec M-series Macs are insanely powerful; probably best to test on a low-end x86 laptop.


No no no... go one better for the Mac. It should be whichever devices are next to be made legacy under Apple’s 7-year support window. That way you’re actually catering to the lowest common denominator.


Yeah, this is somewhat stuttery on an M2 Mac.


It's less than 200 lines of CSS. Easily doable by a human in 30 minutes.


I love how this has to be defended now, as if that was somehow unthinkable from a domain expert.

