Hacker News | julianlam's comments

Reader mode also works well

> This started breaking down after ~30 files. Sometimes it would miss a file. Sometimes it would triple-test a bundle of files and take 10 minutes instead of 3. An error in one file would convince it it needs to re-test four previous files, for no reason. It was very frustrating.

Sorry, you thought a prompt was a suitable replacement for a testing suite?


hey man it barely works, costs a bunch of money every time we run it, and we can't trust the results. relax.

If you are invested in AI stocks, this is the way. You are basically funneling money from software companies into your brokerage account. Keep going.

So then these models could be used by llama.cpp today with the -md switch?

Interesting, must try tomorrow.
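For context, llama.cpp's existing draft-model support is wired up through the `-md`/`--model-draft` flag on `llama-server`. A minimal sketch of that setup (the GGUF filenames below are hypothetical placeholders, and whether MTP reuses this flag or adds a new one depends on the pending PR):

```shell
# Target model plus small draft model for speculative decoding.
# Both GGUF paths are hypothetical placeholders.
llama-server \
  -m ./models/gemma-4-26b-a4b-it-Q4_K_M.gguf \
  -md ./models/gemma-4-26b-a4b-it-assistant-Q8_0.gguf \
  -c 8192 --port 8080
```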


Does this mean there will be new Gemma 4 models released with MTP, or are they already available in existing models + quants?

For each of the four gemma-4-*-it models, an associated small model, gemma-4-*-it-assistant, has been published for use with MTP.

If a GGUF file is generated for MTP, it must include both the big model and the small model. There was a reference in another comment to a PR for llama.cpp, which also included updates for the Python program used for conversion from the safetensors files, which presumably can handle the combining of the two paired Gemma 4 models.
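For reference, the stock conversion path from a safetensors checkout to GGUF looks like this; the MTP PR presumably extends this script to also fold in the paired *-assistant model, so check the PR itself for the exact option (the model directory and output name here are illustrative):

```shell
# Stock llama.cpp conversion from safetensors to GGUF.
# How the *-assistant model gets combined in is defined by the PR,
# not by this stock invocation.
python convert_hf_to_gguf.py ./gemma-4-26b-a4b-it \
  --outfile gemma-4-26b-a4b-it-f16.gguf --outtype f16
```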


They have now been released on e.g. Hugging Face with the model suffix "-assistant".

Really excited to try this once it is merged into llama.cpp.

Gemma 4 26B-A4B is much quicker on my setup vs Qwen3.6-35B-A3B (by about 3x), so the thought of a 1.5 speedup is tantalizing.

I have tried draft models with limited success (even a 3B draft model paired with a dense 14B Ministral model introduced too much overhead).
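Draft overhead can sometimes be tuned down rather than abandoned: recent llama.cpp builds expose knobs for how many tokens the draft model speculates per step, which limits wasted draft work when the acceptance rate is low. A sketch (flag names as of recent builds; verify against `llama-server --help` for your version, and the model paths are hypothetical placeholders):

```shell
# Shorter speculation runs reduce wasted draft-model work when
# few drafted tokens get accepted. Paths are placeholders.
llama-server \
  -m ./models/ministral-14b-Q4_K_M.gguf \
  -md ./models/draft-3b-Q4_K_M.gguf \
  --draft-max 8 --draft-min 1
```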


On vLLM with a 5090 I get 120-180 TPS with the AWQ 4-bit quant + MTP speculative decoding.

For Gemma 4 26B, same quantization, I get >200 TPS.

Also note that Qwen is extremely inefficient in reasoning; its reasoning chains are ~3x longer than Gemma's on average.
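For anyone wanting to reproduce this kind of setup, recent vLLM releases take speculative-decoding options as a JSON blob. A sketch (the model id and `method` value here are illustrative assumptions, not verified for these particular models; check your installed vLLM version's docs):

```shell
# Serve an AWQ 4-bit quant with speculative decoding enabled.
# Flag and JSON key names follow recent vLLM releases; the model id
# and method value are illustrative placeholders.
vllm serve Qwen/Qwen3.6-35B-A3B-AWQ \
  --quantization awq \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 2}'
```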


We're all busy doing work instead of incessantly commenting about our models?

I only started playing around with local inference a couple weeks ago. Prior to that I was just using Gemini via web since it came with my Workspace subscription, but I did not want to be reliant on the cloud.

Others will have a better idea since they've been messing around with local inference longer than I, but I am quite impressed with the models I have been loading on my laptop with only iGPU. As of this week I no longer feel like I am playing second fiddle with slow inference and small models. Gemma 4 (and maybe Qwen3.5, haven't tried it yet) seem to have changed the game this month!

Even after trying some absolutely shiiiiite models (I only had 16GB of unified RAM at the start), I was suitably impressed, enough that I splashed $300 to double my RAM. I am happy that this one-time cost was enough to break through to smarter models and faster inference. No ongoing cloud costs!


It's awesome. Even on a trash computer you can run a small model that works just about as well as anything else for basic questions, for free and with no privacy issues. It's gotta be the future.

It's so interesting to see the wild pendulum swings of LLM sentiment here.

If one likes a model then it's capable of one-shotting entire apps.

Otherwise it's "only suitable for the most trivial tasks".

Never in between.


You're confusing "different people with different opinions" with "wild pendulum swings".

Personally my opinion in this regard is highly consistent over time.


I have no trivial tasks.

Just last week, I was trying to map the weird and wonderful column names emitted by a NetScaler’s detailed REST log into OpenTelemetry semantics.

The NetScaler is basically abandoned by its dying vendor. Hence, its new features like sending logs directly to Splunk compatible receivers are basically undocumented. I’m sure there’s like three of us masochists out there stumbling our way through the brambles.

The Open Telemetry end is a mess of “deprecated” and “beta”, copying the shifting sands of other cloud native projects like Kubernetes.

Even with carefully curated Markdown documentation references and sample logs, every modern “frontier” AI makes basic mistakes and hallucinates like crazy no matter how I stuff their context.

This isn’t an Erdős problem! This is just getting logs from point A to point B.


(I would drop this somewhere more relevant but apparently replying to a thing from 17 days ago would be necroing here + PMs are hard or something)

>[1] With what prompt!? I like the terse output! Do share...

Not sure if it's complete or completely right, but the matter came up in a recent session, and when asked what gave, Jim 'n' I came up with some supposedly relevant factors, at least one of which is news to me (supposedly, steering LLMs using negative instructions isn't counterproductive anymore (not that I'd been resisting the temptation anyway)):

https://gemini.google.com/share/7af54a6861d7#:~:text=What%20...

With the caveat (or bonus) that it can go (?:too)? far when told to "be blunt" and not to "pull punches":

https://i.vgy.me/WHRZD7.png (from 2024-09) (in this case, in user-config persistent instructions in Kagi's multi-LLM thing)


It's so true. I bet 80% of the questions normal people ask ChatGPT/Copilot could be answered by an 8B model trained on recent data.

I don't think people realize how small the gap is between free-to-cheap models and frontier models. It's going to be commoditized a lot faster than the marketing will catch up. Once cash gets tight or prices rise, it's more or less done for.

Especially considering some of the small free/cheap models can one-shot code now.


Interesting that Gemma 4 didn't crack the top 10.

I've been experimenting with the 26B-A4B model with some surprisingly good results (both in inference speed and code quality: 15 tok/s, flying along!), versus my last few experiments with Devstral 24B. Not sure whether I can fit the 35B Qwen model everybody's so keen on into my 32GB of unified RAM.

However I think I may be in the minority of HN commenters exploring models for local inference.


Can you elaborate on your setup? What harness are you using with Gemma 4 on your 32GB machine?

Lo-fi channels used to show the artist and song names. These newer ones don't bother with credits, or use made-up song titles.

E.g. "funky chicken jam"

