> This started breaking down after ~30 files. Sometimes it would miss a file. Sometimes it would triple-test a bundle of files and take 10 minutes instead of 3. An error in one file would convince it that it needed to re-test four previous files, for no reason. It was very frustrating.
Sorry, you thought a prompt was a suitable replacement for a testing suite?
For each of the four gemma-4-*-it models, an associated small model, gemma-4-*-it-assistant, has been published for use with MTP.
A GGUF file generated for MTP must include both the big model and the small model. Another comment referenced a llama.cpp PR that also updates the Python program used for conversion from safetensors files, which presumably handles combining the two paired Gemma 4 models.
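For reference, the plain (pre-PR) conversion flow looks roughly like the sketch below. The script name and the --outfile/--outtype flags are what llama.cpp ships today; the model directory and output name are placeholders, and how the PR actually wires in the paired *-assistant model is entirely my assumption, not something I've verified:

```python
import subprocess

# Standard llama.cpp conversion: safetensors directory -> single GGUF file.
# convert_hf_to_gguf.py and --outfile/--outtype are real llama.cpp options;
# the directory and output names are placeholders. Whatever mechanism the PR
# uses to fold in the paired *-assistant model is not shown here -- that part
# is my guess, so check the PR itself.
subprocess.run(
    [
        "python", "convert_hf_to_gguf.py",
        "gemma-4-26b-a4b-it/",                 # placeholder: safetensors + config dir
        "--outfile", "gemma-4-26b-a4b-it.gguf",
        "--outtype", "q8_0",
    ],
    check=True,
)
```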
Really excited to try this once it is merged into llama.cpp.
Gemma 4 26B-A4B is already about 3x quicker on my setup than Qwen3.6-35B-A3B, so the thought of a further 1.5x speedup is tantalizing.
I've tried draft models with limited success (even a 3B draft model running alongside a dense 14B Ministral model introduced too much overhead).
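If the overhead of a second model is the killer, llama-cpp-python exposes prompt-lookup decoding, a related speculative technique that drafts tokens by matching n-grams already in the context, so there's no draft model to load at all. A minimal sketch, with the model path as a placeholder (this is prompt-lookup, not the paired-draft-model setup discussed above):

```python
from llama_cpp import Llama
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding

# Prompt-lookup decoding drafts candidate tokens from n-grams already present
# in the prompt/context, so no second model is loaded -- near-zero overhead.
llm = Llama(
    model_path="./models/gemma-4-26b-a4b-it.gguf",  # placeholder path
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10),
    n_ctx=8192,
)

out = llm("Summarize speculative decoding in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

It helps most on tasks with lots of verbatim repetition (refactoring, log rewriting); on free-form generation the drafted n-grams rarely hit, so the gain shrinks.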
I only started playing around with local inference a couple weeks ago. Prior to that I was just using Gemini via web since it came with my Workspace subscription, but I did not want to be reliant on the cloud.
Others will have a better idea, since they've been messing around with local inference longer than I have, but I am quite impressed with the models I have been loading on my laptop with only an iGPU. As of this week I no longer feel like I am playing second fiddle, stuck with slow inference and small models. Gemma 4 (and maybe Qwen3.5, which I haven't tried yet) seems to have changed the game this month!
Even after trying some absolutely shiiiiite models (I only had 16GB of unified RAM at the start), I was impressed enough that I splashed $300 to double my RAM. I am happy that this one-time cost was enough to break through to smarter models and faster inference. No ongoing cloud costs!
It's awesome. Even on a trash computer you can run a small model that works just about as well as anything else for basic questions, for free and with no privacy issues. It's gotta be the future.
Just last week, I was trying to map the weird and wonderful column names emitted by a NetScaler’s detailed REST log into OpenTelemetry semantics.
The NetScaler is basically abandoned by its dying vendor, so its newer features, like sending logs directly to Splunk-compatible receivers, are essentially undocumented. I'm sure there's like three of us masochists out there stumbling our way through the brambles.
The OpenTelemetry end is a mess of "deprecated" and "beta", mirroring the shifting sands of other cloud-native projects like Kubernetes.
Even with carefully curated Markdown documentation references and sample logs, every modern "frontier" AI makes basic mistakes and hallucinates like crazy, no matter how I stuff its context.
This isn’t an Erdős problem! This is just getting logs from point A to point B.
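For what it's worth, the unglamorous fix I ended up sketching is just a rename table. The NetScaler column names on the left are illustrative (mine differed per build; check your own appliance's output), while the attribute names on the right are from the current OpenTelemetry HTTP semantic conventions:

```python
# Hypothetical NetScaler REST-log columns (left) mapped onto OpenTelemetry
# semantic-convention attributes (right). The left-hand names are
# illustrative, not from vendor docs; the right-hand names are real semconv.
COLUMN_TO_OTEL = {
    "clientip":         "client.address",
    "vserver_ip":       "server.address",
    "http_method":      "http.request.method",
    "http_url":         "url.path",
    "http_resp_status": "http.response.status_code",
    "user_agent":       "user_agent.original",
}

def to_otel_attributes(row: dict) -> dict:
    """Rename known columns; park unknown ones under a vendor prefix."""
    attrs = {}
    for col, value in row.items():
        attrs[COLUMN_TO_OTEL.get(col, f"netscaler.{col}")] = value
    return attrs

print(to_otel_attributes({"clientip": "10.0.0.7", "http_resp_status": "503"}))
```

Keeping unmapped columns under a `netscaler.` prefix at least means nothing gets silently dropped while you figure out the rest.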
(I would drop this somewhere more relevant, but apparently replying to a thing from 17 days ago would be necroing here, and PMs are hard or something)
> [1] With what prompt!? I like the terse output! Do share...
Not sure if it's complete or completely right, but the matter came up in a recent session, and when I asked what gave, Jim 'n' I came up with some supposedly relevant factors. At least one is news to me: supposedly, steering LLMs with negative instructions isn't counterproductive anymore (not that I'd been resisting the temptation anyway):
It's so true. I bet 80% of the questions normal people ask ChatGPT/Copilot could be answered by an 8B model trained on recent data.
I don't think people realize how small the gap is between free-to-cheap models and frontier models. This is going to be commoditized a lot faster than the marketing will catch up. Once cash gets tight or prices rise, it's more or less done for.
Especially considering some of the small free/cheap models can one-shot code now.
I've been experimenting with the 26B-A4B model with some surprisingly good results, both in inference speed and code quality (15 tok/s, flying along!), vs my last few experiments with Devstral 24B. Not sure whether I can fit that 35B Qwen model everybody's so keen on into my 32GB of unified RAM; rough math below.
However, I think I may be in the minority of HN commenters exploring models for local inference.
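Back-of-envelope for the fit question: weights take roughly params times effective bytes-per-weight at a given quant, plus a few GB for KV cache and the OS. The bytes-per-weight figures and the overhead below are rule-of-thumb assumptions, not measured sizes:

```python
# Rough fit check for a 35B model in 32 GB of unified RAM.
# Bytes-per-weight values are approximate averages for common GGUF quants;
# OVERHEAD_GB (KV cache, context, OS) is guesswork -- tune for your setup.
PARAMS_B = 35
BYTES_PER_WEIGHT = {
    "Q8_0":   1.07,
    "Q6_K":   0.82,
    "Q4_K_M": 0.57,
}
OVERHEAD_GB = 6
RAM_GB = 32

for quant, bpw in BYTES_PER_WEIGHT.items():
    weights_gb = PARAMS_B * bpw
    total_gb = weights_gb + OVERHEAD_GB
    verdict = "fits" if total_gb <= RAM_GB else "too big"
    print(f"{quant}: ~{weights_gb:.0f} GB weights, ~{total_gb:.0f} GB total -> {verdict}")
```

By that estimate a Q4_K_M quant should squeeze in, while Q6_K and above won't, assuming the OS and context don't eat more than budgeted.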