It's amazingly vacuous isn't it? I think the most interesting read was the fact that they were surprised llama.cpp crashed when they used a bad set of commandline arguments.
Although in the section immediately above the observation they claimed that they ran 10 whole completions with 100% success rate. So who knows.
I have to admit I slightly miss the flood of AI-psychosis research papers that seemed to be popping up a couple of months ago. Good to know there's still one or two new ones floating around.
I'm not sure there's a one-stop shop for this at the moment. I think the process is:
* Have a box with sufficient spare (V)RAM -- probably 8G for simple categorization with qwen3.5-4b, and 24G or more for more intelligent categorization with qwen3.6-27b or gemma4-31b.
* Download or compile llama.cpp. Choose a model, then choose one of the "quantized" builds that will actually fit on your hardware. There are literally hundreds to thousands of these per model on Hugging Face.
* Spend half a day tuning command-line parameters until llama.cpp doesn't crash.
* Watch llama.cpp regularly OOM itself, then put it in a systemd service with a memory limit so it doesn't take the entire machine down when it dies.
* Download all your photos to a folder.
* Start vibing a Python script to categorize your images by repeatedly prompting the LLM with each image in turn.
* Spend days tweaking/refining the prompt to try to get the LLM to actually do what you want.
The endgame is one of:
* The local model categorizes your images. Yay.
* The local model is too slow and you give up. Boo.
* The local model is too slow, so you spend $1k-$10k on hardware. Your image categorization task becomes a cover story for buying new gear. Yay.
* The local model can't understand your categorization metric, so you give up. Boo.
* You eagerly await news of the next open model being released. Yay?
* You consider replacing your local model with a frontier model, but then you realize you'd be spending $500 to categorize your photos. Boo.
* You refuse to allow Google/Gemini/Anthropic to train on your nudes. Boo.
Yup, especially when for a lot of us, the price of the frontier subscription has become a cost of doing business over the last 6 months.
If you're already doing big boy stuff with big boy models, then... just carry on trucking!
Only place I'd differ is for vision/OCR tasks. Small/medium open weights models are as good as SoTa, and token prices for prefill are kinda very not worth it for larger batch tasks.
Other thing that people forget is, if you want to have even a smallish LLM as a reliable personal service, you've got to carve out 16-24 of (V)RAM and leave it permanently running.
The latest rounds of open weights vision language models are incredibly good. Like, massively good. Open weights vision capabilities trade blows with frontier models. Over the last few months I'd roughly rank capabilities as Gemini -> {chatgpt and SoTa open weights models} -> Claude.
qwen3.5-2b and qwen3.5-4b are great at document parsing. They can run on CPU
qwen3.6-27b and gemma4-31b are borderline better than the human eye in some cases. Their OCR isn't perfect, but they're seriously good. They can still run on the CPU but you'll be waiting minutes per document.
You can demand JSON, YAML, MD, or freeform text just by varying the prompt. Even if you have a custom template, you can just put that in the prompt and they'll do an OK-ish job.
There's also models that aren't in the r/locallama zeitgeist. IBM released a new 4b parameter model for structured text extraction last week, and there's a sea of recent chinese OCR models too.
IMO the open wights models are so good that in a lot of cases it's not worth paying frontier labs for OCR purposes. The only barrier to entry is the effort to set up a pipeline, and havin the spare CPU/GPU capacity.
I wonder if for a model that small with a permissive license it might not be worth their time to host a commercial grade inference stack?
Might be easier to chuck it over the fence and let other providers handle it as it'll run in almost any commercial grade card?
Also speculating, but I wonder if it might also create a bit of a pricing problem relative to Gemini flashlight depending on serving cost and quality of outputs?
As a comparison, despite being SotA for their size, the smallest qwen models on openrouter (27b and 35b) are not at all worth using, as there are way bigger and better models for less oricemon a per token basis
If you were to believe a lot of metrics Gemma 31B it’s much better than flash lite. It seems like I should be able to pay Google to use it and that should be at least a secretary, called action how I can do that but it’s missing from both the blog post entirely.
i dont know what are you talking about, i replaced an older gpt4o with a finetuned qwen. there is a huge amount of "AI, that can be done with those models, or partly by those models." Huge amount of people would not notice the difference. And if you prepare the context correctly, even bigger slice of people would not notice.
If it helps, I mean it in a really literal sense. qwen3.6 27b is currently $3.20 per million tokens on openrouter right now which is way overpriced. As good as the 27b is, kimi k2.5 $3.00 and it's just in another league in terms of capability. There's no reason to spend money on it.
And even alibaba's own qwen3.6-plus is $1.95, so it's kinda easy to come to a conclusion that alibaba (nor anyone else) is really interested in hosting that model.
And don't get me wrong, I fully agree with you, qwen3.6 27b is an amazing model. I run it on my own hardware and every day I'm constantly surprised with what it can zero shot.
Genuinely curious, what are you "fine tuning" these smaller models to do reliably? I hear this talked about a lot but very few people actually cough up examples, and I'd love to actually hear of one.
depends, a super small one finetuned to do function calling instead sending it to big model and waiting, instead, you ask for a revenue in last month, i do a small llm function call -> show results. some bigger ones, analysis, summary, classification.
what is great with smaller ones, and im looking at 2b, 4b is you can get a huge throughput with just vllm and a couple of consumer gpus.
what i usually do is basically distillation of a big one onto smaller one.
Using protobuf is practical enough in embedded. This person isn't the first and won't be the last. Way faster than JSON, way slower than C structs.
However protobuf is ridiculously interchangeable and there are serializers for every language. So you can get your interfaces fleshed out early in a project without having to worry that someone will have a hard time ingesting it later on.
Yes it's a pain how an empty array is a valid instance of every message type, but at least the fields that you remember to send are strongly typed. And field optionality gives you a fighting chance that your software can still speak to the unit that hasn't been updated in the field for the last five years.
On the embedded side, nanopb has worked well for us. I'm not missing having to hand maintain ad-hoc command parsers on the embedded side, nor working around quirks and bugs of those parsers on the desktop side
I feel like there's a semi-philosophical question somehwere here. Recursion is clearly a core computer science concept (see: many university courses, or even the low level implementation of many data structures), but it's surprisingly rare to see it in "day to day" code (i.e I probably don't write recursive code in a typical week, but I know it's in library code I depend on...)
But why do you think we live in a world that likes to hide recursion? Why is it common for tree data structure APIs to expose visitors, rather than expecting you write your own recursive depth/breadth-first tree traversal?
Is there something innate in human nature that makes recursion less comprehensible than looping? In my career I've met many programmers who don't 'do' recursion, but none who are scared of loops.
And to me the weird thing about it is, looping is just a specialized form of recursion, so if you can wrap your head around a for loop it means you already understand tail call recursion.
> but it's surprisingly rare to see it in "day to day" code
I rarely use them because I became tired of having to explain them to others, where I've never had to explain a simple while loop that accomplishes the same thing with, usually literally, a couple more lines of code.
From all of my experience, recursion is usually at the expense of clarity, and not needed.
I think it's related to the door memory effect [1]: you loose the sense of state when hopping into the function, even though it's itself.
Not "doing" recursion as a principle is often a sign the person has not been exposed to functional languages or relational kind of programming like Prolog. It often points at a lack of experience with what perhaps is not so mainstream.
Or the person is sensibly trying to make the code easier for other people to understand.
I am tech lead and architect for large financial systems written in Java but have done a bunch of Common Lisp and Clojure projects in the past. I will still avoid any recursion and as people to remove recursion from their PRs unless it is absolutely best way to get readable and verifiable code.
As a developer your job is not to look for intellectual rewards when writing code and your job is not to find elegant solutions to problems (although frequently elegant solutions are the best ones). Your job is taking responsibility for the reliability, performance and future maintenance of whatever you create.
In my experience there is nothing worse than having bright engineers on a project who don't understand they are creating for other less bright engineers who will be working with it after the bright engineer gets bored with the project and moves on to another green field, rewarding task.
The stack traces when something goes wrong are inscrutable under recursion. Same when looking at the program state using debuggers.
Fundamentally, the actual state of the program does not match the abstract state used when programming a recursive function. You are recursively solving subproblems, but when something goes wrong, it becomes very hard to reason about all the partial solutions within the whole problem.
> The stack traces when something goes wrong are inscrutable under recursion.
Hmm. This is a real issue, for the simple case. If tail recursion is not optimized correctly then you end up with a bunch of stack frames, wasted memory...
I propose partially this is a tooling issue, not a fundamental problem with recursion. For tail recursion the frames could be optimized away and a virtual counter employed.
For more complex cases, I'd argue it matters less. Saving on stack frames is still preferable, however this can be acheived with judicious use of inlining. But the looping construct is worse here, as you cannot see prior invocations of the loop, and so have generally no idea of program execution without resorting to log tracing, while even the inlined recusion will provide some idea of the previous execution path.
I don't agree with your last statement. I've been programming forever, and I understand recursion, and I use it, but I never equate it with loops in my mind (unless I'm deliberately thinking that way, like now) and I always find it more difficult to think about than loops.
The author is making a deliberate point about undefined behaviour in the article. Hence them not executing worked examples.
In fact, by not doing so they are making a subtle implicit statement that it is uninteresting to consider actually attempting to execute these snippets.
The third paragraph of the "P.S" of the article (you have to press submit to see it) is the one that really gives the game away.
More than implementation defined, for some you need context that simply isn't given. On the ones with mixed-type structs, even if you know what system it's compiled for you don't know if someone has used pragma pack 1 to byte pack the data instead of standard packing. Just seeing the struct, you still don't know.
I agree that in theory it would be cool to have C code that uses only defined behavior and works on all platforms for all eternity. However, I think most programs have a fairly clear understanding of what platforms (OS+arch) they are targeting and what compilers they are using to target those platforms.
If the compiler has defined behavior (and you have unit tests for that behavior) on all of these platforms, I don't think it is a huge deal. (Ideally you wouldn't... but sometimes its an accident or unavoidable)
As an example, while struct padding (problem 1) might not technically be in the spec, it is a cornerstone of FFI and every new compiler (that supports C FFI) has a way to compile structs with the same padding.
To my original point, if the article had instead given examples of compilers + architectures that produced different answers, I might feel differently. However, just saying mentioning that these weird edge cases are undefined (in the spec) doesn't mean much to me.
FWIW, 85k 4-input luts is huge by the standards of any softcore with "micro" in the name.
It comfortably surpasses the capacity of most of the Actel aerospace FPGAs that I tend to work with.
And I think it's so "micro" that the majority of all of Lattice's FPGAs wouldn't fit it either.
And the Lattice ECP5 is advertised with 85k LUTs, which would seemingly limit its use to edification as instantiating this softcore would consume the entire chip. For any other purpose if you wanted a chip that was only a CPU, you would buy a CPU.
The Lattice ECP5 must have been just an example, because in the Github repository there is another example about how to run Linux on it on a 33000 logic cell Artix FPGA board, so I assume that the core must take significantly less than 30000 cells.
Any PowerPC core, even a relatively small one like this, is much more powerful than the soft cores that are used in FPGAs when minimum size is desired.
There are also a few other bigger open-source POWER cores, which can be used for higher performance.
Although in the section immediately above the observation they claimed that they ran 10 whole completions with 100% success rate. So who knows.
I have to admit I slightly miss the flood of AI-psychosis research papers that seemed to be popping up a couple of months ago. Good to know there's still one or two new ones floating around.