
My M5 Pro is getting ~11 tokens per second via OMLX for an 8 bit quant.

Loading data and doing arbitrary transformations turns out to be a huge PITA with most other tools over a long enough timescale.

Gemma4 feels the most "claude-like" of all the models I've run locally on my M5 mbp.

I found on coding tasks that Qwen 3.5 can actually do the thing whereas Gemma 4 went off the rails frequently. Will try this new 3.6 release today.

I use Qwen 3.5 122B on an RTX PRO 6000 with OpenCode, and I'm very pleased. I don't feel a need to use a closed model any more. The result after answering questions in Plan mode is almost always what I want, with very few occasional bugs. It puts a lot of effort into seeing how the code I'm working on is currently written, and extends it in the same style.

If they release a Qwen 3.6 that also makes good use of the card, may move to it.


There was a Qwen 3.6 MoE six days ago that I thought was better than Gemma 4. Today's is a dense model. (Gemma released both a 26B MoE and a 31B dense model at the same time.)

I intend to evaluate all four on some evals I have, as long as I don't get squirrelled again.


So far I'm unimpressed for local inference. I got 11 tokens per second on OMLX on an M5 Pro with 128 GB of RAM, so it took an hour to write a few hundred lines of code that didn't work. Opus and Sonnet in CC completed the same task successfully in a matter of minutes. The 3.6:35b model seemed okay on ollama yesterday.

Need to check out other harnesses for this besides claude code, but the local models are just painfully slow.


One other thing you might want to check out for running locally. (I have not independently verified it yet; it's on the TODO list, though.)

https://docs.vllm.ai/en/latest/api/vllm/model_executor/layer...

vLLM apparently already has an implementation of turboquant available - which is said to losslessly reduce the memory footprint required by 6x and improve inference speed by 8x.

From what I understand, the steps are:

1. Launch vLLM.

2. Execute a vLLM configure command like "use kv-turboquant for model xyz".

3. That's it.

I've got two kids under 8 years old, a full time job, and a developer-tools project that takes like 105% of my mental interests... so there's been a bit of a challenge finding the time to swap from ollama to vLLM in order to find out if that is true.

So buyer beware :D - and also - if anyone tries it, please let me know if it is worth the time to try it!


I use the same computer as you do. m5 can run faster:

pip install mlx_lm

mlx_lm.convert --hf-path Qwen/Qwen3.6-27B --mlx-path ~/.mlx/models/Qwen3.6-27B-mxfp4 --quantize --q-mode mxfp4 --trust-remote-code

mlx_lm.generate --model ~/.mlx/models/Qwen3.6-27B-mxfp4 -p 'how cpu works' --max-tokens 300

Prompt: 13 tokens, 51.448 tokens-per-sec

Generation: 300 tokens, 35.469 tokens-per-sec

Peak memory: 14.531 GB


You have better specs than I do and I'm running the same model almost twice as fast through GGUF on llama.cpp. I'd try some different harnesses.

This is a dense model, so that's expected. On a Mac you'd want to try out the Mixture of Experts Qwen3.6 release, namely Qwen3.6-35B-A3B. On an M4 Pro I get ~70 tok/s with it. If your numbers are slower than this, it might be because you're accidentally using a GGUF-formatted model instead of MLX (an Apple-specific format that is often more performant on Macs).

I got about 7 tokens/sec generation on an M2 max macbook running 8-bit quant on an MLX version.

OpenCode seems to be a lot better than Claude at using local models.

For local models, you should check out https://swival.dev instead of Claude Code.

Just gotta do the Mississippi thing and hold kids back unless they meet standards. Don't leave them behind by pretending to leave no one behind.

School was like that back in Brazil in the 80s. Don’t get the grade, try again. And again. And again. Until you do. That is the real “not left behind”.

If kids are held back when they fail standards, shouldn't they also be allowed to race ahead when they exceed them?

Off the top of my head I don't have problem with this, but the topic is about declining scores so I'm not sure how relevant this is.

Yes. That's the whole idea of adaptive/personalized learning.

There are some interesting ethical questions about disparity of outcomes, but we should aim to have a system that educates each individual as best we can (without leaving kids behind). The current system of bludgeoning every individual student to be the same is not working well.


Always used to be the case, is that no longer an option?

Not an option in San Francisco.

Back in 2022 I spoke with the principal of a nearby government-run school. She told me:

- under no circumstances would a kindergarten teacher teach any first grade content, and

- grade skip is so rare that she'd seen it only once in her entire career


Teachers not being able to teach 'higher level' content seems like a huge issue.

No, because you’ll end up really dysfunctional if you’re a child in a class with 18 year olds.

Absolutely

A recent Atlantic article [1] by someone involved in the Mississippi reforms gives a good outline of what they did/are doing. It includes science-based reading curriculum and holding kids back (as you mentioned). It also includes other forms of accountability, including parental notification and empowering the state to force recalcitrant districts to improve. One notable quote:

"The law allowed the state to abolish these districts’ local school board and remove the local superintendent in favor of a state appointee who would report directly to the state board of education. A later amendment provided that removed local-school-board members would be barred from serving in that capacity again."

Politically unpopular in some cases (which local jurisdiction wants the state coming in and replacing your local school board?), but seems to be pretty effective.

[1] https://www.theatlantic.com/ideas/2026/04/mississippi-educat... (gift link)


The challenge is token speed. I did some local coding yesterday with qwen3.6 35b, and getting 10-40 tokens per second means the wall time is much longer. 20 tokens per second is a bit over a thousand tokens per minute, which is slower than the experience you get with Claude Code or the Opus models.

Slower and worse is still useful, but not as good in two important dimensions.
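The wall-time arithmetic is easy to sketch; a quick back-of-the-envelope in plain Python (the task size of ~10k generated tokens is my assumption, not a measurement):

```python
# Rough wall-time estimates for generating a given number of tokens
# at a steady throughput.

def minutes_to_generate(tokens: int, tokens_per_sec: float) -> float:
    """Wall time in minutes to emit `tokens` at a steady rate."""
    return tokens / tokens_per_sec / 60

# A few-hundred-line code change can easily be ~10k generated tokens
# once reasoning/retry tokens are counted (rough assumption).
task_tokens = 10_000

for label, tps in [("local, slow end", 10),
                   ("local, fast end", 40),
                   ("hosted frontier model", 80)]:
    print(f"{label}: {minutes_to_generate(task_tokens, tps):.0f} min")
```

At 10 tok/s that's nearly 17 minutes of pure generation for one attempt, before any failed attempts are retried.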


Also, benchmark scores are not empirical experience, and benchmarks are widely gamed. As other commenters have said, the actual observed behavior is inferior, so it's not just speed.

It’s ludicrous to believe a small-parameter-count model will outperform a well-made high-parameter-count model. That’s just magical thinking. We’ve not empirically observed any flattening of the scaling laws, and there’s no reason to believe the scrappy and smart Qwen team has discovered P=NP, FTL, or a magical non-linear parameter-count-scaling model.


Ooh, car analogy time!

It's kinda like saying a car with a 6L engine will always outperform a car with a 2L engine. There are so many different engineering tradeoffs, so many different things to optimize for, so many different metrics for "performance", that while it's broadly true, it doesn't mean you'll always prefer the 6L car. Maybe you care about running costs! Maybe you'd rather own a smaller car than rent a bigger one. Maybe the 2L car is just better engineered. Maybe you work in food delivery in a dense city and what you actually need is a 50cc moped, because agility and latency are more important than performance at the margins.

And if you're the only game in town, and you only sell 6L behemoths, and some upstart comes along and starts selling nippy little 2L utility vehicles (or worse - giving them away!) you should absolutely be worried about your lunch. Note that this literally happened to the US car industry when Japanese imports started becoming popular in the 80s...


This is just blind belief. The model discussed in this topic already outperforms “well made” frontier LLMs of 12-18 months ago. If what you wrote is true, that wouldn’t have been possible.

It's amazing that we can run models better than state of the art ~36 months ago on local consumer devices!

"Effective defense requires architectural change: treating OAuth apps as third‑party vendors, eliminating long‑lived platform secrets, and designing for the assumption of provider‑side compromise."

Designing for provider-side compromise is very hard because that's the whole point of trust...


As someone trying to think about OAuth apps at our SaaS, it certainly is very hard.

Do any marketplaces have a good approach here? I know Cloudflare, after their similar Salesloft issue, has proposed proxying all 3rd party OAuth and API traffic through them. But that feels a little bit like trading one threat vector for another.

Other than standard good practices like narrow scopes, shorter expirations, maybe OAuth client secret rotation, etc., I'm not sure what else can be done. Maybe allowlisting the IP addresses that requests associated with a given client can come from?
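The allowlisting idea can be sketched in a few lines of standard-library Python; this is a hypothetical server-side check (the registry and names are made up for illustration), run before honoring a token or API request:

```python
import ipaddress

# Hypothetical per-client allowlists, keyed by OAuth client_id.
CLIENT_ALLOWLISTS = {
    "client-abc": ["203.0.113.0/24", "198.51.100.7/32"],
}

def request_ip_allowed(client_id: str, remote_ip: str) -> bool:
    """Reject requests from IPs outside the client's registered ranges."""
    networks = CLIENT_ALLOWLISTS.get(client_id)
    if networks is None:
        return False  # unknown client: fail closed
    ip = ipaddress.ip_address(remote_ip)
    return any(ip in ipaddress.ip_network(n) for n in networks)
```

It wouldn't stop an attacker who compromises the client's own infrastructure, but it does stop replay of stolen tokens from anywhere else.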


This was probably partly a Google refresh token theft (given the length of the access). No inside info, just looking at how the attack occurred.

OAuth 2.1[0] (a draft RFC that has been around longer than I've been at my employer) recommends some protections around refresh tokens: either making them sender-constrained (tied to the client application by public/private key cryptography) or one-time use, with revocation if a token is used multiple times.

This is recommended for public clients, but I think makes sense for all clients.

The first option is more difficult to implement, but is similar to the IP address solution you suggest. More robust though.

The second option would have made this attack more difficult because the refresh token held by the legit client, context.ai, would have stopped working, presumably triggering someone to look into why and wonder if the tokens had been stolen.

0: https://datatracker.ietf.org/doc/html/draft-ietf-oauth-v2-1
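The second option (one-time-use refresh tokens with reuse detection) can be sketched like this; a toy in-memory store for illustration, not the draft's normative wording:

```python
import secrets

class RefreshTokenStore:
    """Toy one-time-use refresh token rotation with reuse detection.

    Each refresh token belongs to a 'family' (one per grant). If a token
    that was already rotated is presented again, the whole family is
    revoked -- the signal that a token may have been stolen.
    """

    def __init__(self):
        self._active = {}    # live token -> family_id
        self._spent = {}     # previously rotated token -> family_id
        self._revoked = set()

    def issue(self, family_id: str) -> str:
        token = secrets.token_urlsafe(32)
        self._active[token] = family_id
        return token

    def rotate(self, token: str):
        """Exchange a refresh token for a new one; returns None if rejected."""
        if token in self._spent:
            # Replay of a rotated token: assume theft, kill the family.
            self._revoked.add(self._spent[token])
            return None
        family = self._active.pop(token, None)
        if family is None or family in self._revoked:
            return None
        self._spent[token] = family
        return self.issue(family)
```

In the scenario above, once an attacker rotates a stolen token, the legitimate client's next refresh fails, which is exactly the alarm bell you want.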


Isn't one-time use of refresh tokens really common? Where each refresh gets you a new access token AND a new refresh token?

That's standard in OIDC, I believe.


I don't have data on whether it is common, but I know a few OAuth vendors support it. (I work for one such.)

In case anyone is wondering, here is the proposal from Cloudflare: https://blog.cloudflare.com/saas-to-saas-security/

TL;DR: visibility and control for all connections (marketplace apps, 4th parties) to your 3rd party SaaS platforms. See which connections are active as well as when and how much they transmit. Transparent token splitting to force connections through the proxy as well as instant token revocation and ACLs.

We are currently building this and would appreciate your feedback!


I mean, the admin account had visibility into clients' env vars; that's maybe not really great in the first place.

You'd think, but this is a JS dev world.

Next.js apps bake all env vars into the client-side code!! It's all public, unless you prefix the name with private_ or something.


This is incorrect.

You prefix with NEXT_PUBLIC_ to expose them in client-side code; everything else stays server-only.
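The mechanism is just a prefix filter over the environment at build time; a hypothetical Python illustration of the rule (Next.js implements this in its own build tooling, and its real prefix is NEXT_PUBLIC_):

```python
PUBLIC_PREFIX = "NEXT_PUBLIC_"

def client_visible_env(env: dict) -> dict:
    """Only variables carrying the public prefix get inlined into client
    bundles; everything else is available to server code only."""
    return {k: v for k, v in env.items() if k.startswith(PUBLIC_PREFIX)}
```

So the default is private, and exposure is opt-in per variable, not opt-out.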


Corroborates that zero-trust has, until now, been largely marketing gibberish. Security by design means incorporating concepts such as these and not assuming your upstream providers won't be utterly owned in a supply-chain attack.

It's important to occasionally execute or imprison a general to motivate the remaining generals. Rarely though.

Come back after 500 karma.

Organizations should do it not catastrophically wrongly, especially once a core design / concept is mostly solidified. Putting a little time into reliability and guardrails prevents a huge amount of downside.

I've been at organizations that don't think engineers should write tests because it takes too much time and slows them down...

