
My M5 Pro is getting ~11 tokens per second via OMLX for an 8 bit quant.

Loading data and doing arbitrary transformations turns out to be a huge PITA with most other tools over a long enough timescale.

Gemma4 feels the most "claude-like" of all the models I've run locally on my M5 mbp.

I found on coding tasks that Qwen 3.5 can actually do the thing whereas Gemma 4 went off the rails frequently. Will try this new 3.6 release today.

I use Qwen 3.5 122B on an RTX PRO 6000 with OpenCode, and I'm very pleased. I don't feel a need to use a closed model any more. The result after answering questions in Plan mode is almost always what I want, with very few occasional bugs. It puts a lot of effort into seeing how the code I'm working on is currently written, and extends it in the same style.

If they release a Qwen 3.6 that also makes good use of the card, may move to it.


There was a Qwen 3.6 MoE six days ago that I thought was better than Gemma 4. Today's is a dense model. (Gemma released both a 26B MoE and a 31B dense model at the same time.)

I intend to evaluate all four on some evals I have, as long as I don't get squirrelled again.


So far I'm unimpressed for local inference. I got 11 tokens per second on OMLX on an M5 Pro with 128 GB of RAM, so it took an hour to write a few hundred lines of code that didn't work. Opus and Sonnet in CC completed the same task successfully in a matter of minutes. The 3.6:35b model seemed okay on ollama yesterday.

Need to check out other harnesses for this besides claude code, but the local models are just painfully slow.


One other thing you might want to check out for running locally. (I have not independently verified it yet; it's on the TODO list, though.)

https://docs.vllm.ai/en/latest/api/vllm/model_executor/layer...

vLLM apparently already has an implementation of turboquant available - which is said to losslessly reduce the memory footprint required by 6x and improve inference speed by 8x.

From what I understand, the steps are:

1. Launch vLLM.

2. Execute a vLLM configure command like "use kv-turboquant for model xyz".

3. That's it.

I've got two kids under 8 years old, a full time job, and a developer-tools project that takes like 105% of my mental interests... so there's been a bit of a challenge finding the time to swap from ollama to vLLM in order to find out if that is true.

So buyer beware :D - and also - if anyone tries it, please let me know if it is worth the time to try it!


I use the same computer as you do. m5 can run faster:

pip install mlx_lm

mlx_lm.convert --hf-path Qwen/Qwen3.6-27B --mlx-path ~/.mlx/models/Qwen3.6-27B-mxfp4 --quantize --q-mode mxfp4 --trust-remote-code

mlx_lm.generate --model ~/.mlx/models/Qwen3.6-27B-mxfp4 -p 'how cpu works' --max-tokens 300

Prompt: 13 tokens, 51.448 tokens-per-sec

Generation: 300 tokens, 35.469 tokens-per-sec

Peak memory: 14.531 GB


You have better specs than I do and I'm running the same model almost twice as fast through GGUF on llama.cpp. I'd try some different harnesses.

This is a dense model, so that's expected. On a Mac you'd want to try out the Mixture of Experts Qwen3.6 release, namely Qwen3.6-35B-A3B. On an M4 Pro I get ~70 tok/s with it. If your numbers are slower than this, it might be because you're accidentally using a GGUF-formatted model instead of MLX (an Apple-specific format that is often more performant on Macs).

I got about 7 tokens/sec generation on an M2 max macbook running 8-bit quant on an MLX version.

OpenCode seems to be a lot better than Claude at using local models.

For local models, you should check out https://swival.dev instead of Claude Code.

Just gotta do the Mississippi thing and hold kids back unless they meet standards. Don't leave them behind by pretending to leave no one behind.

School was like that back in Brazil in the 80s. Don’t get the grade, try again. And again. And again. Until you do. That is the real “not left behind”.

If kids are held back when they fail standards, shouldn't they also be allowed to race ahead when they exceed them?

Off the top of my head I don't have problem with this, but the topic is about declining scores so I'm not sure how relevant this is.

Yes. That's the whole idea of adaptive/personalized learning.

There are some interesting ethical questions about disparity of outcomes, but we should aim to have a system that educates each individual as best we can (without leaving kids behind). The current system of bludgeoning every individual student to be the same is not working well.


Always used to be the case, is that no longer an option?

Not an option in San Francisco.

Back in 2022 I spoke with the principal of a nearby government-run school. She told me:

- under no circumstances would a kindergarten teacher teach any first grade content, and

- grade skip is so rare that she'd seen it only once in her entire career


Teachers not being able to teach 'higher level' content seems like a huge issue.

No, because you’ll end up really dysfunctional if you’re a child in a class with 18 year olds.

Absolutely

A recent Atlantic article [1] by someone involved in the Mississippi reforms gives a good outline of what they did/are doing. It includes science-based reading curriculum and holding kids back (as you mentioned). It also includes other forms of accountability, including parental notification and empowering the state to force recalcitrant districts to improve. One notable quote:

"The law allowed the state to abolish these districts’ local school board and remove the local superintendent in favor of a state appointee who would report directly to the state board of education. A later amendment provided that removed local-school-board members would be barred from serving in that capacity again."

Politically unpopular in some cases (which local jurisdiction wants the state coming in and replacing your local school board?), but seems to be pretty effective.

[1] https://www.theatlantic.com/ideas/2026/04/mississippi-educat... (gift link)


The challenge is token speed. I did some local coding yesterday with qwen3.6 35b, and getting 10-40 tokens per second means the wall time is much longer. 20 tokens per second is a bit over a thousand tokens per minute, which is slower than the experience you get with Claude Code or the Opus models.

Slower and worse is still useful, but not as good in two important dimensions.
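The wall-time arithmetic is easy to sketch; a quick back-of-the-envelope in plain Python (the task size of ~10k generated tokens is my assumption, not a measurement):

```python
# Rough wall-time estimates for generating a given number of tokens
# at a steady throughput.

def minutes_to_generate(tokens: int, tokens_per_sec: float) -> float:
    """Wall time in minutes to emit `tokens` at a steady rate."""
    return tokens / tokens_per_sec / 60

# A few-hundred-line code change can easily be ~10k generated tokens
# once reasoning/retry tokens are counted (rough assumption).
task_tokens = 10_000

for label, tps in [("local, slow end", 10),
                   ("local, fast end", 40),
                   ("hosted frontier model", 80)]:
    print(f"{label}: {minutes_to_generate(task_tokens, tps):.0f} min")
```

At 10 tok/s that's nearly 17 minutes of pure generation for one attempt, before any failed attempts are retried.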


Also, benchmark scores are not empirical experience, and benchmarks are widely gamed. As other commenters have said, the actual observed behavior is inferior, so it's not just speed.

It’s ludicrous to believe a small-parameter-count model will outperform a well-made high-parameter-count model. That’s just magical thinking. We’ve not empirically observed any flattening of the scaling laws, and there’s no reason to believe the scrappy and smart Qwen team has discovered P=NP, FTL, or a magical non-linear parameter-count-scaling model.


Ooh, car analogy time!

It's kinda like saying a car with a 6L engine will always outperform a car with a 2L engine. There are so many different engineering tradeoffs, so many different things to optimize for, so many different metrics for "performance", that while it's broadly true, it doesn't mean you'll always prefer the 6L car. Maybe you care about running costs! Maybe you'd rather own a smaller car than rent a bigger one. Maybe the 2L car is just better engineered. Maybe you work in food delivery in a dense city and what you actually need is a 50cc moped, because agility and latency are more important than performance at the margins.

And if you're the only game in town, and you only sell 6L behemoths, and some upstart comes along and starts selling nippy little 2L utility vehicles (or worse - giving them away!) you should absolutely be worried about your lunch. Note that this literally happened to the US car industry when Japanese imports started becoming popular in the 80s...


This is just blind belief. The model discussed in this topic already outperforms “well made” frontier LLMs of 12-18 months ago. If what you wrote is true, that wouldn’t have been possible.

It's amazing that we can run models better than state of the art ~36 months ago on local consumer devices!

"Effective defense requires architectural change: treating OAuth apps as third‑party vendors, eliminating long‑lived platform secrets, and designing for the assumption of provider‑side compromise."

Designing for provider-side compromise is very hard because that's the whole point of trust...


As someone trying to think about OAuth apps at our SaaS, it certainly is very hard.

Do any marketplaces have a good approach here? I know Cloudflare, after their similar Salesloft issue, has proposed proxying all 3rd party OAuth and API traffic through them. But that feels a little bit like trading one threat vector for another.

Other than standard good practices like narrow scopes, shorter expirations, maybe OAuth client secret rotation, etc., I'm not sure what else can be done. Maybe allowlisting the IP addresses that requests associated with a given client can come from?
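The allowlisting idea can be sketched in a few lines of standard-library Python; this is a hypothetical server-side check (the registry and names are made up for illustration), run before honoring a token or API request:

```python
import ipaddress

# Hypothetical per-client allowlists, keyed by OAuth client_id.
CLIENT_ALLOWLISTS = {
    "client-abc": ["203.0.113.0/24", "198.51.100.7/32"],
}

def request_ip_allowed(client_id: str, remote_ip: str) -> bool:
    """Reject requests from IPs outside the client's registered ranges."""
    networks = CLIENT_ALLOWLISTS.get(client_id)
    if networks is None:
        return False  # unknown client: fail closed
    ip = ipaddress.ip_address(remote_ip)
    return any(ip in ipaddress.ip_network(n) for n in networks)
```

It wouldn't stop an attacker who compromises the client's own infrastructure, but it does stop replay of stolen tokens from anywhere else.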


This was probably partly a Google refresh token theft (given the length of the access). No inside info, just looking at how the attack occurred.

OAuth 2.1[0] (a draft RFC that has been around longer than I've been at my employer) recommends some protections around refresh tokens: either making them sender-constrained (tied to the client application by public/private key cryptography) or one-time use, with revocation if a token is used multiple times.

This is recommended for public clients, but I think makes sense for all clients.

The first option is more difficult to implement, but is similar to the IP address solution you suggest. More robust though.

The second option would have made this attack more difficult because the refresh token held by the legit client, context.ai, would have stopped working, presumably triggering someone to look into why and wonder if the tokens had been stolen.

0: https://datatracker.ietf.org/doc/html/draft-ietf-oauth-v2-1
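The second option (one-time-use refresh tokens with reuse detection) can be sketched like this; a toy in-memory store for illustration, not the draft's normative wording:

```python
import secrets

class RefreshTokenStore:
    """Toy one-time-use refresh token rotation with reuse detection.

    Each refresh token belongs to a 'family' (one per grant). If a token
    that was already rotated is presented again, the whole family is
    revoked -- the signal that a token may have been stolen.
    """

    def __init__(self):
        self._active = {}    # live token -> family_id
        self._spent = {}     # previously rotated token -> family_id
        self._revoked = set()

    def issue(self, family_id: str) -> str:
        token = secrets.token_urlsafe(32)
        self._active[token] = family_id
        return token

    def rotate(self, token: str):
        """Exchange a refresh token for a new one; returns None if rejected."""
        if token in self._spent:
            # Replay of a rotated token: assume theft, kill the family.
            self._revoked.add(self._spent[token])
            return None
        family = self._active.pop(token, None)
        if family is None or family in self._revoked:
            return None
        self._spent[token] = family
        return self.issue(family)
```

In the scenario above, once an attacker rotates a stolen token, the legitimate client's next refresh fails, which is exactly the alarm bell you want.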


Isn't one-time use of refresh tokens really common? Where each refresh gets you a new access token AND a new refresh token?

That's standard in OIDC, I believe.


I don't have data on whether it is common, but I know a few OAuth vendors support it. (I work for one such.)

In case anyone is wondering, here is the proposal from Cloudflare: https://blog.cloudflare.com/saas-to-saas-security/

TL;DR: visibility and control for all connections (marketplace apps, 4th parties) to your 3rd party SaaS platforms. See which connections are active as well as when and how much they transmit. Transparent token splitting to force connections through the proxy as well as instant token revocation and ACLs.

We are currently building this and would appreciate your feedback!


I mean, the admin account had visibility into clients' env vars; that's maybe not really great in the first place.

You'd think, but this is a JS dev world.

Next.js apps bake all env vars into the client-side code!! It's all public, unless you prefix the name with private_ or something.


This is incorrect.

You prefix with NEXT_PUBLIC_ to expose them in client-side code; everything else stays server-only.
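The mechanism is just a prefix filter over the environment at build time; a hypothetical Python illustration of the rule (Next.js implements this in its own build tooling, and its real prefix is NEXT_PUBLIC_):

```python
PUBLIC_PREFIX = "NEXT_PUBLIC_"

def client_visible_env(env: dict) -> dict:
    """Only variables carrying the public prefix get inlined into client
    bundles; everything else is available to server code only."""
    return {k: v for k, v in env.items() if k.startswith(PUBLIC_PREFIX)}
```

So the default is private, and exposure is opt-in per variable, not opt-out.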


Corroborates that zero-trust has, until now, been largely marketing gibberish. Security by design means incorporating concepts such as these and not assuming your upstream providers won't be utterly owned in a supply-chain attack.

It's important to occasionally execute or imprison a general to motivate the remaining generals. Rarely though.

Come back after 500 karma.

Organizations should do it not catastrophically wrongly, especially once a core design / concept is mostly solidified. Putting a little time into reliability and guardrails prevents a huge amount of downside.

I've been at organizations that don't think engineers should write tests because it takes too much time and slows them down...

