I use Qwen 3.5 122B on an RTX PRO 6000 with opencode, and I'm very pleased. I don't feel the need to use a closed model any more. After answering its questions in Plan mode, the result is almost always what I want, with only the occasional bug. It puts a lot of effort into seeing how the code I'm working on is currently written, and extends it in the same style.
If they release a Qwen 3.6 that also makes good use of the card, I may move to it.
There was a Qwen 3.6 MoE six days ago that I thought was better than Gemma 4. Today's is a dense model. (Gemma released both a 26B MoE and a 31B dense at the same time.)
I intend to evaluate all four on some evals I have, as long as I don't get squirrelled again.
So far I'm unimpressed for local inference. I got 11 tokens per second on MLX on an M5 Pro with 128GB of RAM, so it took an hour to write a few hundred lines of code that didn't work. Opus and Sonnet in CC completed the same task successfully in a matter of minutes. The 3.6:35b model seemed okay on ollama yesterday.
I need to check out other harnesses for this besides Claude Code, but the local models are just painfully slow.
vLLM apparently already has an implementation of turboquant available - which is said to losslessly reduce the memory footprint required by 6x and improve inference speed by 8x.
From what I understand, the steps are:
1. launch vLLM
2. execute a vLLM configure command like "use kv-turboquant for model xyz"
3. that's it
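For what it's worth, vLLM does already expose KV-cache quantization through a real CLI flag; a hedged sketch of what step 2 might actually look like (the "turboquant" naming and its 6x/8x claims are unverified, and the model name here is a placeholder):

```shell
# --kv-cache-dtype is an existing vLLM flag that quantizes the KV cache
# (e.g. to fp8); whether a "turboquant" option exists or behaves
# differently is unverified.
vllm serve <model> --kv-cache-dtype fp8
```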
I've got two kids under 8 years old, a full time job, and a developer-tools project that takes like 105% of my mental interests... so there's been a bit of a challenge finding the time to swap from ollama to vLLM in order to find out if that is true.
So buyer beware :D - and also, if anyone tries it, please let me know whether it's worth the time!
This is a dense model, so that's expected. On a Mac you'd want to try the Mixture of Experts Qwen3.6 release, namely Qwen3.6-35B-A3B. On an M4 Pro I get ~70 tok/s with it. If your numbers are slower than this, it might be because you're accidentally using a GGUF-formatted model rather than MLX (an Apple-specific format that is often more performant on Macs).
School was like that back in Brazil in the 80s. Don’t get the grade, try again. And again. And again. Until you do. That is the real “not left behind”.
Yes. That's the whole idea of adaptive/personalized learning.
There are some interesting ethical questions about disparity of outcomes, but we should aim to have a system that educates each individual as best we can (without leaving kids behind). The current system of bludgeoning every individual student to be the same is not working well.
A recent Atlantic article [1] by someone involved in the Mississippi reforms gives a good outline of what they did/are doing. It includes science-based reading curriculum and holding kids back (as you mentioned). It also includes other forms of accountability, including parental notification and empowering the state to force recalcitrant districts to improve. One notable quote:
"The law allowed the state to abolish these districts’ local school board and remove the local superintendent in favor of a state appointee who would report directly to the state board of education. A later amendment provided that removed local-school-board members would be barred from serving in that capacity again."
Politically unpopular in some cases (which local jurisdiction wants the state coming in and replacing your local school board?), but seems to be pretty effective.
The challenge is token speed. I did some local coding yesterday with qwen3.6 35b, and getting 10-40 tokens per second means the wall time is much longer. 20 tokens per second is a bit over a thousand tokens per minute, which is slower than the experience you get with Claude Code or the Opus models.
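To put the arithmetic in one place: at a steady generation rate, wall time scales linearly with token count. The 20 tok/s figure is from the comment; the ~80 tok/s cloud figure is an illustrative assumption, not a measured number.

```python
# Rough wall-time comparison for generating a fixed amount of output.
def minutes_to_generate(total_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock minutes to emit `total_tokens` at a steady rate."""
    return total_tokens / tokens_per_second / 60

# ~8k tokens is roughly a few hundred lines of code plus reasoning tokens
# (an assumed figure for illustration).
local = minutes_to_generate(8_000, 20)   # local model at 20 tok/s
cloud = minutes_to_generate(8_000, 80)   # assumed cloud rate
print(f"local: {local:.1f} min, cloud: {cloud:.1f} min")
```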
Slower and worse is still useful, but not as good in two important dimensions.
Also, benchmark scores are not empirical-experience measures, and they are heavily gamed. As other commenters have said, the actual observed behavior is inferior, so it's not just speed.
It's ludicrous to believe a small-parameter-count model will outperform a well-made high-parameter-count model. That's just magical thinking. We've not empirically observed any flattening of the scaling laws, and there's no reason to believe the scrappy and smart Qwen team has discovered P=NP, FTL, or a magical nonlinear parameter-count scaling model.
It's kinda like saying a car with a 6L engine will always outperform a car with a 2L engine. There are so many different engineering tradeoffs, so many different things to optimize for, so many different metrics for "performance", that while it's broadly true, it doesn't mean you'll always prefer the 6L car. Maybe you care about running costs! Maybe you'd rather own a smaller car than rent a bigger one. Maybe the 2L car is just better engineered. Maybe you work in food delivery in a dense city and what you actually need is a 50cc moped, because agility and latency are more important than performance at the margins.
And if you're the only game in town, and you only sell 6L behemoths, and some upstart comes along and starts selling nippy little 2L utility vehicles (or worse - giving them away!) you should absolutely be worried about your lunch. Note that this literally happened to the US car industry when Japanese imports started becoming popular in the 80s...
This is just blind belief. The model discussed in this topic already outperforms "well made" frontier LLMs of 12-18 months ago. If what you wrote were true, that wouldn't have been possible.
"Effective defense requires architectural change: treating OAuth apps as third‑party vendors, eliminating long‑lived platform secrets, and designing for the assumption of provider‑side compromise."
Designing for provider-side compromise is very hard because that's the whole point of trust...
As someone trying to think about OAuth apps at our SaaS, it certainly is very hard.
Do any marketplaces have a good approach here? I know Cloudflare, after their similar Salesloft issue, has proposed proxying all 3rd party OAuth and API traffic through them. But that feels a little bit like trading one threat vector for another.
Other than standard good practices like narrow scopes, shorter expirations, and maybe OAuth client secret rotation, I'm not sure what else can be done. Maybe allowlisting the IP addresses that requests associated with a given client can come from?
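A minimal sketch of that last idea, per-client IP allowlisting at the token endpoint. The names (`CLIENT_ALLOWLISTS`, `is_request_allowed`) and the CIDR ranges are hypothetical, not from any particular OAuth library:

```python
from ipaddress import ip_address, ip_network

# client_id -> CIDR ranges the vendor registered with us (illustrative data)
CLIENT_ALLOWLISTS = {
    "client_abc": [ip_network("203.0.113.0/24")],
}

def is_request_allowed(client_id: str, remote_ip: str) -> bool:
    """Reject token requests from IPs outside the client's registered ranges."""
    ranges = CLIENT_ALLOWLISTS.get(client_id)
    if ranges is None:
        return False  # unknown client: deny by default
    addr = ip_address(remote_ip)
    return any(addr in net for net in ranges)
```

This helps against stolen refresh tokens being replayed from attacker infrastructure, though it breaks down for clients running on shared cloud egress IPs.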
This was probably partly a Google refresh token theft (given the length of the access). No inside info, just looking at how the attack occurred.
OAuth 2.1 [0] (an RFC draft that has been around longer than I've been at my employer) recommends some protections around refresh tokens: either making them sender-constrained (tied to the client application by public/private-key cryptography) or one-time use, with revocation if a token is used multiple times.
This is recommended for public clients, but I think makes sense for all clients.
The first option is more difficult to implement, but is similar to the IP address solution you suggest. More robust though.
The second option would have made this attack more difficult because the refresh token held by the legit client, context.ai, would have stopped working, presumably triggering someone to look into why and wonder if the tokens had been stolen.
TL;DR: visibility and control for all connections (marketplace apps, 4th parties) to your 3rd party SaaS platforms. See which connections are active as well as when and how much they transmit. Transparent token splitting to force connections through the proxy as well as instant token revocation and ACLs.
We are currently building this and would appreciate your feedback!
This corroborates that zero trust has, until now, been largely marketing gibberish. Security by design means incorporating concepts like these rather than assuming your upstream providers will never be utterly owned in a supply-chain attack.
Organizations should do it, just not catastrophically wrongly, especially once a core design/concept is mostly solidified. Putting a little time into reliability and guardrails prevents a huge amount of downside.
I've been at organizations that don't think engineers should write tests because it takes too much time and slows them down...