I'm pretty sure Qwen is faster? The MoE version of Qwen has 3B active parameters, while Gemma 4 has 4B active. Similarly, the dense Qwen is 27B while Gemma is 31B. All else being equal (though I know all else isn't equal), Qwen should be faster in both cases. I haven't actually measured with any precision, but on my AMD hardware (Strix Halo or dual Radeon Pro V620) they seem quite similar in both cases: both MoE models are fast enough for interactive use, and both dense models are notably smarter but much slower, with a long time to first response and single-digit tokens per second once they start talking.
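For a rough napkin check of why the active-parameter count matters (assuming decode is memory-bandwidth-bound, a ~4-bit quant, and around 256 GB/s of usable bandwidth on Strix Halo; every number here is an assumption, not a measurement):

```python
# Rough ceiling on decode speed: each generated token streams the active
# weights through memory once, so tok/s <= bandwidth / active-weight bytes.
# Every number here is an assumption for illustration, not a measurement.
GBPS = 256               # assumed usable memory bandwidth (Strix Halo ballpark)
BYTES_PER_PARAM = 0.55   # ~4-bit quant plus a bit of overhead

for name, active_params_billion in [("MoE, 3B active", 3), ("MoE, 4B active", 4),
                                    ("dense 27B", 27), ("dense 31B", 31)]:
    active_gb = active_params_billion * BYTES_PER_PARAM
    print(f"{name}: <= {GBPS / active_gb:.0f} tok/s theoretical ceiling")
```

Real throughput lands well below these ceilings once you add prompt processing, KV-cache reads, and kernel overhead, but it shows why the active-parameter count rather than the total size dominates decode speed.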
qwen-3.6 is really interesting. The dense 27B model is pretty slow for me, whereas the sparse 31B is blazingly fast, but it also needs to be since it's so chatty. It produces pages and pages of stream-of-consciousness stuff. The 27B does this to a lesser extent, but it's slow enough that I can actually read it, whereas the 31B just blasts by.
I haven't yet compared either to Gemma 4. I tried that out the day after it came out, with the patched llama.cpp that added support for it, but I couldn't make tool calling work, so it was kind of useless. I should try again to see if things have changed, but judging by what people say, qwen-3.6 seems stronger for coding anyway.
Qwen without thinking is just as fast. I have four parameter settings based on the recommendations. For a hard coding problem, the thinking coding mode works well but takes a while to arrive at an answer. If you want faster turnaround, instruction mode works without thinking.
Flash is the fast (duh) model, though. It's not always beneficial to use Pro. In practice: 1/ set to Flash 3.1; 2/ force Pro... sometimes, mainly when the CLI fails to predict which model to use.
note that it will sometimes fall back to flash 2, which sucks
Flash will absolutely destroy a complex codebase. It's like a drunk junior programmer. Don't trust it with anything more complex than autocomplete.
Pro is expensive, but good. However, they've decreased the pitiful stipend they used to include in even the Ultra plan to the point where it's barely usable. I pivoted back to ChatGPT Pro after the recent downgrade they gave Ultra users. Google's Ultra plan costs 2.5x as much and delivers about half the usage.
Tangent: this is one of those situations where slang is harmful to understanding. When I saw "will absolutely destroy" my first interpretation was a positive connotation. Of course further context made it clear you were being straightforward, and this isn't aimed at you. Along these lines, "drop" has become a problematic term: "Acme co dropped support for Foo" means it's EOL, but "Foo dropped today" implies it just landed. Idioms are hard enough when they don't serve as borderline autoantonyms.
To wrap up this extended digression, if anyone else finds this sort of thing interesting, and could use a good laugh, check out Ismo (a standup comic from Finland who makes truly hilarious observations about English as a second language).
Yeah, I don't get the user who said Gemini is generous with the quota; I get more use out of Codex with its 5-hour limits than Gemini gives me in a week.
Even 900 MHz sucks vs 433 MHz. The lower the frequency, the better it penetrates matter for the same amplitude.
Lower than 430 MHz you start to run into severe bandwidth issues, though. And it's not allowed to transmit LoRa/DSS on 430 MHz in the US without a license, hence the 900 MHz.
At 2.4 GHz the real-world usefulness is limited; might as well use Wi-Fi. The only advantage is short-range bandwidth while keeping LoRa compatibility.
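For a rough sense of the frequency trade-off, here's the standard free-space path loss formula at the common LoRa bands (this ignores material penetration and antenna differences, so it's only part of the picture):

```python
import math

def fspl_db(distance_km: float, freq_mhz: float) -> float:
    """Free-space path loss in dB (standard formula: distance in km, frequency in MHz)."""
    return 20 * math.log10(distance_km) + 20 * math.log10(freq_mhz) + 32.44

# Same 5 km link at the common LoRa bands: higher frequency loses more,
# before even counting walls/foliage, which hit 2.4 GHz hardest.
for f in (433, 915, 2400):
    print(f"{f} MHz: {fspl_db(5, f):.1f} dB")
```

That works out to roughly a 6.5 dB advantage for 433 MHz over 915 MHz and about 15 dB over 2.4 GHz at the same distance and power.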
Google is in a different position to others in that they're the only frontier lab with a cloud infra business. It obviously makes sense to sell GPUs on cloud infra as people want to rent them. In that respect Google buys a ton of GPUs to rent out.
What's unclear to me is how much Google uses GPUs for their own stuff. Yes Gemini runs on GPUs now, so that Google can sell Gemini on-prem boxes (recent release announced last week), but is any training or inference for Gemini really happening on GPUs? This is unclear to me. I'd have guessed not given that I thought TPUs were much cheaper to operate, but maybe I'm wrong.
Caveat, I work at Google, but not on anything to do with this. I'm only going on what's in the press for this stuff.
It mentions that Gemini can run on eight NVIDIA GPUs, but not which GPU or which Gemini model. Either way, assuming 288 GB of memory per GPU, that puts an upper bound of 288 * 8 = 2304 GB on the size of the Gemini model, which as far as I know has been a secret until now.
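Back-of-envelope on what that memory ceiling implies, assuming the 288 GB figure is per-GPU HBM and ignoring KV cache and runtime overhead (both assumptions):

```python
# Loose ceiling on parameter count that fits in 8 GPUs' worth of memory,
# ignoring KV cache, activations, and runtime overhead.
total_gb = 288 * 8   # 2304 GB, per the figure above

for precision, bytes_per_param in [("bf16", 2), ("fp8", 1), ("int4", 0.5)]:
    max_params = total_gb * 1e9 / bytes_per_param
    print(f"{precision}: at most ~{max_params / 1e12:.1f}T parameters")
```

Even at the loosest precision that's a ceiling in the low single-digit trillions of parameters.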
My info is most likely outdated; I left Google Research 4 years ago. Back then, TPU instances were plentiful and GPUs scarce. Nobody wanted to mess with an immature, crashing compiler and very steep performance cliffs (performance was excellent only if you stayed within the guardrails, and stepping outside them was still supported and didn't even produce a warning, since it was so common in code).
But I believe most of it has changed for the better for TPUs.
They all hope to make a lot of money off of it.
Meshcore has a marketing team spamming Reddit all day and a map to make you believe people use it right now. Then you connect to the mesh and you're utterly alone there. At least Meshtastic has real users lol.
best is to use your own model router atm, depending on the task
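Something like this minimal sketch is usually enough; the model names and the keyword heuristic below are placeholders for illustration, not any particular provider's API:

```python
# Minimal sketch of a task-based model router. The model names and the
# keyword heuristic are placeholders for illustration only.
def pick_model(task: str) -> str:
    t = task.lower()
    if any(k in t for k in ("refactor", "architecture", "debug")):
        return "big-slow-model"       # hard, multi-file work: pay for the smart model
    if any(k in t for k in ("rename", "format", "write a test")):
        return "small-fast-model"     # mechanical edits: cheap and fast is fine
    return "medium-default-model"     # everything else

print(pick_model("refactor the auth module"))   # -> big-slow-model
print(pick_model("rename this variable"))       # -> small-fast-model
```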