Hacker News | jbellis's comments

Your calibration is wildly off. Asking people for a spot is totally normal at any gym with free weights.

Spotting is a different thing, as you're communicating that you're entrusting your safety to that person.

Imagine someone instead asked you to wipe down the equipment for them or to help put the weights back. Different signal altogether.


That sends a different signal, because you're asking someone to do something you could do yourself but simply choose not to, which is essentially what you described above as "taking advantage of others". However, this is quite different from what I described in my comment.

If you see every request for help as someone taking advantage of others, I'd encourage you to reconsider why you view everyone that way. It might also be preventing you from seeking help yourself, out of fear of being seen as a leech.


> If you see every request for help as someone taking advantage of others

Let me rephrase, because there seems to be some kind of misunderstanding here:

To me, this advice applied broadly would look like such a signal, even if a weak one. The framing of "do it because people like to help" is something that wouldn't even occur to me as motivation to ask for help.


Those examples aren't things a person needs help with; I think that's the difference. I can't spot my own lift. I can't teach myself what a certain machine does if I don't even know what it's called. I can't understand a new lift I haven't seen before without asking the person doing it what it is and a little about it.

Ask people for help where help is actually needed, not to act as your servant cleaning up after you.


The OP of this thread didn't specify the nature of the favours, just gave general advice, which I don't think is helpful.

How should I update my simplistic understanding that decode is bandwidth-bound, given these results showing the B70 decoding faster than a 4090 (which has about 50% more bandwidth)?

I doubt you'd get the same sort of result on a modern-ish MoE or dense model via a more standard inference engine like llama.cpp or vLLM. I don't think MLPerf is a reasonable benchmark at this point.

Edit: Here is a simple llama.cpp comparison where the token-generation results match the rule of thumb.

https://www.reddit.com/r/LocalLLaMA/comments/1st6lp6/nvidia_...
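The bandwidth rule of thumb referenced above can be sketched as a back-of-envelope calculation. This is purely illustrative (the function name and example numbers are my own, not measurements from the linked thread): each decoded token has to stream every active weight through memory once, so bandwidth divided by active-weight bytes gives a rough ceiling on tok/sec.

```python
def decode_tokens_per_sec(mem_bandwidth_gb_s: float,
                          active_params_b: float,
                          bytes_per_weight: float = 2.0) -> float:
    """Rough upper bound on decode speed for a bandwidth-bound model.

    mem_bandwidth_gb_s: memory bandwidth in GB/s
    active_params_b: parameters active per token, in billions
    bytes_per_weight: 2.0 for fp16/bf16, ~0.5-1.0 for common quants
    """
    bytes_per_token = active_params_b * 1e9 * bytes_per_weight
    return mem_bandwidth_gb_s * 1e9 / bytes_per_token

# Illustrative: a 3B-active MoE at fp16 on a ~400 GB/s machine
# gives roughly 400 / (3 * 2) ≈ 66 tok/s as a ceiling, before
# any compute or overhead costs.
```

This is also why an A3B MoE decodes several times faster than a similarly sized dense model on the same hardware: only the active parameters count against the bandwidth budget.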


Probably the best single resource is https://github.com/pmcfadin/awesome-accord

> Batches all operations. Does large number of reads/edits simultaneously...

I wasn't sure what this meant, so I looked at the source. It seems to be referring to tool APIs being designed around taking multiple targets as a list parameter, instead of hoping the model makes appropriately parallel tool calls. (This matches my experience btw, models are reluctant to make a large number of parallel calls simultaneously, and this seems more pronounced with weaker models.)
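A minimal sketch of the pattern being described, with hypothetical tool names of my own invention: the batched variant accepts a list parameter, so one tool call covers any number of targets instead of relying on the model to issue many parallel single-target calls.

```python
# Per-target tool: the model must call this once per file,
# and weaker models often serialize those calls.
def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

# Batched tool: the list parameter makes "read these 20 files"
# a single call, so parallelism is guaranteed by the API shape
# rather than left to the model's discretion.
def read_files(paths: list[str]) -> dict[str, str]:
    return {p: read_file(p) for p in paths}
```

The same shape applies to edits: a tool that takes a list of (file, patch) pairs gets applied in one call, where a single-file edit tool often gets called one file at a time.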


I think Anthropic may have mentioned this first; my custom agent's tools are also designed around this pattern, and I'm pretty sure I picked it up from them.

Branimir is an engineer's engineer. Excellent choice.

His work is very cool and I was impressed by the thoroughness and thoughtfulness of his responses.

The newest structural diff tool is RefactoringMiner; there's a paper and a GitHub repo that works out of the box, which is rare in this space. Excellent results, but mainline is limited to Java IIRC, with a couple of ports to other languages.

what are you using for web search?

We use DuckDuckGo. Sorry for the delayed response as well.

Flash 2 isn't even at EOL until June but we started seeing ~90% error rates getting 429s over the weekend. (So we switched to GPT 5.4 nano.)

Isn't the "KV Compression Strategies (FAIR)" chart showing that the fancy complex algorithm only barely beats simple topk?

The commentary says that topk "degrades rapidly at low ratios" but the same can be seen for HAE (Entropy + OLS).


Fair point, the gap isn't huge in that plot, and both degrade at low ratios. The difference is more in how they degrade: TopK can have sharper, localized failures, while HAE tends to degrade a bit more smoothly. That doesn't always show up strongly in average MSE.

That said, the gains are modest right now; this is still a research prototype exploring the tradeoff, and there's clearly more work to be done.


Is it really that fancy and complex, though? The “entropy recycling bin” seems fancy to me, but the other stuff is least squares and an SVD, these are solid workhorse numerical routines.

For coding, qwen 3.6 35b a3b solved 11/98 of the Power Ranking tasks (best-of-two), compared to 10/98 for the same size qwen 3.5. So it's at best very slightly improved and not at all in the class of qwen 3.5 27b dense (26 solved) let alone opus (95/98 solved, for 4.6).


This has similar problems to SWE-bench in that models are likely trained on the same open-source projects the benchmark uses.

https://blog.brokk.ai/introducing-the-brokk-power-ranking/


If all models are trained on the benchmark data, you cannot extrapolate the benchmark scores to performance on unseen data, but the ranking of different models still tells you something. A model that solves 95/98 benchmark problems may turn out much worse than that in real life, but probably not much worse than the one that only solved 11/98 despite training on the benchmark problems.

This doesn't hold if some models trained on the benchmark and some didn't, but you can fix this by deliberately fine-tuning all models for the benchmark before comparing them. For more in-depth discussion of this, see https://mlbenchmarks.org/11-evaluating-language-models.html#...


It is much faster though. On my m1 max, describing a picture (quick way to get a pretty large context):

Qwen 3.6 35b a3b: 34 tok/sec

Qwen 3.5 27b: 10 tok/sec

Qwen 3.5 35b a3b: doesn't support image input


I've been using Qwen 3.5 35B-A3B with images as input, so I suspect you didn't include the vision part of the model during testing (I use llama.cpp and learned I needed to include the separate mmproj part).


What is the quantization level of your Qwen 3.6 35b a3b model?


You're comparing a tiny model for local inference against a proprietary, expensive frontier model. It would be fairer to compare against a similarly priced model, or against the tiny frontier models like Haiku, Flash, or GPT nano.


Not when the article they're commenting on was doing literally exactly the same thing.


Eh, it's an important perspective, lest someone start thinking they can drop $5k on a laptop and be free of Anthropic/OpenAI. That would be an expensive lesson.

