How well separated are experts per domain in a model like that? Specifically, if I'm only interested in programming use, could we strip it down to one or two of them? Or should I assume a much wider spread? (And there would be some overlap anyway from the original root model.)
My experience is that experts are not separated in any intuitive way. I would be very interested (and surprised) if someone manages to prune a majority of experts in a way that preserves model capabilities in a specific domain but not others.
Sounds like dumping the routing information from programming questions would answer that... I guess I can do a dump from qwen or deepseek locally. You'd think someone would have created that kind of graph already, but I couldn't find one.
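Something like this is roughly what I mean by a dump, a quick sketch assuming a HF transformers MoE checkpoint (Qwen1.5-MoE-A2.7B as a stand-in) whose per-layer routers are Linear modules named "...mlp.gate"; module names and top-k differ per architecture, so the filter and k would need adjusting for deepseek:

```python
# Count which experts a local MoE model routes programming prompts to,
# by hooking the per-layer router (gate) Linear modules and tallying
# the top-k expert indices they select.
from collections import Counter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-MoE-A2.7B"  # assumption: any HF MoE model with "mlp.gate" routers
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

expert_counts = {}  # layer name -> Counter of selected expert ids

def make_hook(name):
    def hook(module, inputs, output):
        # output: router logits of shape (num_tokens, num_experts)
        topk = output.float().topk(k=4, dim=-1).indices  # k=4 for this model, adjust per arch
        expert_counts.setdefault(name, Counter()).update(topk.flatten().tolist())
    return hook

for name, module in model.named_modules():
    if name.endswith("mlp.gate"):  # router Linear in Qwen-MoE naming
        module.register_forward_hook(make_hook(name))

prompts = [
    "Write a Python function that parses a CSV file.",
    "Explain the difference between a mutex and a semaphore.",
]
with torch.no_grad():
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        model(**ids)

for layer, counts in sorted(expert_counts.items()):
    print(layer, counts.most_common(8))
```

Run the same thing over a non-programming prompt set and diff the histograms; if pruning were viable you'd expect a small, stable set of experts to dominate the programming runs.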
What I did find instead is that some MoE models are explicitly domain-routed (MoDEM), but that doesn't apply to deepseek, which just load-balances experts evenly, so it's unlikely to apply to Kimi either. On the other hand, https://arxiv.org/html/2505.21079v1 shows modality preferences emerging between experts even with mostly random training. So maybe there's something there.
Check out the DeepSeek-V3 model paper. They changed the way they train experts (moving from an auxiliary loss to a different kind of expert-separation training, their auxiliary-loss-free load balancing), and it did improve expert domain specialization; they have neat graphics on it in the paper.
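As I understand the idea, each expert gets a bias that is added to its routing score only when picking the top-k (not when computing the gate weight), and the bias is nudged down for overloaded experts and up for underloaded ones. A toy sketch with made-up shapes and constants, not the real model's:

```python
# Toy sketch of bias-based (aux-loss-free) load balancing for MoE routing.
import torch

num_experts, top_k, gamma = 8, 2, 0.001
bias = torch.zeros(num_experts)  # per-expert routing bias, updated every step

def route(scores: torch.Tensor):
    """scores: (num_tokens, num_experts) affinity scores."""
    # The bias only influences which experts get picked...
    topk_idx = (scores + bias).topk(top_k, dim=-1).indices
    # ...while the actual gating weights come from the unbiased scores.
    gate = torch.gather(scores, -1, topk_idx)
    gate = gate / gate.sum(dim=-1, keepdim=True)
    return topk_idx, gate

def update_bias(topk_idx: torch.Tensor) -> None:
    # Count how often each expert was selected in this batch.
    load = torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
    target = load.mean()
    # Overloaded experts get their bias decreased, underloaded increased.
    bias.add_(gamma * torch.sign(target - load))

# Usage with random affinities for 16 tokens.
scores = torch.rand(16, num_experts)
idx, gate = route(scores)
update_bias(idx)
print(idx.shape, gate.shape, bias)
```

Because balance is enforced through routing rather than through a loss term pulling on the expert weights, the experts are freer to specialize, which is presumably why the specialization plots in the paper look cleaner.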