But aren't the experts chosen on a token by token basis, which means bandwidth limitations?


Yes, and the direct conclusion from that is the tl;dr: in theory OP's explanation could mitigate RAM pressure, but in practice it's worse.

(Source: I maintain an app integrated with llama.cpp. In practice, no one likes the 1 tkn/s generation speed you get from swapping. And honestly, MoE makes the RAM situation worse, because model developers have servers, batch inference, and multiple GPUs wired together. They're more than happy to increase the resting RAM budget and use even more total parameters; limiting the number of active experts is about inference speed from that lens, not anything else.)
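
A toy sketch (not from the thread; NumPy, made-up sizes, standard top-k routing assumed) of why per-token expert selection hurts when experts don't fit in RAM: the router picks a different expert subset for each token, so whichever experts aren't resident have to be re-read for nearly every token.

    import numpy as np

    # Illustrative sizes only, not from any specific model.
    n_experts, top_k, d_model, d_ff = 8, 2, 512, 2048
    rng = np.random.default_rng(0)

    router_w = rng.standard_normal((d_model, n_experts)) * 0.02
    experts = [
        (rng.standard_normal((d_model, d_ff)) * 0.02,
         rng.standard_normal((d_ff, d_model)) * 0.02)
        for _ in range(n_experts)
    ]

    def moe_layer(token_vec):
        # Router scores every expert for this single token.
        logits = token_vec @ router_w
        chosen = np.argsort(logits)[-top_k:]      # top-k experts for this token
        weights = np.exp(logits[chosen])
        weights /= weights.sum()
        out = np.zeros_like(token_vec)
        for w, idx in zip(weights, chosen):
            w_in, w_out = experts[idx]            # these weights must be resident now
            out += w * (np.maximum(token_vec @ w_in, 0.0) @ w_out)
        return out, chosen

    # The selected expert set can change on every generated token.
    for step in range(4):
        tok = rng.standard_normal(d_model)
        _, chosen = moe_layer(tok)
        print(f"token {step}: experts {sorted(chosen.tolist())}")

Back-of-the-envelope, with illustrative numbers: if the active experts for a token add up to a few GB of weights and they have to be paged in from an SSD at a couple of GB/s, you're bandwidth-bound at roughly a token per second, which is about the swapping experience described above.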
