For a 1T model at Q4, each generated token requires reading ~500GB of memory, so 20 t/s needs something like ~10TB/s of memory bandwidth. That is 8x5090 territory for speed and 16x5090 territory for capacity. HBM4 will bring us close to something actually feasible in a home lab, but it will cost a fortune for early adopters.
Speculative decoding/DFlash will help with it, but YMMV.
Edit:
I missed that this is an A32B MoE, so only the ~32B active parameters are read per token, which drastically reduces the amount of memory traffic. It seems 20 t/s should be doable with ~1TB/s of memory bandwidth (like a 3090).
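For anyone who wants to check the napkin math, here is a rough sketch. It assumes decoding is purely memory-bandwidth bound and that every active weight is read exactly once per token. It ignores KV cache reads, quantization overhead (real Q4 formats are often a bit over 4 bits per weight), and multi-GPU interconnect losses, so treat the results as lower bounds. The function name is made up for illustration.

```python
def required_bandwidth_gb_s(active_params_b: float,
                            bits_per_weight: float,
                            tokens_per_s: float) -> float:
    """Lower-bound memory bandwidth in GB/s for a target decode speed.

    active_params_b: parameters read per token, in billions
                     (all of them for a dense model, only the active
                     experts plus shared layers for an MoE).
    """
    # billions of params * bytes per param = GB read per token
    gb_per_token = active_params_b * bits_per_weight / 8
    return gb_per_token * tokens_per_s


# Dense 1T at Q4: ~500GB per token
dense = required_bandwidth_gb_s(1000, 4, 20)
# A32B MoE at Q4: ~16GB per token
moe = required_bandwidth_gb_s(32, 4, 20)

print(f"dense 1T:  {dense:.0f} GB/s")  # 10000 GB/s, i.e. ~10TB/s
print(f"A32B MoE: {moe:.0f} GB/s")     # 320 GB/s
```

Note that the MoE case only lowers the bandwidth requirement. The full ~500GB of weights still has to sit somewhere fast enough to be reached, so the capacity problem does not go away.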
Interesting. At first pass I’d say source availability has little to do with the topic at hand, but on second thought it might be rather significant. No company would finance building two identical cross-platform apps, but if you have a pool of open-source folks who are free to contribute at their leisure, the calculus changes a bit.
But isn't the whole point of the linked article that the author doesn't like regular apps because they offer less control over UI and functionality than web apps?
Being open-source is kinda even better in that regard.