Re 3) Even if you don't care about long context length, Mamba is much cheaper per token of auto-regressive output. Each new token only requires computing the next step of a linear RNN, whereas a transformer has to attend back over all previous outputs, which rapidly grows in cost and memory.
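
A minimal sketch of the contrast (this uses a plain linear recurrence as a stand-in, not Mamba's actual selective state-space update; all matrices and shapes are made up for illustration). The recurrent step does a fixed amount of work per token, while the attention step touches a cache that grows with the sequence length t:

  # Illustrative only: constant-work recurrent decode vs. attention over a growing KV cache.
  import numpy as np

  d = 8  # hidden/state dimension (arbitrary)

  # Linear-RNN-style decode: O(d^2) per token, fixed-size state.
  A = np.eye(d) * 0.9          # hypothetical state-transition matrix
  B = np.random.randn(d, d)    # hypothetical input projection
  def rnn_step(state, x):
      return A @ state + B @ x  # cost independent of sequence length

  # Transformer-style decode: attend over all cached keys/values,
  # so per-token cost and memory grow with t.
  def attn_step(q, K_cache, V_cache):
      scores = K_cache @ q / np.sqrt(d)        # O(t * d)
      weights = np.exp(scores - scores.max())
      weights /= weights.sum()
      return weights @ V_cache                 # O(t * d)

  state = np.zeros(d)
  K_cache, V_cache = np.zeros((0, d)), np.zeros((0, d))
  for t in range(1000):
      x = np.random.randn(d)
      state = rnn_step(state, x)               # constant work and memory
      K_cache = np.vstack([K_cache, x])        # cache keeps growing
      V_cache = np.vstack([V_cache, x])
      y = attn_step(x, K_cache, V_cache)       # work grows with t

So by token 1000 the attention step is scanning 1000 cached entries, while the recurrent step still touches only its d-dimensional state.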

