Hacker News

You have to compute attention between all pairs of tokens at each step, making the naive implementation O(N^3). This is optimized by caching the keys and values computed for previous tokens (the KV cache), so that at each step you only compute attention between the new token and all previous ones. That's much better, but still O(N^2) to generate a sequence of N tokens.
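A minimal sketch of the idea, using toy NumPy projections (the matrix names and dimensions here are illustrative, not from any real model): each step does O(N) work by reusing cached keys and values, so generating N tokens costs O(N^2) total instead of O(N^3).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 8                       # toy model dimension
rng = np.random.default_rng(0)

# Hypothetical query/key/value projection matrices.
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))
Wv = rng.standard_normal((d, d))

k_cache, v_cache = [], []   # grows by one entry per generated token

def attend(x):
    """Attention for one new token against all cached tokens: O(N) work."""
    q = x @ Wq
    k_cache.append(x @ Wk)  # cache this token's key and value so they
    v_cache.append(x @ Wv)  # are never recomputed on later steps
    K = np.stack(k_cache)   # shape (N, d)
    V = np.stack(v_cache)
    weights = softmax(K @ q / np.sqrt(d))  # new token vs. all N so far
    return weights @ V

# Total work over N steps is O(1) + O(2) + ... + O(N) = O(N^2),
# versus O(N^3) if keys/values were rebuilt from scratch each step.
outputs = [attend(rng.standard_normal(d)) for _ in range(5)]
```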

