It's not complicated to explain. The model can handle 4000 tokens at once, so all you can do is work within that window. You can use part of it to quote the previous interactions and part of it for the response. If your content is too large, you need to summarise it first; there are AIs for that too. If the output is too large, you need to split it across multiple rounds. It is pretty hard to work around this limitation for long-form tasks, for example writing a whole novel.
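The "quote previous interactions, leave room for the response" idea can be sketched as a simple token budget. This is a toy illustration, not any real framework's API: the whitespace `count_tokens` is a stand-in for a proper BPE tokenizer, and the function names are made up.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace word."""
    return len(text.split())

def build_prompt(history: list[str], budget: int = 4000, reserve: int = 500) -> str:
    """Keep the most recent turns that fit in (budget - reserve) tokens,
    leaving `reserve` tokens of the window free for the model's response."""
    limit = budget - reserve
    kept, used = [], 0
    for turn in reversed(history):        # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > limit:
            break                         # older turns no longer fit
        kept.append(turn)
        used += cost
    return "\n".join(reversed(kept))      # restore chronological order

history = ["hello there", "word " * 3900, "most recent question"]
prompt = build_prompt(history, budget=4000, reserve=500)
```

Anything that falls off the budget is exactly what you would hand to a summarisation model instead of dropping outright.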
I think we need LLMs capable of reading a whole book at once, about 100K tokens. I hope researchers innovate on the transformer's memory system a bit; there are ideas and papers, but they don't seem to get attention.
Complexity is quadratic in sequence length. For 512 tokens the attention matrix has 262K entries, but for 4000 tokens it grows to 16M and goes OOM on a single GPU. We need about 100K-1M tokens to load whole books at once.
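The arithmetic behind those numbers is just seq_len squared. A back-of-envelope sketch (fp32, a single head and layer; real models multiply this by heads and layers):

```python
def attn_matrix_bytes(seq_len: int, bytes_per_el: int = 4) -> int:
    """Memory for one N x N attention score matrix in fp32."""
    return seq_len * seq_len * bytes_per_el

for n in (512, 4_000, 100_000):
    entries = n * n
    print(f"{n:>7} tokens -> {entries:,} entries, "
          f"{attn_matrix_bytes(n) / 2**30:.2f} GiB per head per layer")
```

At 100K tokens a single head's score matrix alone is tens of GiB, which is why book-length contexts don't fit by brute force.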
Since 2017 there have been hundreds of attempts to bring O(N^2) down to O(N), but none of them has replaced vanilla attention in large models yet; they tend to lose accuracy. FlashAttention (https://arxiv.org/abs/2205.14135) may have a shot, though it computes exact attention and wins by reducing memory traffic rather than by changing the asymptotic complexity.