My hunch would be that even if we had a lot more annotated examples of reasoning and retrieval over 10,000+ tokens, the architectures we have today would still be limited.
It's inherent; see https://arxiv.org/abs/2002.07028 (as I detailed just now in my sibling comment to yours, written before I saw yours).
That said, there are architecture sizing choices that buy much better long-context performance at the cost of some short-context performance, for a given parameter count and inference compute budget.
Having an LLM recall something in exact detail from some 100k tokens back sounds a bit like the ADHD test Cartman got in South Park. We don't recall things exactly; we recall a summarized version.
On the other hand, computers recall things exactly when asked directly (RAM access), so in that sense it seems natural to want the same from an LLM.
One thing we can do that current LLMs can't, at least not directly as far as I'm aware, is go back and re-read a section. Like on-demand RAG, or something.
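Roughly what I have in mind, as a toy sketch (the chunking and TF-IDF retrieval are stand-ins I made up; a real setup would use a proper embedding model and splice the retrieved chunks back into a short prompt):

    # Toy version of "go back and re-read": index earlier chunks, then pull
    # back only the ones relevant to the current question instead of keeping
    # the full 100k-token history in context.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def chunk(text, size=500):
        # naive fixed-size character chunks; real chunking would respect sentences
        return [text[i:i + size] for i in range(0, len(text), size)]

    def reread(history, question, top_k=3):
        chunks = chunk(history)
        vec = TfidfVectorizer().fit(chunks + [question])
        scores = cosine_similarity(vec.transform([question]), vec.transform(chunks))[0]
        best = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:top_k]
        # return hits in document order so they can be re-inserted coherently
        return [chunks[i] for i in sorted(best)]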
In the meantime, good to know it's not terribly useful to have the full 128k context, as it usually is too much for my GPU anyway.
> One thing we can do that current LLMs can't, at least not directly as far as I'm aware, is go back and re-read a section. Like on-demand RAG, or something.
Encoders can do that.
And we can use them with diffusion to generate text [0].
This works because the encoder doesn't impose the masked self-attention used for autoregressive decoding, so subsequent layers can re-focus their key/query vector space to steer information flow "backwards".
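To make the masking point concrete, a toy numpy version (just an illustration of the mask difference, not any particular model's implementation):

    import numpy as np

    def attention(q, k, v, causal):
        # decoder-style: causal=True masks out future positions, so token t can
        # only look backwards; encoder-style: causal=False lets every layer
        # re-read the whole sequence in both directions.
        scores = q @ k.T / np.sqrt(q.shape[-1])
        if causal:
            scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -1e9)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ v

    x = np.random.default_rng(0).normal(size=(6, 8))   # 6 tokens, 8-dim states
    enc_out = attention(x, x, x, causal=False)         # full bidirectional mixing
    dec_out = attention(x, x, x, causal=True)          # no information from the future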
I've noticed this issue as well with smaller local models that have relatively long contexts, say an 8B model with 128k context.
I assumed they did special recall training for these long-context models, but the results seem... not so great.
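If you want to sanity-check a local model yourself, a rough needle-in-a-haystack sweep like this is enough to see where recall falls off (`generate` is a placeholder for however you call your model, and the token counts are very approximate):

    def needle_prompt(approx_tokens, needle="The magic number is 7481."):
        # ~12 tokens per filler sentence, so this only roughly hits approx_tokens
        filler = "The sky was grey and nothing much happened. " * (approx_tokens // 12)
        return (filler + needle + "\n\nQuestion: What is the magic number? "
                "Answer with the number only.")

    def recall_sweep(generate, depths=(1_000, 8_000, 32_000, 64_000)):
        for d in depths:
            answer = generate(needle_prompt(d))
            print(d, "ok" if "7481" in answer else "missed")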