My hunch would be that even if we had a lot more annotated examples of reasoning and retrieval over 10,000+ tokens, the architectures we have today would still be limited.
It's inherent; see https://arxiv.org/abs/2002.07028 (as I detailed just now in my sibling comment to yours, written before I saw yours).
That said, there are architecture sizing choices that buy much better long-context performance at the cost of some short-context performance, for a given parameter count and inference compute budget.
Having an LLM recall something in exact detail from some 100k tokens back sounds a bit like the ADHD test Cartman got in South Park. We don't recall things exactly; we recall a summarized version.
On the other hand, computers recall things exactly when asked directly (RAM access), so in that sense it seems natural to want the same from an LLM.
One thing we can do that current LLMs can't, at least not directly as far as I'm aware, is go back and re-read a section. Like on-demand RAG, or something.
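Roughly what I have in mind, as a toy sketch (the chunking and TF-IDF retrieval are stand-ins I made up; a real setup would use a proper embedding model and splice the retrieved chunks back into a short prompt):

    # Toy version of "go back and re-read": index earlier chunks, then pull
    # back only the ones relevant to the current question instead of keeping
    # the full 100k-token history in context.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def chunk(text, size=500):
        # naive fixed-size character chunks; real chunking would respect sentences
        return [text[i:i + size] for i in range(0, len(text), size)]

    def reread(history, question, top_k=3):
        chunks = chunk(history)
        vec = TfidfVectorizer().fit(chunks + [question])
        scores = cosine_similarity(vec.transform([question]), vec.transform(chunks))[0]
        best = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:top_k]
        # return hits in document order so they can be re-inserted coherently
        return [chunks[i] for i in sorted(best)]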
In the meantime, good to know it's not terribly useful to have the full 128k context, as it usually is too much for my GPU anyway.
> One thing we can do that current LLMs can't, at least not directly as far as I'm aware, is go back and re-read a section. Like on-demand RAG, or something.
Encoders can do that.
And we can use them with diffusion to generate text [0].
This works because the encoder doesn't impose the masked self-attention used for autoregressive decoding, so subsequent layers can re-focus their key/query vector space to steer information flow "backwards".
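To make the masking point concrete, a toy numpy version (just an illustration of the mask difference, not any particular model's implementation):

    import numpy as np

    def attention(q, k, v, causal):
        # decoder-style: causal=True masks out future positions, so token t can
        # only look backwards; encoder-style: causal=False lets every layer
        # re-read the whole sequence in both directions.
        scores = q @ k.T / np.sqrt(q.shape[-1])
        if causal:
            scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -1e9)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ v

    x = np.random.default_rng(0).normal(size=(6, 8))   # 6 tokens, 8-dim states
    enc_out = attention(x, x, x, causal=False)         # full bidirectional mixing
    dec_out = attention(x, x, x, causal=True)          # no information from the future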
I've noticed this issue as well with smaller local models that have relatively long contexts, say an 8B model with 128k context.
I assumed they did special recall training for these long-context models, but the results seem... not so great.
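If you want to sanity-check a local model yourself, a rough needle-in-a-haystack sweep like this is enough to see where recall falls off (`generate` is a placeholder for however you call your model, and the token counts are very approximate):

    def needle_prompt(approx_tokens, needle="The magic number is 7481."):
        # ~12 tokens per filler sentence, so this only roughly hits approx_tokens
        filler = "The sky was grey and nothing much happened. " * (approx_tokens // 12)
        return (filler + needle + "\n\nQuestion: What is the magic number? "
                "Answer with the number only.")

    def recall_sweep(generate, depths=(1_000, 8_000, 32_000, 64_000)):
        for d in depths:
            answer = generate(needle_prompt(d))
            print(d, "ok" if "7481" in answer else "missed")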