
Is this due to lack of specific long-context training, or is it more limitations of encoding or similar?

I've noticed this issue as well with smaller local models that have relatively long contexts, say an 8B model with a 128k context.

I imagined they performed special recall training for these long-context models, but the results seem... not so great.



Good question, I was wondering the same.

My hunch would be that even if we had a lot more annotated examples of reasoning and retrieval over 10,000+ tokens, the architectures we have today would still be limited.


It's inherent; see https://arxiv.org/abs/2002.07028 (as I detailed in my sibling comment to yours just now, before I saw yours). That said, there are ways of sizing the architecture that allow much better long-context performance, at the cost of some short-context performance, for a given parameter count and inference compute budget.


Much appreciated, will read the paper tonight.

Having an LLM recall something from ~100k tokens ago with exact detail sounds a bit like the ADHD test Cartman got in South Park. We don't recall things exactly, but rather a summarized version.

On the other hand, computers recall exactly when asked directly (a RAM access), so in that sense it seems natural to want the same from an LLM.

One thing we can do which current LLMs can't, at least directly as far as I'm aware, is to go back and re-read a section. Like on-demand RAG, or something.
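To make that concrete, here's a toy sketch of what I mean by "on-demand re-reading": chunk the long document up front, then pull the most relevant section back into context when a question comes in, instead of trusting the model's recall over 100k tokens. All the names and the word-overlap scoring here are just illustrative, not any real RAG library's API.

```python
def chunk(text, size=200):
    """Split a long document into chunks of roughly `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def reread(chunks, query):
    """Crude retrieval: return the chunk with the most word overlap."""
    q = set(query.lower().split())
    return max(chunks, key=lambda c: len(q & set(c.lower().split())))

# A long "document" with one detail buried in the middle.
doc = "alpha " * 300 + "the secret passphrase is swordfish " + "beta " * 300
section = reread(chunk(doc), "where is the secret passphrase")
assert "swordfish" in section  # the buried detail is recovered verbatim
```

A real system would use embeddings rather than word overlap, but the shape is the same: exact recall comes from re-reading the source, not from the model's context.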

In the meantime, good to know it's not terribly useful to have the full 128k context, as it usually is too much for my GPU anyway.


> One thing we can do which current LLMs can't, at least directly as far as I'm aware, is to go back and re-read a section. Like on-demand RAG, or something.

Encoders can do that. And we can use them with diffusion to generate text [0].

This works because the encoder doesn't impose the masked self-attention required for autoregressive decoding, so subsequent layers can re-focus their key/query vector space to steer information flow "backwards".
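A quick NumPy sketch of the difference (my own toy illustration, not code from the paper): the only change between decoder-style and encoder-style attention is whether you apply the upper-triangular causal mask before the softmax.

```python
import numpy as np

def attention(q, k, v, causal=False):
    """Scaled dot-product attention over one sequence; optional causal mask."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)  # (T, T) token-to-token scores
    if causal:
        # Decoder: mask out j > i so token i never sees the "future".
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v, w

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 tokens, dim 8; use x as q, k, and v
_, w_dec = attention(x, x, x, causal=True)
_, w_enc = attention(x, x, x, causal=False)

assert np.allclose(np.triu(w_dec, k=1), 0)  # decoder: zero weight on future tokens
assert (w_enc > 0).all()                    # encoder: every token attends both ways
```

With the mask dropped, later tokens contribute to every position's representation, which is exactly the "backwards" information flow that a causal decoder can't do in a single pass.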

Happy reading. Feel free to get back!

[0]: https://arxiv.org/abs/2211.15029




