My intuition is that questions that require reasoning consistently perform worse than direct retrieval questions, without exception, especially when negatives are involved or distractors are present. You're right, though: intuition isn't measurement, and some actual numbers would be nice to see.
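For what it's worth, here's a minimal sketch of the two question types I mean, over the same context. All names and facts are invented for illustration:

```python
# Two probes over the same context: one direct retrieval, one that needs
# negation plus filtering past a distractor. All data here is invented.
context = (
    "Invoice #1041 was paid on 2024-03-02.\n"
    "Invoice #1042 has not been paid.\n"       # the negative fact
    "Invoice #1104 was paid on 2024-03-02.\n"  # distractor with a similar ID
)

direct_q = "When was invoice #1041 paid?"  # answer is a literal span
indirect_q = "Which of the listed invoices is still outstanding?"  # needs negation + filtering

for q in (direct_q, indirect_q):
    prompt = f"{context}\nQuestion: {q}\nAnswer:"
    print(prompt)  # feed each to the model and compare exact-match accuracy
```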
ICL is a phenomenon separate from long-context performance degradation; the two can coexist, just as lost-in-the-middle affects the performance of in-context examples depending on where they sit, the same way it affects retrieved facts.
Yeah, ultimately it depends on the problem. Reading an article like this, it's easy to conclude that the context should always be reduced, with everything relegated to a vector database[1] and retrieved on demand so that the context stays as small as possible. That makes me want to point to situations where, conversely, growing the context substantially improves performance.
It really depends on the task, but I imagine most real-world scenarios have a mixed bag of requirements, such that it's not a needle-in-a-haystack problem but closer to ICL. Even memory retrieval (an example given in the post) can be tricky, because you can't always trust cosine similarity on short text snippets to map cleanly to relevant memories, so you may end up omitting good data and including bad data (which skews the LLM heavily in the wrong direction); see the sketch after the footnote.
[1]: Coincidentally what the post author is selling
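To make the retrieval point concrete, a toy sketch of cosine-similarity memory retrieval over precomputed embeddings (the function names and the threshold value are made up): whichever threshold you pick, you trade dropped relevant memories against pulled-in near-matches.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec, memory_vecs, memories, k=3, threshold=0.75):
    """Rank memories by cosine similarity and keep the top-k above a threshold.

    The threshold is the weak point: set it high and you silently drop
    relevant memories; set it low and you pull in superficially similar
    snippets that skew the model the wrong way.
    """
    scored = sorted(
        ((cosine(query_vec, v), m) for v, m in zip(memory_vecs, memories)),
        key=lambda t: t[0],
        reverse=True,
    )
    return [(score, mem) for score, mem in scored[:k] if score >= threshold]
```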
I always disable reasoning when I can. It got overhyped because of DeepSeek, even though the short, one-sentence chain of thought most conversational models were already trained to produce seemed to be enough.
That's not what I mean. By "questions that require reasoning" I mean indirect questions that require picking a fact out of the context and processing it somehow, not necessarily the reasoning chains models are natively trained to produce. That's what GP is talking about.
A built-in reasoning chain certainly helps in long-context tasks, especially when it's largely trained to summarize the context and decompose the problem, as in Gemini 2.5 (you can easily jailbreak it to see the native reasoning chain that's normally hidden behind system delimiters) and DeepSeek R1-0528, or when you force summarization with a custom prompt/prefill. The article seems to agree.
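For the prefill variant, a sketch assuming an OpenAI-compatible endpoint that continues a trailing assistant message as a prefill (some providers do, e.g. Anthropic and DeepSeek's beta API; plain OpenAI doesn't). The endpoint and model names below are placeholders:

```python
from openai import OpenAI

# Placeholders: point this at whatever provider you actually use.
client = OpenAI(base_url="https://api.example.com/v1", api_key="...")

long_context = "..."  # the full document(s)
question = "..."

messages = [
    {"role": "user", "content": f"{long_context}\n\nQuestion: {question}"},
    # The prefill: the model is nudged to continue from this partial answer,
    # so it summarizes the context before answering.
    {"role": "assistant",
     "content": "Let me first summarize the parts of the context relevant to the question:\n1."},
]

resp = client.chat.completions.create(model="some-model", messages=messages)
print(resp.choices[0].message.content)
```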