- The paper presents LongCache, a training-free method that extends LLMs to handle infinite contexts through selective cache-based attention.
- It employs a top-k cache selection mechanism that dynamically identifies global, middle, and local context segments for efficient processing.
- Experiments on LongBench and L-Eval benchmarks show that LongCache matches full-attention methods while processing context lengths up to 400K tokens.
ReAttention: Training-Free Infinite Context with Finite Attention Scope
Introduction
The paper "ReAttention: Training-Free Infinite Context with Finite Attention Scope" (2407.15176) addresses a core limitation in the deployment of LLMs: the finite context length these models can handle due to architectural constraints, particularly in the attention mechanism. Conventional length-extrapolation strategies for supporting longer contexts often rely on modified attention mechanisms and still impose an explicit upper bound on the extended context length. This work introduces LongCache, a novel approach that removes these limitations by enabling LLMs to process effectively unbounded contexts without additional training or growing memory requirements.
Methodological Framework
LongCache Mechanism
LongCache operates by strategically selecting and processing the most critical context segments within a fixed attention window, a concept the authors refer to as "full-context cache selection."

Figure 1: Full-Context Cache Selection.
Cache Selection Process
The method involves parsing the input into global, middle, and local segments. The global and local segments represent highly relevant parts of the input for LLM operations — respectively the beginning and ending portions, which are known to be critical in many NLP tasks. Unlike strategies such as those used in StreamingLLM, which discard middle context sections, LongCache employs a selective approach. The model dynamically identifies and retains high-importance segments within the middle context using a top-k selection process.
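As a concrete illustration, the top-k step can be sketched as a similarity ranking over the cached middle keys. The function name and shapes below are hypothetical, not the authors' implementation:

```python
import numpy as np

def select_middle_cache(q, k_middle, top_k):
    """Keep the top-k middle-cache entries most relevant to the current query.

    q: (d,) latest query vector; k_middle: (n, d) cached middle keys.
    Hypothetical sketch of the selection idea, not the paper's exact code.
    """
    scores = k_middle @ q                 # one relevance score per cached key
    top_k = min(top_k, len(scores))
    kept = np.argsort(scores)[-top_k:]    # indices of the k highest scores
    return np.sort(kept)                  # restore original token order

# Toy example: 4 cached keys, head dim 2; keys 0 and 2 align best with q.
q = np.array([1.0, 0.0])
k_middle = np.array([[3.0, 0.0], [1.0, 0.0], [2.0, 0.0], [0.0, 5.0]])
print(select_middle_cache(q, k_middle, top_k=2))  # → [0 2]
```

Sorting the surviving indices back into ascending order matters: the retained cache must stay a coherent subsequence of the original context.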
Training-Free Attention Integration
Following selection, the retained segments are concatenated into a single contiguous sequence and assigned fresh position embeddings. Standard self-attention then runs over this shortened sequence: only the relative order of the selected tokens is preserved, not their original distances, so the attention scope stays fixed and computational cost does not grow with the true context length.
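Re-applying position embeddings amounts to mapping the surviving token positions onto a fresh contiguous range. A minimal sketch, with a hypothetical helper name:

```python
def contiguous_positions(kept_indices):
    """Map surviving original positions to fresh contiguous position ids.

    kept_indices must be in ascending order; relative order is preserved,
    but the gaps left by discarded cache entries vanish, so attention only
    ever sees positions within the fixed scope. Hypothetical helper.
    """
    return {orig: new for new, orig in enumerate(kept_indices)}

# Global tokens 0-2, one selected middle token, two local tokens:
print(contiguous_positions([0, 1, 2, 57, 900, 901]))
# → {0: 0, 1: 1, 2: 2, 57: 3, 900: 4, 901: 5}
```

However far apart the selected tokens were in the original input, the model attends over positions 0..5 here, which is what keeps the attention scope finite.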

Figure 2: LongCache (Ours).
Experimental Validation
The authors validate LongCache through comprehensive testing on the LongBench and L-Eval benchmarks, along with Needle-In-A-Haystack tests. Remarkably, LongCache demonstrates effectiveness comparable to traditional full-attention methods on standard LLMs such as LLaMA3 and Mistral, managing context lengths of up to 400K tokens.

Figure 3: LLaMA3-8B-8K.
Comparative Analysis
Compared with existing extrapolation techniques such as Dynamic NTK and StreamingLLM, LongCache consistently matches or outperforms these approaches across context-scaling benchmarks. Its dynamic sparse attention, driven by cache selection, lets it handle substantially longer contexts efficiently, underscoring its viability for extended-context workloads without sacrificing accuracy or exceeding computational limits.
Discussion
Limitations and Strengths
LongCache computes attention scores without position embeddings for the initial cache-selection step, a departure from conventional practice. The experiments show that this choice reliably retrieves the relevant context while avoiding the positional noise that full-attention implementations with position embeddings can introduce.
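The contrast can be sketched by separating the scoring pass from the attention pass: selection scores come from the raw, un-rotated query and keys, while rotary embeddings are applied only when attention is actually computed. The minimal RoPE and the function names here are illustrative assumptions, not the paper's code:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    # Minimal rotary embedding for an even head dim: rotate the two halves
    # of x by position-dependent angles. Simplified for illustration.
    half = x.shape[-1] // 2
    angles = pos * base ** (-np.arange(half) / half)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * np.cos(angles) - x2 * np.sin(angles),
                           x1 * np.sin(angles) + x2 * np.cos(angles)], axis=-1)

def selection_scores(q, keys):
    # Cache selection: score with the *un-rotated* q and keys, so relevance
    # is judged free of positional bias.
    return keys @ q

def attention_logits(q, keys, q_pos, key_pos):
    # Actual attention: apply RoPE using the re-assigned contiguous positions.
    q_r = rope_rotate(q, q_pos)
    k_r = np.stack([rope_rotate(k, p) for k, p in zip(keys, key_pos)])
    return k_r @ q_r
```

Because `selection_scores` never sees positions, a key's relevance does not decay with distance, which is what lets far-away middle tokens survive selection.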

Figure 4: Attention distribution of a correct case with position embedding.
Implications for Future Research
This study's implications are broad, opening pathways for more efficient LLM deployment in tasks that demand extended context comprehension. LongCache sets a precedent for future research on context handling, especially further optimization of the cache-selection step for efficient GPU execution in practical applications.
Conclusion
In summary, the paper introduces LongCache, an innovative and easily implementable method for extending the context length of LLMs. By decoupling context length from training-time adjustments and expanded memory, LongCache offers a practical framework for effectively unbounded context within a finite computational budget. Future work centers on optimizing such systems for greater efficiency and broader applicability in domains that require processing and comprehending extensive documents.