Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing

Published 18 Feb 2025 in cs.CL | (2502.12962v1)

Abstract: Limited by the context window size of large language models (LLMs), handling tasks whose input tokens exceed the upper limit has been challenging, whether it is a simple direct retrieval task or a complex multi-hop reasoning task. Although various methods have been proposed to enhance the long-context processing capabilities of LLMs, they either incur substantial post-training costs, require additional tool modules (e.g., RAG), or have not shown significant improvement in realistic tasks. Our work observes the correlation between the attention distribution and generated answers across each layer, and establishes through experiments that attention allocation aligns with retrieval-augmented capabilities. Drawing on these insights, we propose a novel method, InfiniRetri, that leverages the LLM's own attention information to enable accurate retrieval across inputs of infinite length. Our evaluations indicate that InfiniRetri achieves 100% accuracy in the Needle-In-a-Haystack (NIH) test over 1M tokens using a 0.5B parameter model, surpassing other methods and larger models and setting a new state of the art (SOTA). Moreover, our method achieves significant performance improvements on real-world benchmarks, with a maximum improvement of 288%. In addition, InfiniRetri can be applied to any Transformer-based LLM without additional training and substantially reduces inference latency and compute overhead on long texts. In summary, our comprehensive studies show InfiniRetri's potential for practical applications and create a paradigm for retrieving information using LLMs' own capabilities under infinite-length tokens. Code will be released in link.

Summary

The paper introduces InfiniRetri, a novel training-free method that leverages intrinsic attention to retrieve information from unlimited-length contexts.
It demonstrates robust performance on tasks like the Needle-In-a-Haystack test, achieving 100% accuracy and significant improvements on LongBench datasets.
The method reduces computational overhead with optimized cache management and dynamic chunk processing, paving the way for scalable real-world applications.

Introduction

The paper "Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing" addresses the limitations LLMs face on tasks whose input exceeds their context window size. Current methods either incur significant post-training costs, require additional tools, or show limited improvement on realistic tasks. The authors introduce InfiniRetri, a method that uses the intrinsic attention patterns of LLMs to process long contexts without additional training. It reaches 100% accuracy on the Needle-In-a-Haystack (NIH) test, surpassing previous state-of-the-art (SOTA) methods.

Technical Approach

The core insight of the paper is the observed correlation between attention distribution and retrieval-augmented capabilities across LLM layers. InfiniRetri exploits this alignment to accurately retrieve information from inputs of unlimited length using the models' inherent attention mechanisms rather than relying on external modules. This approach not only enhances retrieval capabilities but also reduces computational overhead and inference latency.

Observations and Methodology

The authors provide a detailed analysis of attention score distributions across various layers of LLMs. They demonstrate that layers closer to the model's output exhibit more pronounced attention patterns, aligning with question-focused retrieval capabilities. Figure 1 illustrates these attention patterns across layers, highlighting how they contribute to effective retrieval.

Figure 1: Visual representation of attention scores in Qwen2-7B-Instruct showing alignment with retrieval-augmented patterns.
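One way to make the layer-wise observation concrete is to measure, for each layer, how much of the question tokens' attention falls on the span that contains the answer. The snippet below is a minimal sketch of that idea, not the authors' analysis code: the function names are hypothetical, each layer's attention is assumed to be already averaged over heads into a question-by-context matrix, and the toy numbers are invented for illustration rather than taken from the paper.

```python
from typing import List, Sequence


def answer_span_mass(
    layer_attn: Sequence[Sequence[float]],
    answer_span: range,
) -> float:
    """Fraction of one layer's attention mass that lands on the answer span.

    layer_attn holds one row per question token and one column per context
    token (assumed to be averaged over heads beforehand).
    """
    total = sum(sum(row) for row in layer_attn)
    if total == 0:
        return 0.0
    on_answer = sum(row[j] for row in layer_attn for j in answer_span)
    return on_answer / total


def per_layer_mass(
    attn_by_layer: Sequence[Sequence[Sequence[float]]],
    answer_span: range,
) -> List[float]:
    """Apply answer_span_mass to every layer, from input side to output side."""
    return [answer_span_mass(layer, answer_span) for layer in attn_by_layer]


# Toy data (invented): one question token, four context tokens, two layers.
toy_attention = [
    [[0.25, 0.25, 0.25, 0.25]],  # earlier layer: attention spread evenly
    [[0.10, 0.10, 0.70, 0.10]],  # later layer: attention peaks at position 2
]
masses = per_layer_mass(toy_attention, answer_span=range(2, 3))
print(masses)
```

In this toy input the later layer places more of its attention on the answer position than the earlier one, which is the kind of pattern the paragraph above describes.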

The workflow of InfiniRetri involves segmenting the input text into manageable chunks and iteratively processing these segments while dynamically merging them with cached tokens. Figure 2 outlines this process, illustrating how the method maintains context and enables long-context processing.

Figure 2: Entire Workflow of Our Method InfiniRetri for Enhancing Long-Context Processing in LLMs.
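The segment-and-merge loop described above can be sketched in a few lines. This is an illustrative outline under stated assumptions, not the released implementation: `split_into_chunks`, `iterate_with_cache`, and the `select_tokens` callback are hypothetical names, and the selector used in the example is a trivial stand-in (it keeps tokens that also occur in the question) where the paper uses the model's own attention.

```python
from typing import Callable, List, Sequence

# A selector takes (merged token IDs, question token IDs) and returns the
# token IDs worth keeping for the next step. Hypothetical interface.
Selector = Callable[[Sequence[int], Sequence[int]], List[int]]


def split_into_chunks(token_ids: Sequence[int], chunk_size: int) -> List[List[int]]:
    """Cut the input into consecutive chunks, preserving order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        list(token_ids[i:i + chunk_size])
        for i in range(0, len(token_ids), chunk_size)
    ]


def iterate_with_cache(
    token_ids: Sequence[int],
    question_ids: Sequence[int],
    chunk_size: int,
    select_tokens: Selector,
) -> List[int]:
    """Process chunks in order, merging each with the tokens cached so far."""
    cache: List[int] = []
    for chunk in split_into_chunks(token_ids, chunk_size):
        merged = cache + chunk  # cached tokens come first, so order is kept
        cache = select_tokens(merged, question_ids)
    return cache


def keep_overlap(merged: Sequence[int], question: Sequence[int]) -> List[int]:
    """Placeholder selector (assumption): keep tokens that appear in the question."""
    wanted = set(question)
    return [t for t in merged if t in wanted]


document = list(range(20))   # stand-in for a long tokenized text
question = [7, 13]           # stand-in for tokenized question
result = iterate_with_cache(document, question, chunk_size=5, select_tokens=keep_overlap)
print(result)
```

Because only the selected tokens are carried forward, the working input at each step stays close to one chunk in length regardless of how long the full document is.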

Key Technical Components

Segment and Slide Window: Replicates human reading by processing text in segments, maintaining sequence order to enhance processing without finetuning.
Retrieval in Attention: Leverages layer-wise attention patterns to efficiently locate relevant information in large contexts.
Cache Management: Optimizes cache usage by storing token IDs rather than key-value pairs, facilitating longer context handling with lower computational cost.
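The last two components can be illustrated together: score each context token by the attention it receives from the question tokens, then cache the IDs of the highest-scoring tokens instead of their key-value states. The sketch below is a simplified token-level version written for illustration; the function names, the plain sum over question rows, and the `top_k` parameter are assumptions and may differ from how the paper aggregates attention or chooses what to keep.

```python
from typing import List, Sequence


def score_context_tokens(attn: Sequence[Sequence[float]]) -> List[float]:
    """Sum attention over question rows to get one score per context position."""
    if not attn:
        return []
    width = len(attn[0])
    return [sum(row[j] for row in attn) for j in range(width)]


def select_token_ids(
    context_ids: Sequence[int],
    attn: Sequence[Sequence[float]],
    top_k: int,
) -> List[int]:
    """Return the IDs of the top_k most-attended context tokens, in text order.

    Only token IDs are returned (and would be cached); no key-value tensors
    are stored.
    """
    scores = score_context_tokens(attn)
    if len(scores) != len(context_ids):
        raise ValueError("attention width must match number of context tokens")
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep_positions = sorted(ranked[:top_k])  # restore original reading order
    return [context_ids[j] for j in keep_positions]


# Toy data (invented): five context tokens, two question tokens.
context_ids = [101, 102, 103, 104, 105]
attention = [
    [0.05, 0.60, 0.05, 0.20, 0.10],
    [0.10, 0.50, 0.10, 0.25, 0.05],
]
cached_ids = select_token_ids(context_ids, attention, top_k=2)
print(cached_ids)
```

Storing IDs keeps the cache small, at the price of re-encoding the cached tokens together with each new segment.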

Experimental Results

The evaluation of InfiniRetri shows substantial improvements on long-context processing tasks. On the NIH test, the method reaches 100% accuracy over 1M tokens using a 0.5B parameter model; Figure 3 shows the comparison for Mistral-7B-Instruct-v0.2.

Figure 3: Performance Comparison on the Needle in a Haystack Task Using Mistral-7B-Instruct-v0.2.

InfiniRetri also outperforms existing methods on LongBench datasets, with the largest gains on multi-document QA tasks such as HotpotQA, where the improvement reaches up to 369.6% with certain models.

Implications and Future Work

InfiniRetri presents a compelling training-free solution to LLM long-context processing, offering practical advantages in scenarios requiring extensive retrieval and question-answering capabilities. Its compatibility with existing Transformer-based architectures highlights its adaptability and potential for widespread application. This methodology not only sets new performance benchmarks but also proposes a paradigm shift from traditional context window scaling to enhancing intrinsic model capabilities.

Future research may focus on optimizing the method for summarization tasks, exploring integration with multimodal data, and expanding its applications across diverse NLP tasks, facilitating broader adoption in industry settings.

Conclusion

InfiniRetri uses the attention mechanisms LLMs already have to retrieve information from inputs of unbounded length. The work reports gains in both efficiency and accuracy, and because the method needs no additional training, it can be applied directly to existing Transformer-based models for long-context NLP tasks.
