Training-Free Long-Context Scaling of Large Language Models
Abstract: The ability of LLMs to process and generate coherent text is markedly weakened when the number of input tokens exceeds their pretraining length. Given the expensive overhead of finetuning large-scale models with longer sequences, we propose Dual Chunk Attention (DCA), which enables Llama2 70B to support context windows of more than 100k tokens without continual training. By decomposing the attention computation for long sequences into chunk-based modules, DCA effectively captures the relative positional information of tokens within the same chunk (Intra-Chunk) and across distinct chunks (Inter-Chunk), and integrates seamlessly with Flash Attention. In addition to its impressive extrapolation capability, DCA achieves performance on practical long-context tasks that is comparable to or even better than that of finetuned models. When compared with proprietary models, our training-free 70B model attains 94% of the performance of gpt-3.5-16k, indicating it is a viable open-source alternative. All code and data used in this work are released at \url{https://github.com/HKUNLP/ChunkLlama}.
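The core idea the abstract describes, keeping every relative position the attention sees within the pretrained range by computing positions chunk-wise, can be illustrated with a toy sketch. The function below is a simplified illustration, not the released ChunkLlama implementation: exact relative distances are used inside a chunk (Intra-Chunk), while distances across chunks (Inter-Chunk) are capped so they never exceed the pretraining window. The function name and parameters (`chunk_size`, `max_pos`) are illustrative assumptions.

```python
import numpy as np

def chunked_rel_pos(seq_len: int, chunk_size: int, max_pos: int) -> np.ndarray:
    """Toy causal relative-position matrix M[i, j] for query i, key j (j <= i).

    Intra-chunk pairs keep their true distance; inter-chunk pairs are
    clamped to max_pos - 1 so no position index exceeds the pretrained range.
    """
    M = np.zeros((seq_len, seq_len), dtype=int)
    for i in range(seq_len):
        for j in range(i + 1):  # causal mask: only attend to the past
            if i // chunk_size == j // chunk_size:
                # Intra-Chunk: exact relative distance, always < chunk_size
                M[i, j] = i - j
            else:
                # Inter-Chunk: cap the distance at the pretrained window
                M[i, j] = min(i - j, max_pos - 1)
    return M
```

With `seq_len=8`, `chunk_size=4`, `max_pos=6`, the pair (7, 0) is 7 tokens apart but receives the clamped index 5, so the model never sees a position index it was not trained on. The real method additionally distinguishes successive chunks to preserve locality near chunk boundaries, a detail this sketch omits.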
- L-eval: Instituting standardized evaluation for long context language models. arXiv preprint arXiv:2307.11088, 2023.
- Anthropic. Introducing 100K Context Windows, 2023. URL https://www.anthropic.com/index/100k-context-windows.
- Clex: Continuous length extrapolation for large language models, 2023a.
- Extending context window of large language models via positional interpolation, 2023b.
- Longlora: Efficient fine-tuning of long-context large language models. arXiv:2309.12307, 2023c.
- Dissecting transformer length extrapolation via the lens of receptive field analysis, 2023.
- Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
- Monotonic location attention for length generalization, 2023.
- Computer, T. Redpajama: an open dataset for training large language models, 2023. URL https://github.com/togethercomputer/RedPajama-Data.
- Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning, 2023.
- Flashattention: Fast and memory-efficient exact attention with io-awareness. In NeurIPS, 2022.
- A dataset of information-seeking questions and answers anchored in research papers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4599–4610, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.365. URL https://aclanthology.org/2021.naacl-main.365.
- Lm-infinite: Simple on-the-fly length generalization for large language models, 2023.
- Two stones hit one bird: Bilevel positional encoding for better length extrapolation, 2024.
- Lora: Low-rank adaptation of large language models, 2021.
- Llm maybe longlm: Self-extend llm context window without tuning, 2024.
- The impact of positional encoding on length generalization in transformers, 2023.
- The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328, 2018. doi: 10.1162/tacl_a_00023. URL https://aclanthology.org/Q18-1023.
- Prompted llms as chatbot modules for long open-domain conversation. In Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.findings-acl.277. URL http://dx.doi.org/10.18653/v1/2023.findings-acl.277.
- How long can open-source llms truly promise on context length?, 2023a.
- Functional interpolation for relative positions improves long context transformers, 2023b.
- Lost in the middle: How language models use long contexts, 2023a.
- Scaling laws of rope-based extrapolation, 2023b.
- LMSYS. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
- LocalLLaMA. Dynamically scaled rope further increases performance of long context llama with zero fine-tuning, July 2023a. URL https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/.
- LocalLLaMA. Ntk-aware scaled rope allows llama models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation., June 2023b. URL https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/.
- Landmark attention: Random-access infinite context length for transformers. arXiv preprint arXiv:2305.16300, 2023.
- MosaicML. Introducing mpt-30b: Raising the bar for open-source foundation models, 2023a. URL www.mosaicml.com/blog/mpt-30b. Accessed: 2023-06-22.
- MosaicML. Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023b. URL www.mosaicml.com/blog/mpt-7b.
- OpenAI. Gpt-4 technical report, 2023.
- QuALITY: Question answering with long input texts, yes! In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5336–5358, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.391. URL https://aclanthology.org/2022.naacl-main.391.
- Yarn: Efficient context window extension of large language models, 2023.
- Train short, test long: Attention with linear biases enables input length extrapolation, 2022.
- Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models. ArXiv, abs/2401.04658, 2024. URL https://api.semanticscholar.org/CorpusID:266900042.
- Compressive transformers for long-range sequence modelling. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=SylKikSYDH.
- Parallel context windows for large language models, 2023.
- The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389, 2009.
- Code llama: Open foundation models for code, 2023.
- Procedural text mining with large language models, 2023.
- Randomized positional encodings boost length generalization of transformers, 2023.
- Pdftriage: Question answering over long, structured documents, 2023.
- Zebra: Extending context window with layerwise grouped local-global attention, 2023.
- Su, J. Rectified rotary position embeddings. https://github.com/bojone/rerope, 2023.
- Roformer: Enhanced transformer with rotary position embedding, 2022.
- A length-extrapolatable transformer, 2022.
- Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- Together. Llama-2-7b-32k-instruct — and fine-tuning for llama-2 models with together api, 2023. URL https://together.ai/blog/llama-2-7b-32k-instruct.
- Llama: Open and efficient foundation language models, 2023a.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Focused transformer: Contrastive training for context scaling, 2023.
- Attention is all you need, 2017.
- Learning to retrieve in-context examples for large language models, 2024.
- Leveraging large language models to power chatbots for collecting user self-reported data, 2023.
- Efficient streaming language models with attention sinks, 2023.
- Effective long-context scaling of foundation models. CoRR, abs/2309.16039, 2023. doi: 10.48550/ARXIV.2309.16039. URL https://doi.org/10.48550/arXiv.2309.16039.
- Compositional exemplars for in-context learning. arXiv preprint arXiv:2302.05698, 2023.
- Linear attention via orthogonal memory. ArXiv, abs/2312.11135, 2023. URL https://api.semanticscholar.org/CorpusID:266359128.
- Soaring from 4k to 400k: Extending llm’s context with activation beacon. ArXiv, abs/2401.03462, 2024. URL https://api.semanticscholar.org/CorpusID:266844488.
- QMSum: A new benchmark for query-based multi-domain meeting summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5905–5921, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.472. URL https://aclanthology.org/2021.naacl-main.472.
- Pose: Efficient context window extension of llms via positional skip-wise training, 2023.