
YaRN: Efficient Context Window Extension of Large Language Models

Published 31 Aug 2023 in cs.CL, cs.AI, and cs.LG | (2309.00071v2)

Abstract: Rotary Position Embeddings (RoPE) have been shown to effectively encode positional information in transformer-based LLMs. However, these models fail to generalize past the sequence length they were trained on. We present YaRN (Yet another RoPE extensioN method), a compute-efficient method to extend the context window of such models, requiring 10x fewer tokens and 2.5x fewer training steps than previous methods. Using YaRN, we show that LLaMA models can effectively utilize and extrapolate to context lengths much longer than their original pre-training would allow, while also surpassing the previous state-of-the-art at context window extension. In addition, we demonstrate that YaRN exhibits the capability to extrapolate beyond the limited context of a fine-tuning dataset. The models fine-tuned using YaRN have been made available and reproduced online up to 128k context length at https://github.com/jquesnelle/yarn

Citations (165)

Summary

  • The paper introduces YaRN, a method leveraging NTK-aware and NTK-by-parts interpolation to efficiently extend LLM context windows.
  • It demonstrates robust performance with low perplexity on long sequences using only 400 training steps and 0.1% of the original dataset.
  • The approach employs dynamic scaling and attention adjustments to preserve both local and global positional information in extended contexts.

Efficient Context Window Extension of LLMs

Introduction

The paper "YaRN: Efficient Context Window Extension of LLMs" (2309.00071) addresses a critical limitation of LLMs in natural language processing: the inability to generalize beyond the sequence length encountered during training. It introduces YaRN, a compute-efficient method for extending the context window of models using Rotary Position Embeddings (RoPE), such as LLaMA, GPT-NeoX, and PaLM. The proposed method significantly reduces the number of tokens and training steps required compared to existing methods.

Rotary Position Embeddings have been effective in encoding positional information in transformer architectures, but their extrapolation capabilities have been limited. Existing techniques like Position Interpolation (PI) and Dynamic NTK interpolation offer partial solutions but lack the ability to generalize to substantially longer sequences or maintain performance across various tasks.
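To make the baseline concrete, Position Interpolation can be sketched as uniformly shrinking every RoPE position index by the extension factor $s$, so a longer sequence reuses the angle range the model saw in training. The following is a minimal NumPy illustration under that reading; the helper name `rope_angles` is ours, not from any reference implementation:

```python
import numpy as np

def rope_angles(positions, dim=64, base=10000.0, scale=1.0):
    """Per-pair rotation angles m * theta_d for RoPE.

    scale < 1 implements Position Interpolation (PI): every position
    index is shrunk by the same factor 1/s so that a longer sequence
    maps back into the trained angle range. Illustrative sketch only.
    """
    # RoPE frequencies: theta_d = base^(-2d/dim) for d = 0, 1, ..., dim/2 - 1
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    # Angle matrix of shape (num_positions, dim // 2)
    return np.outer(np.asarray(positions, dtype=float) * scale, inv_freq)

# Extending a 2048-token model to 8192 tokens with PI uses s = 4:
s = 8192 / 2048
angles = rope_angles(np.arange(8192), scale=1.0 / s)
# Position 8191 now sees the same angles position 2047.75 would have in training.
```

NTK-aware and NTK-by-parts interpolation, described below, refine exactly this step: instead of shrinking all positions uniformly, they rescale each frequency dimension differently.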

Methodology

YaRN combines several interpolation strategies that improve RoPE embedding scalability without extensive retraining:

  • NTK-aware Interpolation addresses the loss of high-frequency information when stretching RoPE embeddings indiscriminately. By spreading interpolation pressure across multiple dimensions, NTK-aware ensures high frequencies are retained while allowing low frequencies to be stretched more.
  • NTK-by-parts Interpolation segments RoPE embedding dimensions based on their periodic wavelengths. Dimensions with small wavelengths relative to the context size remain unchanged, ensuring local relationships are preserved, while those with larger wavelengths are interpolated to retain global positional information.
  • Dynamic Scaling adapts the scale factor dynamically during inference, allowing graceful degradation of performance as context length exceeds the trained limit. This technique complements NTK-aware and NTK-by-parts effectively, providing robust support for longer sequences without fine-tuning.
  • Attention Scaling introduces a temperature $t$ on the pre-softmax attention logits, reparameterized as a constant scaling of the RoPE embeddings so the attention code itself needs no modification.

    Figure 1: Fixing $s=8$, comparison of LLaMA-7b perplexity on 896 16k-token documents across different scalings $1/\sqrt{t}$.

    Figure 2: Fixing $s=8$, comparison of the mean perplexity change $\frac{\text{ppl}(t)-\text{ppl}(t=1)}{\text{ppl}(t=1)}$ at different segments of token positions on 896 16k-token documents across different scalings $1/\sqrt{t}$.
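The interpolation strategies above can be sketched in a few lines of NumPy. The ramp bounds $\alpha=1$, $\beta=32$ and the temperature rule $\sqrt{1/t} = 0.1\ln s + 1$ follow the paper's LLaMA settings; the function names and surrounding scaffolding are our own illustrative assumptions, not the reference code:

```python
import math
import numpy as np

def yarn_inv_freq(dim=128, base=10000.0, scale=8.0, orig_ctx=4096,
                  alpha=1.0, beta=32.0):
    """NTK-by-parts frequency correction (sketch)."""
    d = np.arange(0, dim, 2)
    inv_freq = base ** (-d / dim)          # theta_d
    wavelength = 2 * math.pi / inv_freq    # lambda_d
    r = orig_ctx / wavelength              # periods fitting in the original context
    # Ramp: gamma = 1 keeps a dimension unchanged (high frequency, local detail);
    # gamma = 0 interpolates it fully by 1/scale (low frequency, global position).
    gamma = np.clip((r - alpha) / (beta - alpha), 0.0, 1.0)
    return (1.0 - gamma) * inv_freq / scale + gamma * inv_freq

def dynamic_scale(seq_len, orig_ctx=4096):
    """Dynamic Scaling: grow s with the current inference-time sequence
    length instead of fixing it in advance."""
    return max(1.0, seq_len / orig_ctx)

def yarn_attention_temperature(scale):
    """Attention Scaling: sqrt(1/t) = 0.1 * ln(s) + 1, so logits are
    amplified slightly (t < 1) as the extension factor grows."""
    return 1.0 / (0.1 * math.log(scale) + 1.0) ** 2
```

Note how the highest-frequency dimensions (many periods per context, $r > \beta$) pass through untouched, while the lowest-frequency dimensions ($r < \alpha$) are interpolated exactly as in PI.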

Experiments

The training procedure for YaRN involved fine-tuning LLaMA-2 models to achieve context window extensions up to 128k tokens. The methodology proved highly compute-efficient, surpassing alternatives at context window extension with about 400 training steps on approximately 0.1% of the original pre-training data.

Evaluation on GovReport and Proof-pile demonstrates the model's success in handling long sequences with low perplexity, confirming YaRN's ability to extrapolate context length beyond the limits of fine-tuning data.

Figure 3: Sliding window perplexity (S=256) of ten 128k Proof-pile documents truncated to evaluation context window size.


Figure 4: Sliding window perplexity (S=256) of ten 128k Proof-pile documents truncated to evaluation context window size.
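The sliding-window perplexity evaluation behind Figures 3 and 4 can be sketched as follows. Here `nll_fn` is a stand-in for a model call returning per-token negative log-likelihoods, and the helper is an assumption about the setup rather than the paper's evaluation harness:

```python
import numpy as np

def sliding_window_ppl(token_ids, nll_fn, window=4096, stride=256):
    """Perplexity over a long document using overlapping windows.

    Each window of `window` tokens is scored, but only the tokens not
    already covered by the previous window contribute to the average,
    so every token is counted exactly once with up to `window` tokens
    of left context.
    """
    nlls, counted = [], 0
    for start in range(0, len(token_ids), stride):
        end = min(start + window, len(token_ids))
        target_start = max(start, counted)           # skip already-scored tokens
        nll = nll_fn(token_ids[start:end])           # per-token NLLs for the window
        nlls.extend(nll[target_start - start:])
        counted = end
        if end == len(token_ids):
            break
    return float(np.exp(np.mean(nlls)))
```

With stride $S = 256$, consecutive windows overlap almost entirely, so most tokens are evaluated with near-maximal context while the document is traversed only once.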

Implications and Future Developments

YaRN represents a significant advancement in the efficient extension of context windows in LLMs, enabling models to attend to longer contexts with preserved performance. This ability is crucial for applying transformer-based models to tasks requiring extensive context understanding, such as document summarization and long content generation. Future developments could explore further optimizations to improve model extrapolation and scalability across diverse datasets, adapting the scaling parameters dynamically to different model architectures and task requirements.

The technical approaches introduced in YaRN can support broader applications of AI in areas demanding nuanced understanding beyond immediate context, enhancing the models' adaptability and performance in varied real-world scenarios.

Conclusion

The research presented in the paper establishes YaRN as an effective solution for context window extension with minimal computational overhead and robust extrapolation capabilities. By integrating dynamic adaptation within RoPE and utilizing focused interpolation strategies, YaRN achieves superior performance on long sequence processing, opening avenues for further exploration in AI scalability and task generalization.
