Abstract: LongRoPE2 is a novel approach that extends the effective context window of pre-trained LLMs to the target length, while preserving the performance on the original shorter context window. This is achieved by three contributions: (1) a hypothesis that insufficient training in higher RoPE dimensions contributes to the persistent out-of-distribution (OOD) issues observed in existing methods; (2) an effective RoPE rescaling algorithm that adopts evolutionary search guided by "needle-driven" perplexity to address the insufficient training problem; (3) a mixed context window training approach that fine-tunes model weights to adopt rescaled RoPE for long-context sequences while preserving the short-context performance with the original RoPE. Extensive experiments on LLaMA3-8B and Phi3-mini-3.8B across various benchmarks validate the hypothesis and demonstrate the effectiveness of LongRoPE2. Remarkably, LongRoPE2 extends LLaMA3-8B to achieve a 128K effective context length while retaining over 98.5% of short-context performance, using only 10B tokens -- 80x fewer than Meta's approach, which fails to reach the target effective context length. Code will be available at https://github.com/microsoft/LongRoPE.
The paper introduces LongRoPE2, a method using evolutionary search and mixed context window training to achieve near-lossless 128K context window scaling while preserving short-context performance.
The method employs evolutionary search guided by "needle-driven" perplexity to identify critical RoPE dimensions and optimal rescaling factors for long-context attention.
Experimental results show LongRoPE2 outperforms prior methods on long-context benchmarks and retains 97.6% of original short-context performance with minimal training.
The paper introduces LongRoPE2, a method for extending the context window of pre-trained LLMs while preserving performance on shorter contexts. LongRoPE2 addresses the out-of-distribution (OOD) issue in rotary positional embeddings (RoPE), building on the hypothesis that higher RoPE dimensions are insufficiently trained, which undermines existing rescaling methods. The method combines a RoPE rescaling algorithm driven by evolutionary search guided by "needle-driven" perplexity (PPL) with mixed context window training.
The authors identify two major challenges in extending LLM context windows:
Existing rescaling methods fail to reach the target effective context length
Performance degrades on the original short context window
The authors attribute these issues to insufficient training in higher RoPE dimensions, resulting in shorter effective RoPE rotation ranges.
LongRoPE2 includes the following innovations:
A RoPE rescaling algorithm that uses evolutionary search to identify critical RoPE dimensions and optimal rescaling factors, guided by a "needle-driven" perplexity evaluation.
A mixed context window training approach, which fine-tunes model weights to adopt rescaled RoPE for long-context sequences while preserving short-context performance with the original RoPE.
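The mixed-window idea can be sketched as a per-sequence choice of positional encoding: sequences within the original window keep the original RoPE, longer ones use the rescaled RoPE (a minimal sketch with illustrative names; the actual training recipe is in the paper):

```python
import numpy as np

ORIG_WINDOW = 8192  # illustrative pre-trained context window

def rope_angles(seq_len, d, theta_base=10000.0, lam=None):
    """Rotation angle for position p, dimension i: p * theta_i / lambda_i."""
    i = np.arange(d // 2)
    theta = theta_base ** (-2.0 * i / d)
    if lam is not None:
        theta = theta / np.asarray(lam)
    p = np.arange(seq_len)
    return np.outer(p, theta)  # shape (seq_len, d/2)

def angles_for_sequence(seq_len, d, lam_rescaled):
    """Mixed context window training: original RoPE for short sequences,
    rescaled RoPE for sequences beyond the original window."""
    if seq_len <= ORIG_WINDOW:
        return rope_angles(seq_len, d)                  # original RoPE preserved
    return rope_angles(seq_len, d, lam=lam_rescaled)    # rescaled RoPE
```

Because short sequences are trained with the unmodified RoPE, the original-window behavior is not overwritten during long-context fine-tuning.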
Rescaling divides each RoPE frequency θ_i = θ_base^(−2i/d) by a per-dimension factor λ_i, where:
λ_i: rescaling factor for the i-th RoPE dimension
θ_base: a predefined RoPE base value
d: attention head dimension
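As a concrete illustration of these symbols, the standard RoPE frequencies and their rescaled counterparts can be sketched as follows (a minimal sketch of standard RoPE rescaling; the function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def rope_frequencies(d: int, theta_base: float = 10000.0) -> np.ndarray:
    """Standard RoPE: theta_i = theta_base^(-2i/d) for i = 0, ..., d/2 - 1."""
    i = np.arange(d // 2)
    return theta_base ** (-2.0 * i / d)

def rescaled_frequencies(d: int, lam: np.ndarray, theta_base: float = 10000.0) -> np.ndarray:
    """Divide each dimension's frequency by its rescaling factor lambda_i,
    stretching that dimension's rotation period by lambda_i."""
    return rope_frequencies(d, theta_base) / np.asarray(lam)

freqs = rope_frequencies(128)                       # 64 per-dimension frequencies
lam = np.full(64, 4.0)                              # illustrative uniform 4x extension
stretched = rescaled_frequencies(128, lam)          # each rotation slowed 4x
```

A uniform λ corresponds to position-interpolation-style scaling; the paper's point is that the optimal λ_i is dimension-dependent.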
The constraint to avoid OOD is defined as:
λ_i ≥ L / L_train, for i ≥ d_tcd
λ_i: rescaling factor for the i-th RoPE dimension
L: target context window size
L_train: pre-trained context window size
d_tcd: theoretical critical dimension
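The OOD constraint above can be checked programmatically; a minimal sketch (d_tcd is taken as a given input here, and the example factors are illustrative):

```python
def satisfies_ood_constraint(lam, L, L_train, d_tcd):
    """OOD check: for every dimension i >= d_tcd, the rescaling factor
    lambda_i must be at least the extension ratio L / L_train, so that
    rotation angles in those dimensions stay within the trained range."""
    ratio = L / L_train
    return all(lam[i] >= ratio for i in range(d_tcd, len(lam)))

# Extending an 8K-trained model to 128K is a 16x ratio.
lam = [1.0] * 32 + [16.0] * 32   # illustrative: lower dims untouched, higher dims scaled
ok = satisfies_ood_constraint(lam, 128_000, 8_000, d_tcd=32)    # satisfied
bad = satisfies_ood_constraint(lam, 128_000, 8_000, d_tcd=16)   # violated
```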
The evolutionary search identifies the real critical dimension d_rcd and the optimal rescaling factors via the following steps:
Initialize d_rcd and the rescaling factors
Generate L-token evaluation documents with inserted "needles"
Compute the needle-driven PPL for each candidate by applying its rescaling factors to the LLM and evaluating it on these documents
The θ_base corresponding to d_rcd is updated after each mutation, and NTK scaling is applied to the rescaling factors in the lower-dimension group.
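The search loop above can be sketched as a small elitist evolutionary search; the perplexity evaluator is abstracted as a callback (in the paper's setting it is the model's PPL on needle-inserted long documents), and all names here are illustrative:

```python
import random

def evolutionary_search(d_half, ratio, eval_ppl, pop_size=8, generations=40, seed=0):
    """Toy evolutionary search over per-dimension RoPE rescaling factors.
    eval_ppl(lam) stands in for the needle-driven perplexity of the rescaled
    model (lower is better); factors are kept non-decreasing across dimensions."""
    rng = random.Random(seed)

    def random_candidate():
        return sorted(rng.uniform(1.0, 2.0 * ratio) for _ in range(d_half))

    def mutate(lam):
        child = list(lam)
        i = rng.randrange(d_half)
        child[i] *= rng.uniform(0.8, 1.25)  # perturb one dimension's factor
        return sorted(child)                # restore monotonicity

    population = sorted((random_candidate() for _ in range(pop_size)), key=eval_ppl)
    for _ in range(generations):
        parents = population[: max(1, pop_size // 2)]
        children = [mutate(rng.choice(parents)) for _ in range(pop_size)]
        population = sorted(population + children, key=eval_ppl)[:pop_size]
    return population[0]  # best-scoring rescaling factors found
```

With a toy objective such as `lambda lam: sum((x - ratio) ** 2 for x in lam)`, the elitist loop monotonically improves the best score; the real objective simply swaps in the needle-PPL evaluation.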
The paper presents experiments on LLaMA3-8B and Phi3-mini-3.8B. The models were extended to a 128K context window and mid-trained on 64 A100 GPUs with a 10B-token dataset. Baselines include state-of-the-art RoPE rescaling methods such as YaRN, NTK, and LongRoPE.
The evaluation included:
Long-context stress tests, including RULER and Needle in a Haystack
Real-world long-context benchmarks including LOFT, InfiniteBench, and LongBench
Standard benchmarks within a 4096-token context.
Key results include:
LongRoPE2 consistently outperforms prior methods on RULER, achieving superior results across all evaluation lengths within the 128K window
LongRoPE2 achieves near-perfect accuracy across all evaluation lengths within the 128K context window in the Needle-in-a-Haystack test
On real-world benchmarks, LongRoPE2 consistently improves performance across all tasks, demonstrating strong generalization to practical scenarios
Ablation studies validated:
The effectiveness of the real critical dimension d_rcd
The effectiveness of the needle-PPL guided search
The effectiveness of mixed context window training
The authors conclude by noting that LongRoPE2 uses evolutionary search-guided rescaling and mixed context window training to achieve a 128K effective context length with just 10B tokens, retaining 97.6% of the original short-context performance.