Adapting language models to handle sequences longer than those seen during training

Develop effective methods to adapt pretrained transformer-based language models so that, at inference time, they can handle sequences that exceed the context length seen during pretraining.

Background

Transformer self-attention scales quadratically with sequence length, making pretraining at long contexts computationally prohibitive. When inference sequences exceed the training context, models that use explicit positional embeddings (such as RoPE) encounter out-of-distribution positional phases, and performance degrades sharply.
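To make the failure mode concrete, here is a minimal sketch of the RoPE rotation angles (the "positional phases") at each position. The head dimension, frequency base, and training length below are illustrative assumptions, not values from the cited paper; the point is only that positions beyond the pretraining range produce angle combinations the model has never observed.

```python
import numpy as np

def rope_angles(position, dim=8, base=10000.0):
    """Rotation angles applied by RoPE at a given token position.

    Each channel pair i is rotated by position * base**(-2i/dim).
    Positions beyond the pretraining range yield phase combinations
    never encountered during training (out-of-distribution).
    """
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)  # per-pair frequencies
    return position * inv_freq

train_len = 16  # hypothetical pretraining context length
angles_in = rope_angles(train_len - 1)   # last in-distribution position
angles_out = rope_angles(4 * train_len)  # inference position far past training
```

Because the lowest frequency is 1, the fastest-rotating channel's phase grows linearly with position, so `angles_out` lies well outside the range spanned during pretraining.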

Prior RoPE frequency-scaling methods (e.g., PI, RoPE-NTK, YaRN) aim to mitigate these issues, but they typically still require long-context finetuning and do not generalize out of the box to downstream tasks. This motivates the need for robust approaches that adapt LLMs to sequences longer than those seen during training.
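As a reference point for what these frequency-scaling methods do, the sketch below implements the simplest of them, Position Interpolation (PI): positions are rescaled by the ratio of training to inference length so every inference position maps back into the trained phase range. Dimensions and lengths are hypothetical; NTK and YaRN variants instead rescale the frequency base, non-uniformly in YaRN's case.

```python
import numpy as np

def rope_angles(position, dim=8, base=10000.0):
    # Standard RoPE phases: position * base**(-2i/dim) per channel pair.
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return position * inv_freq

def pi_angles(position, train_len, infer_len, dim=8, base=10000.0):
    # Position Interpolation: shrink positions by train_len / infer_len
    # so inference positions [0, infer_len] map into [0, train_len].
    return rope_angles(position * train_len / infer_len, dim=dim, base=base)

train_len, infer_len = 16, 64  # illustrative lengths
# The rescaled phase at inference position infer_len coincides with the
# original phase at training position train_len:
scaled = pi_angles(infer_len, train_len, infer_len)
```

The trade-off motivating finetuning: interpolation keeps phases in-distribution but compresses the angular distance between adjacent tokens, which the model was not trained to resolve.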

References

Given the quadratically growing cost of self-attention, adapting LMs to sequences longer than those seen during training has been a longstanding open problem.

Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings (2512.12167 - Gelberg et al., 13 Dec 2025), Section 2, Context extension for RoPE