Qwen2.5-1M Technical Report

Published 26 Jan 2025 in cs.CL | (2501.15383v1)

Abstract: We introduce Qwen2.5-1M, a series of models that extend the context length to 1 million tokens. Compared to the previous 128K version, the Qwen2.5-1M series have significantly enhanced long-context capabilities through long-context pre-training and post-training. Key techniques such as long data synthesis, progressive pre-training, and multi-stage supervised fine-tuning are employed to effectively enhance long-context performance while reducing training costs. To promote the use of long-context models among a broader user base, we present and open-source our inference framework. This framework includes a length extrapolation method that can expand the model context lengths by at least four times, or even more, without additional training. To reduce inference costs, we implement a sparse attention method along with chunked prefill optimization for deployment scenarios and a sparsity refinement method to improve precision. Additionally, we detail our optimizations in the inference engine, including kernel optimization, pipeline parallelism, and scheduling optimization, which significantly enhance overall inference performance. By leveraging our inference framework, the Qwen2.5-1M models achieve a remarkable 3x to 7x prefill speedup in scenarios with 1 million tokens of context. This framework provides an efficient and powerful solution for developing applications that require long-context processing using open-source models. The Qwen2.5-1M series currently includes the open-source models Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, as well as the API-accessed model Qwen2.5-Turbo. Evaluations show that Qwen2.5-1M models have been greatly improved in long-context tasks without compromising performance in short-context scenarios. Specifically, the Qwen2.5-14B-Instruct-1M model significantly outperforms GPT-4o-mini in long-context tasks and supports contexts eight times longer.

Summary

  • The paper introduces extended context handling by pushing LLMs from 128K to 1M tokens using progressive pre-training and advanced attention strategies.
  • The paper demonstrates the implementation of Dual Chunk Attention and sparse attention mechanisms, which yield a 3x to 7x speedup in inference performance.
  • The paper outlines a multi-stage supervised fine-tuning and reinforcement learning approach to balance long-context capabilities with short-context efficiency.

The Qwen2.5-1M Technical Report introduces a series of models from Alibaba Group that push the boundaries of long-context processing in LLMs, extending context lengths to 1 million tokens from the previous 128K. By innovating in both training and inference methodologies, the Qwen2.5-1M series aims to enhance the applicability of LLMs for complex, real-world tasks that necessitate handling extensive information.

Model Development and Features

Architecture and Training

The Qwen2.5-1M models are built upon the Transformer architecture used in the Qwen2.5 series, incorporating notable features such as Grouped Query Attention (GQA), SwiGLU activation function, and Rotary Positional Embeddings (RoPE). The series includes open-source models Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, alongside Qwen2.5-Turbo for API access.

The models are pre-trained using both natural and synthetic datasets to capture long-range token dependencies more effectively. Progressive context length expansion during training is employed to manage GPU memory consumption efficiently without compromising on learning long-context tasks. The series demonstrates extended capabilities in a variety of benchmarks, outperforming previous versions on long-context tasks.

Key Techniques

  • Long Data Synthesis: This involves creating datasets that prioritize long-range dependencies through synthetic data tasks such as Fill in the Middle, Keyword-Based Retrieval, and Paragraph Reordering.
  • Progressive Pre-training: Context length is gradually increased, starting from 4096 tokens up to 262,144 tokens, while adjusting RoPE's base frequency to handle longer contexts.
  • Multi-Stage Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL): To balance performance across sequence lengths and improve human alignment, a multi-stage SFT and RL approach is adopted.
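The Fill in the Middle synthesis task above can be sketched as follows. This is a minimal illustration of the idea, not the report's actual pipeline, and the sentinel token names are hypothetical placeholders rather than Qwen's special-token vocabulary:

```python
import random

def make_fim_example(text: str, rng: random.Random) -> str:
    """Cut a document into prefix/middle/suffix and format a training
    example that asks the model to recover the middle span.
    Sentinel tokens here are illustrative placeholders."""
    n = len(text)
    i, j = sorted(rng.sample(range(1, n), 2))  # two cut points
    prefix, middle, suffix = text[:i], text[i:j], text[j:]
    # Prefix and suffix are given as context; the middle is the target.
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{middle}"
```

Because the prefix and suffix may be far apart in a long document, recovering the middle forces the model to attend across long ranges.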

Inference and Deployment

Framework and Efficiency

The inference framework is designed to reduce serving costs while maintaining high performance. This is accomplished by combining a training-free length extrapolation method with sparse attention and chunked prefill optimizations; together, these techniques yield a 3x to 7x prefill speedup at 1-million-token contexts.
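The chunked prefill optimization described in the report can be sketched as a loop that processes the prompt in bounded pieces so activation memory stays constant; `forward_chunk` below is a hypothetical stand-in for one model forward pass over a chunk given the KV cache accumulated so far:

```python
def chunked_prefill(tokens, chunk_size, forward_chunk):
    """Process the prompt in fixed-size chunks instead of one giant
    forward pass. `forward_chunk(chunk, kv_cache)` is a placeholder
    for a model step that returns the extended KV cache."""
    kv_cache = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        kv_cache = forward_chunk(chunk, kv_cache)
    return kv_cache
```

The peak memory of each step is bounded by `chunk_size` rather than the full prompt length, which is what makes million-token prefill feasible on a fixed GPU budget.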

Key Innovations

  • Dual Chunk Attention (DCA): Remaps large relative positions into smaller ones, facilitating the handling of extended contexts (Figure 1).

    Figure 1: An illustration of Dual Chunk Attention (DCA). DCA remaps the relative positions to smaller numbers, avoiding large relative positions that were untested during training.
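The remapping idea can be sketched as follows; this simplifies DCA to a single clamp per query-key pair, whereas the paper uses a three-part scheme (intra-chunk, successive-chunk, and inter-chunk attention):

```python
def remap_rel_pos(q_idx: int, k_idx: int, chunk_size: int) -> int:
    """Simplified DCA-style remapping (a sketch, not the paper's exact
    formulation): keep exact distances within a chunk, and clamp
    cross-chunk distances to the largest relative position the model
    actually saw during training."""
    dist = q_idx - k_idx
    same_chunk = (q_idx // chunk_size) == (k_idx // chunk_size)
    if same_chunk:
        return dist                  # intra-chunk: true relative position
    return min(dist, chunk_size - 1)  # inter-chunk: clamp to trained range
```

The key property is that no relative position ever exceeds `chunk_size - 1`, so RoPE is only ever evaluated at offsets the model was trained on.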

  • Sparse Attention Mechanism: Inspired by MInference, this mechanism reduces the time complexity of attention to handle ultra-long sequences effectively. By employing patterns like Vertical-Slash, the model selectively computes attention only where critical (Figure 2).

Figure 2: Vertical-Slash pattern in MInference.
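A rough sketch of how a Vertical-Slash mask could be constructed, assuming dense NumPy scores for illustration (real implementations such as MInference estimate the pattern from a small sample of queries precisely to avoid materializing the full score matrix):

```python
import numpy as np

def vertical_slash_mask(scores, n_vertical, n_slash):
    """Build a boolean sparsity mask keeping the strongest key columns
    ("vertical" lines) and the strongest relative-offset diagonals
    ("slash" lines). Everything else is skipped during attention."""
    n = scores.shape[0]
    # Vertical lines: keys attended strongly by many queries.
    col_strength = scores.sum(axis=0)
    top_cols = np.argsort(col_strength)[-n_vertical:]
    # Slash lines: strong fixed relative offsets (causal diagonals).
    diag_strength = np.array([np.trace(scores, offset=-d) for d in range(n)])
    top_diags = np.argsort(diag_strength)[-n_slash:]
    mask = np.zeros((n, n), dtype=bool)
    mask[:, top_cols] = True
    for d in top_diags:
        idx = np.arange(n - d)
        mask[idx + d, idx] = True  # positions with q - k == d
    return np.tril(mask)  # enforce causality
```

Only `O(n * (n_vertical + n_slash))` entries survive, which is how attention cost drops from quadratic toward near-linear in sequence length.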

  • Length Extrapolation: Enabling the model to support longer contexts without re-training, this technique involves DCA and scaling mechanisms like YaRN to maintain performance over longer input sequences.
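The YaRN-style rescaling mentioned above can be sketched as "NTK-by-parts" interpolation of the RoPE inverse frequencies; the `beta_fast`/`beta_slow` thresholds below are common illustrative defaults, not the report's actual settings:

```python
import math

def yarn_scale_inv_freq(inv_freq, scale, orig_ctx, beta_fast=32, beta_slow=1):
    """Sketch of YaRN-style frequency scaling: low-frequency RoPE
    components (long wavelengths) are interpolated by `scale`,
    high-frequency ones are left untouched, and a linear ramp blends
    between the two regimes."""
    out = []
    for f in inv_freq:
        wavelength = 2 * math.pi / f
        ratio = orig_ctx / wavelength  # rotations over the trained context
        if ratio > beta_fast:          # high frequency: keep as-is
            out.append(f)
        elif ratio < beta_slow:        # low frequency: full interpolation
            out.append(f / scale)
        else:                          # ramp between the two regimes
            t = (ratio - beta_slow) / (beta_fast - beta_slow)
            out.append(f / scale * (1 - t) + f * t)
    return out
```

Because only the slow-rotating components are stretched, local token relationships (carried by fast components) are preserved while distant positions remain distinguishable, which is what allows context extension without retraining.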

Evaluation and Performance

Benchmarking and Results

The report provides extensive evaluation results, showing significant improvements on long-context tasks over contemporaries such as GPT-4o-mini. At the same time, the Qwen2.5-1M models maintain short-context performance, ensuring versatility.

Inference Speed and Efficiency

Performance benchmarks demonstrate notable speed improvements, particularly in Time to First Token (TTFT), across different GPUs and model configurations (Figure 3).

Figure 3: TTFT (Time to First Token) of Qwen2.5-7B-Instruct-1M, Qwen2.5-14B-Instruct-1M, Qwen2.5-Turbo on H20 and A100 GPUs.
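TTFT can be measured generically for any streaming generation interface; the harness below is a simple sketch (not the report's benchmark code), with `dummy_stream` standing in for a model whose prefill phase takes a fixed time:

```python
import time

def time_to_first_token(stream) -> float:
    """Wall-clock seconds from the start of consumption until the
    first token arrives; dominated by the prefill phase."""
    start = time.perf_counter()
    next(iter(stream))  # block until the first token is produced
    return time.perf_counter() - start

def dummy_stream(prefill_s: float):
    """Toy token stream: the sleep stands in for prefill compute."""
    time.sleep(prefill_s)
    yield "first-token"
    yield "second-token"
```

Since prefill scales with prompt length, TTFT is the metric where the 3x to 7x speedup from sparse attention and kernel optimizations is most visible.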

Conclusion

The Qwen2.5-1M series exemplifies advancements in extending context lengths for LLMs efficiently. Through innovative training strategies and optimized inference frameworks, these models offer practical solutions for real-world applications requiring extensive context processing. Future efforts will continue to refine these models to enhance both operational efficiency and scope of applications. The open-sourced components of the Qwen2.5-1M series significantly contribute to the broader AI research community, facilitating further innovations in language modeling.
