Qwen2.5-7B: 7B Transformer LLM
- Qwen2.5-7B is a 7-billion parameter, dense, decoder-only Transformer model designed for diverse tasks like math, code generation, and multilingual understanding.
- It leverages techniques such as Dual Chunk Attention and rotary positional encoding (RoPE) to support contexts up to 128K tokens (or 1M in the dedicated Qwen2.5-7B-Instruct-1M variant) with efficient inference.
- Extensive pre-training on 18 trillion tokens and fine-tuning via supervised and reinforcement learning drive state-of-the-art performance across academic benchmarks in reasoning, math, and coding.
Qwen2.5-7B is an open-weight, 7-billion parameter LLM in the Qwen2.5 series, engineered as a general-purpose, dense Transformer decoder. Building on foundational advances in pre-training scale, architecture refinement, and multi-stage post-training, Qwen2.5-7B is deployed extensively as both a base and instruction-tuned model. It constitutes the backbone for multiple specialized derivatives—including state-of-the-art math (Qwen2.5-Math-7B), code (Qwen2.5-Coder-7B), long-context (Qwen2.5-7B-Instruct-1M), and multilingual/Portuguese (Amadeus-Verbo Qwen2.5-7B) adaptations. Qwen2.5-7B consistently establishes or matches best-in-class results for its scale on diverse academic benchmarks in language understanding, reasoning, mathematics, coding, instruction following, and multilingual transfer (Qwen et al., 2024).
1. Architectural Specification
Qwen2.5-7B employs a dense, decoder-only Transformer with 28 layers, model hidden dimension $3584$, and grouped query attention (GQA) featuring 28 query heads and 4 key/value heads per layer, each with head dimension $128$. Its feed-forward layers use a SwiGLU gating mechanism with intermediate size $18944$, shared by the base model and the Coder/Math specializations. Rotary positional encoding (RoPE) is combined with ABF scaling and QKV bias for robust long-context extrapolation. Normalization is RMSNorm applied in pre-norm configuration. The model uses a vocabulary of approximately 151,646 tokens; embedding weights are not tied to the output head at this scale (Yang et al., 2024, Qwen et al., 2024, Hui et al., 2024).
All Qwen2.5-7B variants use the same core architectural block design with no mixture-of-experts (MoE) layers, ensuring a predictable inference footprint and full compatibility across base, instruction-tuned, long-context, code, and math-specialized versions. Dual Chunk Attention (DCA) with YARN scaling enables efficient 128K-token context support for the base and instruct models and up to 1M tokens for the Qwen2.5-7B-Instruct-1M variant (Qwen et al., 2024, Yang et al., 26 Jan 2025).
| Component | Qwen2.5-7B (base) | Specialized variants (Coder / 1M) |
|---|---|---|
| Layers | 28 | 28 |
| Hidden dimension | 3584 | 3584 |
| Attention heads | 28Q, 4KV | 28Q, 4KV |
| Head dimension | 128 | 128 |
| Feed-forward inner size | 18,944 | 18,944 (shared across variants) |
| Positional encoding | RoPE + ABF | RoPE (base frequency 1e6 for long context) |
| Norm/Activation | RMSNorm + SwiGLU | RMSNorm + SwiGLU |
| Context pre-train/infer | 32,768 / 128,000 | up to 1,000,000 (1M variant) |
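As a sanity check on the table, the short back-of-the-envelope script below approximately reproduces the published parameter counts (7.61B total, 6.53B non-embedding) from the tabulated shapes. The padded embedding width of 152,064 is an assumption about the released checkpoint, not a value from this article.

```python
# Back-of-the-envelope check of the architecture table above. Layer shapes
# are taken from the table; the padded embedding width of 152,064 (vs. the
# ~151,646-token vocabulary) is an assumption about the released checkpoint.
hidden, layers = 3584, 28
n_q, n_kv, head_dim = 28, 4, 128
ffn_inner = 18_944
embed_width = 152_064  # assumed padded embedding matrix width

q_proj = hidden * n_q * head_dim        # full-width query projection
kv_proj = 2 * hidden * n_kv * head_dim  # GQA shrinks K/V by the 28:4 ratio
o_proj = n_q * head_dim * hidden
ffn = 3 * hidden * ffn_inner            # SwiGLU: gate, up, and down matrices

per_layer = q_proj + kv_proj + o_proj + ffn
stack = layers * per_layer              # non-embedding parameters
embeddings = 2 * embed_width * hidden   # untied input and output embeddings
print(f"non-embedding: {stack / 1e9:.2f}B, "
      f"total: {(stack + embeddings) / 1e9:.2f}B")
# -> non-embedding: 6.53B, total: 7.62B (published: 6.53B / 7.61B)

# GQA also shrinks the KV cache: 2 (K and V) * layers * n_kv * head_dim.
kv_vals = 2 * layers * n_kv * head_dim
print(f"KV cache: {kv_vals} values/token "
      f"({kv_vals * 2 / 1024:.0f} KiB/token in FP16)")
```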
2. Pre-Training Regimen and Data
The model is pre-trained on 18 trillion tokens (up from 7T in Qwen2) drawn from a domain-rebalanced corpus: social media and e-commerce data are down-sampled, while technology, science, academic, code, and mathematical texts receive increased representation. High-quality math and code data originate from the Qwen2.5-Math and Qwen2.5-Coder corpora, combined with synthetic data generated by the larger Qwen2-72B-Instruct and Qwen2-Math-72B models. Quality is filtered by auxiliary reward models across multiple dimensions (Qwen et al., 2024).
The training objective is maximum-likelihood next-token prediction, $\mathcal{L}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t})$. Context length is first capped at 4,096 tokens, then curriculum-extended to 32,768. Long-sequence and position-retrieval samples are included for the long-context variants. Pre-training follows Chinchilla/Kaplan scaling-law-adjusted learning rates and batch sizes, with AdamW as the optimizer (Qwen et al., 2024, Yang et al., 26 Jan 2025).
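As a concrete illustration of that objective, here is a minimal PyTorch sketch of the shifted next-token cross-entropy; shapes and the function name are illustrative, not the Qwen training code.

```python
# Minimal sketch of the shifted next-token cross-entropy; illustrative only.
import torch
import torch.nn.functional as F

def lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """logits: [batch, seq, vocab]; input_ids: [batch, seq] token indices."""
    shift_logits = logits[:, :-1, :]  # position t predicts token t+1
    shift_labels = input_ids[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```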
Specialized models (e.g., Qwen2.5-Math-7B, Qwen2.5-Coder-7B) reuse the backbone but draw on domain-optimized pre-training mixtures: roughly 70% code, 20% text, and 10% math for Coder, and over 1T math tokens (including code-mixed data and exam questions) for Math. Both integrate synthetic CoT/TIR or code data distilled from strong teacher models (Yang et al., 2024, Hui et al., 2024).
3. Post-Training: Supervised and Reinforcement Methods
Following pre-training, Qwen2.5-7B undergoes:
- Supervised Fine-Tuning (SFT): Up to one million instruction-style examples from eight categories, including long-form response, chain-of-thought (CoT) math, multilingual code, logical reasoning, structured data understanding, instruction following (with execution-based rejection sampling), and prompt robustness. Training uses two epochs with context length up to 32K tokens and a learning rate annealed from $7 \times 10^{-6}$ to $7 \times 10^{-7}$, with weight decay of 0.1 and gradient clipping at norm 1.0 (Qwen et al., 2024).
- Offline RL (DPO): Direct Preference Optimization on 150,000 preference pairs (one preferred and one dispreferred response each) under the standard DPO loss; see the sketch after this list. An "Online Merging Optimizer" reduces the alignment tax.
- Online RL (GRPO): Group Relative Policy Optimization for dialogue and preference alignment, using a PPO-style clipped surrogate with rewards normalized within groups of sampled responses (Qwen et al., 2024).
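The following minimal PyTorch sketch shows the standard DPO objective referenced above, operating on summed response log-probabilities; the signature and `beta` value are illustrative, not the Qwen team's implementation.

```python
# Minimal sketch of the standard DPO objective; illustrative only.
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp: torch.Tensor, pi_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each input: [batch] summed log-probabilities of a full response."""
    chosen_margin = beta * (pi_chosen_logp - ref_chosen_logp)
    rejected_margin = beta * (pi_rejected_logp - ref_rejected_logp)
    # Push the policy to widen the gap between preferred and dispreferred.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```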
Specialized variants extend these steps. Qwen2.5-Math-7B uses an explicit self-improvement pipeline, iterating SFT with reward-model selection, RM-guided rejection sampling, and group policy optimization (a sketch of the group-relative advantage follows), while Qwen2.5-Coder-7B adds Fill-in-the-Middle (FIM) as an auxiliary objective and synthetic, execution-filtered code samples (Yang et al., 2024, Hui et al., 2024).
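A minimal sketch of the group-relative advantage at the core of GRPO, assuming G sampled responses per prompt scored by a reward model; the clipped surrogate and KL penalty that consume these advantages are omitted.

```python
# Minimal sketch of GRPO's group-relative advantage: sample G responses per
# prompt, score each with a reward model, and normalize within the group.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [num_prompts, G] scalar rewards for G samples per prompt."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)  # no learned value network needed
```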
4. Empirical Benchmarking
Qwen2.5-7B leads its class across standard academic benchmarks for LLMs. Representative scores (few/zero-shot):
| Task | Mistral-7B | Llama-3-8B | Gemma2-9B | Qwen2-7B | Qwen2.5-7B (base) | Qwen2.5-7B-Instruct |
|---|---|---|---|---|---|---|
| MMLU | 64.2 | 66.6 | 71.3 | 70.3 | 74.2 | – |
| BBH | 56.1 | 57.7 | 68.2 | 62.3 | 70.4 | – |
| GSM8K | 36.2 | 55.3 | 70.7 | 80.2 | 85.4 | 91.6 |
| MATH | 10.2 | 20.5 | 37.7 | 43.5 | 49.8 | 75.5 |
| HumanEval | 29.3 | 33.5 | 37.8 | 51.2 | 57.9 | 84.8 |
Qwen2.5-7B matches or outperforms published open-weight LLMs at its scale on general, reasoning, math, and code benchmarks. Instruction tuning (SFT + RL) yields large gains, especially on math (MATH: 49.8→75.5), program synthesis (HumanEval: 57.9→84.8), and instruction-following tasks (Qwen et al., 2024).
Specialized Variant Performance
- Qwen2.5-Math-7B-Instruct: GSM8K (CoT) 95.2%, MATH 83.6%; supports CoT and Tool-Integrated Reasoning (TIR). Score-guided sampling (RM@N) further closes the gap to SOTA (Yang et al., 2024).
- Qwen2.5-Coder-7B: HumanEval pass@1 61.6% (base), MBPP 76.9%, MultiPL-E (8 languages) 57.5%, with strong FIM and long-context code-retrieval performance (Hui et al., 2024).
- Qwen2.5-7B-Instruct-1M: >80% accuracy on passkey retrieval at 1M-token context, while matching the 128K-context model on short-context tasks (Yang et al., 26 Jan 2025).
- Amadeus-Verbo Qwen2.5-7B: Achieves a best STS Pearson correlation of 0.81 and Macro-F1 up to 0.74 on Portuguese tasks after full-parameter base and instruction tuning (Cruz-Castañeda et al., 20 May 2025).
5. Long-Context Scaling and Inference Efficiency
The Qwen2.5-7B architecture supports inference with up to 128K-token contexts by using Dual Chunk Attention (DCA) with YARN temperature scaling (see the configuration sketch after this list), maintaining perplexity and retrieval accuracy across context sizes. The 1M-token extension (Qwen2.5-7B-Instruct-1M) adds long-range pre-training, RoPE base-frequency scaling, progressive length curricula, and multiple memory and kernel optimizations:
- Sparse Attention (MInference): Head-wise dynamic sparsity (e.g., vertical-slash patterns), reducing runtime by up to 10× at 1M-token context (Yang et al., 26 Jan 2025).
- BladeLLM kernels and Dynamic Chunked Pipeline Parallelism (DCPP): Chunked pipeline parallelism, kernel fusion, and memory optimizations deliver >25× acceleration over dense attention.
- Throughput: Community results indicate ~25 tokens/s (FP16) and ~60 tokens/s (4-bit quantized) on a single A100-40GB for the base 7B model (Qwen et al., 2024).
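For long inputs beyond 32K tokens, the Qwen2.5 model cards document adding a YaRN-style `rope_scaling` entry to the model config. A hedged sketch of that pattern with Hugging Face `transformers` follows; exact key names can vary across `transformers` versions, and `Qwen/Qwen2.5-7B-Instruct` is the public checkpoint name.

```python
# Hedged sketch: enabling YaRN-style RoPE scaling for long-context inference,
# following the pattern documented on the Qwen2.5 model cards. Exact config
# key names may differ across transformers versions.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
# Stretch the 32,768-token pre-training window by 4x (~131K tokens).
config.rope_scaling = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    config=config,
    torch_dtype="auto",
    device_map="auto",
)
```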
Quantized weights (4-bit, 8-bit) are supported via GPTQ, AWQ, and QLoRA; the weight footprint of the 4-bit 7B model is roughly 4 GB. A hedged loading sketch follows.
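The sketch below shows an illustrative 4-bit NF4 load via `bitsandbytes`, one common route to the ~4 GB footprint noted above; this is the generic `transformers` API rather than anything Qwen-specific, and the team also publishes prequantized checkpoints (e.g., `Qwen/Qwen2.5-7B-Instruct-AWQ`).

```python
# Illustrative 4-bit NF4 load via bitsandbytes; generic transformers API.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```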
6. Multilinguality, Fine-Tuning, and Open-Source Ecosystem
Qwen2.5-7B natively supports ~30 languages, with model artifacts and deployment resources openly available from HuggingFace, ModelScope, and GitHub. All variants retain efficient inference on commodity GPUs (the 7B model fits in FP16 on 16 GB of VRAM; quantized deployment runs on 8 GB; on-device Brazilian Portuguese (PT-BR) usage is feasible). Fine-tuned and merged derivatives (Amadeus-Verbo for Portuguese, Qwen2.5-Math for advanced math, Coder for code generation) reuse the full Transformer without adapters, maintaining architectural integrity (Qwen et al., 2024, Cruz-Castañeda et al., 20 May 2025).
Instruction tuning is performed as full-parameter SFT, yielding improved accuracy on language-specific benchmarks, classification, and similarity tasks, with moderate gains on hard open-domain tasks. Spherical linear interpolation (SLERP) is used to merge base and instruct variants for greater versatility; a minimal sketch follows (Cruz-Castañeda et al., 20 May 2025).
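Here is a minimal SLERP sketch over two weight tensors, assuming the common flatten-and-interpolate convention used by merge tooling; it illustrates the operation, not the authors' exact merging code.

```python
# Minimal SLERP over two weight tensors; illustration of the operation only.
import torch

def slerp(w_a: torch.Tensor, w_b: torch.Tensor,
          t: float = 0.5) -> torch.Tensor:
    a, b = w_a.flatten().float(), w_b.flatten().float()
    cos_omega = torch.clamp(torch.dot(a, b) / (a.norm() * b.norm()),
                            -1.0, 1.0)
    omega = torch.arccos(cos_omega)  # angle between the two weight vectors
    if omega.abs() < 1e-6:           # nearly parallel: fall back to LERP
        merged = (1 - t) * a + t * b
    else:
        merged = (torch.sin((1 - t) * omega) * a
                  + torch.sin(t * omega) * b) / torch.sin(omega)
    return merged.view_as(w_a).to(w_a.dtype)
```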
7. Variant-Specific Enhancement and Limitations
Qwen2.5-7B contains no MoE layers at 7B scale, reflecting a trend toward maximal inference predictability and efficient quantization. Advanced post-training techniques (multi-stage RL, control-token expansion, structured verification) further enhance robustness in downstream tasks. The math and coding variants (Qwen2.5-Math-7B, Qwen2.5-Coder-7B) demonstrate that domain-adaptive pipelines can induce strong task-specific capabilities with no architectural modification.
Limitations remain in further context scaling, beyond 1M tokens, where intricate kernel/sparse scheduling and staged pre-training are required. For the Portuguese Amadeus-Verbo model, full-parameter tuning is computationally intensive (219 GPU-hours plus model merging), and some domain gaps remain due to dataset limitations. No Qwen2.5-7B derivative natively includes retrieval augmentation or external memory.
Qwen2.5-7B provides a high-performance, resource-efficient foundation for instruction following, multilingual use, coding, mathematical reasoning, and long-context NLP, with extensive support for quantization, fine-tuning, and cross-lingual adaptation. Empirically, it surpasses all previous Qwen1.5 and Qwen2 7B models and matches or exceeds Mistral-7B, Llama-3-8B, and Gemma2-9B across standard academic benchmarks (Qwen et al., 2024, Yang et al., 2024, Hui et al., 2024, Yang et al., 26 Jan 2025, Cruz-Castañeda et al., 20 May 2025).