Qwen2.5-72B Transformer Model
- Qwen2.5-72B is a state-of-the-art dense transformer language model with 72 billion parameters and balanced multilingual training, and it serves as the backbone for advanced multimodal variants.
- It employs innovative architecture features such as Grouped-Query Attention and SwiGLU activations, enabling efficient handling of long-context tasks.
- Pre-trained on 18 trillion tokens with supervised fine-tuning and reinforcement learning, it achieves competitive benchmarks across diverse domains.
Qwen2.5-72B is a state-of-the-art dense transformer LLM comprising 72 billion parameters, developed by Alibaba Cloud’s DAMO Academy as the flagship open-weight model of the Qwen2.5 series. It serves as the core for numerous domain-specialized and multimodal variants and is engineered to deliver competitive performance across natural language understanding, reasoning, mathematical problem solving, code generation, cross-lingual transfer, and large-context tasks. Qwen2.5-72B distinguishes itself by its scale, balanced multilingual and multimodal training corpora, advanced post-training strategies, and documented efficacy in both general and domain-specific benchmarks (Qwen et al., 2024, Yang et al., 2024, Kazoom et al., 7 Apr 2025, Xu et al., 26 May 2025, Cruz-Castañeda et al., 20 May 2025).
1. Model Architecture and Key Components
Qwen2.5-72B is a dense transformer decoder-only LLM, with no mixture-of-experts in the open-weight release. The fundamental architectural features and their precise configuration are as follows (Qwen et al., 2024, Cruz-Castañeda et al., 20 May 2025):
- Parameters and Depth: 72 billion parameters, 80 transformer decoder layers (base LM); 64 layers and 12,800 hidden size (math variant).
- Attention Mechanisms: Implements Grouped-Query Attention (GQA) with 64 query heads and 8 key/value heads for efficient KV caching, supporting rapid inference in long-context scenarios.
- Feed-Forward Networks: Employs the SwiGLU (Swish-Gated Linear Unit) activation, with intermediate FFN dimensions several times the hidden size (roughly 3.6× for the 72B configuration).
- Normalization and Positioning: Pre-layer normalization with RMSNorm, rotary positional embeddings (RoPE), and QKV attention bias, supporting context lengths up to 128K tokens and facilitating extended-sequence tasks.
- Multimodal Extension: For vision–language tasks, an aligned vision encoder, cross-modal adapters, learnable projections, and multimodal positional embeddings are added (Qwen2.5-VL-72B) (Kazoom et al., 7 Apr 2025, Xu et al., 26 May 2025).
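The practical effect of the GQA configuration above can be illustrated with a back-of-the-envelope KV-cache estimate. This is a sketch, not official sizing: the head dimension of 128 (64 heads over a typical 8,192 hidden size) and fp16 cache storage are assumptions.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of the key/value cache: 2 tensors (K and V) per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed Qwen2.5-72B-style configuration: 80 layers, 64 query heads,
# 8 KV heads, head_dim = 128, fp16 cache, 128K-token context.
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=128_000)
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=128_000)

print(f"Full multi-head cache @128K ctx: {mha / 2**30:.1f} GiB")
print(f"GQA cache (8 KV heads) @128K ctx: {gqa / 2**30:.1f} GiB")
print(f"Reduction factor: {mha // gqa}x")  # 64 query heads / 8 KV heads = 8x
```

The 8× cache reduction (the ratio of query heads to KV heads) is what makes 128K-token inference tractable on practical GPU memory budgets.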
The parameter composition approximately adheres to

$$N_{\text{params}} \;\approx\; L \left( 4\, d_{\text{model}}^{2} + 3\, d_{\text{model}}\, d_{\text{ff}} \right),$$

where $d_{\text{model}}$ is the hidden dimension, $d_{\text{ff}}$ the feed-forward size, and $L$ the number of layers (embedding matrices and the GQA reduction of the K/V projections are omitted from this estimate).
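This composition can be checked numerically. The sketch below refines the estimate with GQA-reduced K/V projections and untied embeddings; the 8,192 hidden size, 29,568 FFN size, 8 KV heads, and ~152K vocabulary are assumed configuration values for the 72B model.

```python
def param_estimate(n_layers, d_model, d_ff, n_kv_heads, head_dim, vocab):
    """Rough decoder-only parameter count with GQA and a SwiGLU FFN."""
    d_kv = n_kv_heads * head_dim
    attn = 2 * d_model * d_model + 2 * d_model * d_kv  # Q/O full, K/V grouped
    ffn = 3 * d_model * d_ff                           # gate, up, down projections
    embed = 2 * vocab * d_model                        # untied input/output embeddings
    return n_layers * (attn + ffn) + embed

# Assumed Qwen2.5-72B-style configuration.
n = param_estimate(n_layers=80, d_model=8192, d_ff=29568,
                   n_kv_heads=8, head_dim=128, vocab=152_064)
print(f"~{n / 1e9:.1f}B parameters")  # lands near the nominal 72B figure
```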
2. Pre-training and Supervised Fine-Tuning
Qwen2.5-72B is pre-trained on 18 trillion tokens, with balanced representation across high-value domains (Qwen et al., 2024):
- Corpus Composition: ~60% curated web text, ~15% programming/code corpora, ~10% math/scientific data, ~5% synthetic chain-of-thought reasoning for math and code, and up-sampled academic/technical resources. E-commerce and social media text are intentionally down-sampled.
- Pre-processing and Filtering: A multi-stage filtering pipeline leverages Qwen2 Instruct model-based scoring, deduplication, and synthetic data from prior Qwen2.5 variants and reward models.
- Objective: Autoregressive next-token prediction, minimizing $\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\!\left(x_t \mid x_{<t}\right)$ over the training corpus.
- Context Curriculum: Training proceeds in two phases: first with a context length of 4K, then extending up to 32K tokens (and 128K for 72B) via ABF RoPE scaling.
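The ABF (adjusted base frequency) step of the context curriculum can be sketched as raising the RoPE base so that rotary phases wrap more slowly. This is an illustration: the base values of 10,000 (pre-extension) and 1,000,000 (post-extension) are the commonly reported settings, treated here as assumptions.

```python
import math

def rope_inv_freqs(head_dim, base):
    """Per-dimension-pair inverse frequencies for rotary embeddings."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

short_ctx = rope_inv_freqs(head_dim=128, base=10_000)    # pre-extension base
long_ctx = rope_inv_freqs(head_dim=128, base=1_000_000)  # ABF-raised base

# A larger base lowers the rotation frequency of the slowest dimension pair,
# so positional phase stays unambiguous over far longer sequences.
wavelength_short = 2 * math.pi / short_ctx[-1]
wavelength_long = 2 * math.pi / long_ctx[-1]
print(f"Longest wavelength: {wavelength_short:,.0f} -> {wavelength_long:,.0f} positions")
```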
After pre-training, Qwen2.5-72B undergoes:
- Supervised Fine-Tuning (SFT): ~1 million high-quality, multi-domain instruction–response pairs. Categories include long-sequence generation, mathematics (with CoT), multi-language code, structured data understanding, logical reasoning, cross-lingual transfer, and robust prompting (Qwen et al., 2024, Cruz-Castañeda et al., 20 May 2025).
- Optimization:
- AdamW optimizer, weight decay 0.1, gradient clip 1.0.
- 2 epochs with a decaying learning-rate schedule.
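The optimizer settings above can be sketched as a single decoupled-weight-decay update with gradient clipping. This is a minimal scalar illustration of the AdamW rule, not the training code; the toy gradient and learning rate are illustrative only.

```python
import math

def adamw_step(w, g, m, v, t, lr=1e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.1, clip=1.0):
    """One AdamW update on a scalar parameter, with gradient clipping."""
    g = max(-clip, min(clip, g))         # clip the gradient magnitude to 1.0
    m = beta1 * m + (1 - beta1) * g      # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g  # second-moment EMA
    m_hat = m / (1 - beta1 ** t)         # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Weight decay is applied directly to w (decoupled), not via the gradient.
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

w, m, v = 0.5, 0.0, 0.0
for t in range(1, 4):                    # a few toy steps on a constant gradient
    w, m, v = adamw_step(w, g=2.0, m=m, v=v, t=t)
print(w)  # shrinks via both the gradient step and the decoupled decay term
```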
3. Reinforcement Learning and Self-Improvement
Post-training incorporates advanced preference optimization to align outputs with human preferences and enhance complex reasoning (Qwen et al., 2024, Yang et al., 2024):
- Direct Preference Optimization (DPO): Offline RL using ~150K positive/negative response pairs, sourced from SFT model outputs and augmented with execution feedback; trained in a single pass.
- Group Relative Policy Optimization (GRPO): Online RL with reward models trained on human and automated preference data, optimizing a clipped importance-ratio objective resembling PPO, with the advantage of each sampled response $o_i$ given by group-relative reward normalization: $\hat{A}_i = \left(r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})\right) / \operatorname{std}(\{r_j\}_{j=1}^{G})$.
- Mathematical Variant (Qwen2.5-Math-72B):
- Iterative reward model—SFT—RLHF pipeline, integrating chain-of-thought data and tool-integrated reasoning, with reward shaping and best-of-N inference reranking for state-of-the-art math problem solving (Yang et al., 2024).
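The DPO stage described above can be sketched as a pairwise logistic loss on chosen vs. rejected responses under a frozen reference model. The log-probabilities and the $\beta$ value below are toy inputs, not Qwen training data.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Toy sequence log-probs: the policy already prefers the chosen response
# slightly more than the frozen reference model does, so the loss dips
# below the indifference value log(2).
loss = dpo_loss(logp_chosen=-12.0, logp_rejected=-15.0,
                ref_chosen=-13.0, ref_rejected=-14.0, beta=0.1)
print(f"{loss:.4f}")
```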
4. Multilingual, Multimodal, and Domain-Specific Specialization
Qwen2.5-72B is evaluated both as a generalist LM and as the backbone for specialized offshoots:
- Multilingual QA: In history MCQ tasks spanning Lithuanian, Baltic, Nordic, and other languages, Qwen2.5-72B achieves 0.81 (Nordic), 0.77 (Baltic), and 0.82 (Multilingual) merged accuracies. It excels in general history (G ∼ 0.87) but lags in underrepresented Baltic-specific (LT-related) content (LT ∼ 0.71), revealing English- and web-centric corpus bias. Fine-tuning with Baltic-centric data is required to close this ∼10–15 percentage point gap (Kostiuk et al., 15 Jan 2025).
- Portuguese Fine-Tuning (Amadeus-Verbo): SFT on 79K Brazilian Portuguese instructions enhances entailment (RTE, NLI), semantic similarity (STS), and toxicity/hate detection, with relative improvements most pronounced at the 72B scale—though latency and resource demands constrain deployment (Cruz-Castañeda et al., 20 May 2025).
- Mathematical Reasoning: Qwen2.5-Math-72B-Instruct achieves 95.9 (GSM8K, CoT), 89.8 (MATH, RM@8), and 76.9 (GaoKao’23 En, RM@8), outperforming GPT-4o by 6.2 points on MATH, and by 17.5 points on Chinese math. RLHF-guided self-improvement and best-of-N reranking are crucial to this performance (Yang et al., 2024).
5. Multimodal and Retrieval-Augmented Model Capabilities
Extensions of Qwen2.5-72B into multimodal domains include Qwen2.5-VL-72B:
- Vision-LLM (VLM): Incorporates a vision encoder, cross-modal adapters, and learnable projection for fused image–text reasoning in the shared transformer stack (Kazoom et al., 7 Apr 2025, Xu et al., 26 May 2025).
- Ophthalmic Visual QA (OphthalWeChat benchmark): Qwen2.5-VL-72B-Instruct attains 0.666 (Binary_CN), 0.660 (Binary_EN), 0.617 (Single-choice_CN), and 0.597 (Single-choice_EN) accuracy, trailing Gemini 2.0 Flash and GPT-4o on holistic metrics and BLEU/BERTScore for open-ended questions. It remains competitive on Chinese closed-ended tasks but lacks sufficient domain-specific tuning for clinical-grade open-ended VQA (Xu et al., 26 May 2025).
- Adversarial Patch Detection (VRAG framework): Without retraining, Qwen2.5-VL-72B achieves 94.2% accuracy (4-shot, 65×65 patch) on APRICOT and up to 91.5% on larger ImageNet-Patch variants. Cosine-similarity retrieval of in-context exemplars and comprehensive prompt engineering (including both patch and attacked-image exemplars) are essential to robust detection, routinely outperforming smaller models, with a ∼10–15 point gain over Qwen-VL-Plus (Kazoom et al., 7 Apr 2025).
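The retrieval step of such a pipeline can be sketched as nearest-neighbor selection of few-shot exemplars by cosine similarity. This is an illustration with toy 3-d vectors standing in for image embeddings; the actual embedding model and exemplar bank of the VRAG framework are not reproduced here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k_exemplars(query, bank, k=4):
    """Pick the k most similar stored exemplars for few-shot prompting."""
    ranked = sorted(bank, key=lambda item: cosine(query, item["emb"]), reverse=True)
    return [item["label"] for item in ranked[:k]]

# Toy exemplar bank: patch-bearing images should rank above clean ones
# for a query embedding that resembles a patched image.
bank = [
    {"label": "patch_65x65", "emb": [0.9, 0.1, 0.0]},
    {"label": "clean_street", "emb": [0.0, 1.0, 0.1]},
    {"label": "patched_sign", "emb": [0.8, 0.2, 0.1]},
]
print(top_k_exemplars([1.0, 0.1, 0.0], bank, k=2))
```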
Table: Qwen2.5-72B Performance Across Selected Domains
| Domain/Task | Benchmark/Language | Qwen2.5-72B Score | Notable Comparator |
|---|---|---|---|
| Historical MCQ (Nordic) | LT+G Merged | 0.81 | GPT-4o: 0.88 |
| Portuguese NLI | assin2_rte (macro-F1) | 0.95 | Amadeus Verbo report |
| General Math | GSM8K (CoT, Eng) | 95.9 | GPT-4o: 92.9 |
| Multimodal Adversarial | APRICOT (4-shot, 65×65) | 94.2% | UI-TARS-72B: 96% |
| Ophthalmic VQA | Binary_CN | 0.666 | Gemini 2.0: 0.687 |
6. Cost-Efficiency, Engineering Considerations, and Availability
- Long-Context Optimization: ABF-scaled RoPE and careful memory-efficient attention enable up to 128K context handling in the base model.
- Hardware and Throughput: Real-time throughput at 72B scale requires at least 8× H100 or equivalent GPUs; inference latency at batch_size=1 is typically hundreds of milliseconds.
- Instruction Merging: Methods such as SLERP-based parameter interpolation (e.g., in Amadeus-Verbo) can further refine language-specific performance with post-hoc integration of base and instruct-tuned layers (Cruz-Castañeda et al., 20 May 2025).
- Model and API Access: Open-weight models and language-specific derivatives (Amadeus-Verbo, Qwen2.5-Math) are available on HuggingFace and through Alibaba Cloud Model Studio.
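The SLERP-based merging mentioned above can be sketched as spherical interpolation between two weight vectors. This is a minimal illustration on toy 2-d vectors; real merges such as Amadeus-Verbo operate per-tensor across full base and instruct-tuned checkpoints.

```python
import math

def slerp(a, b, t):
    """Spherical linear interpolation between two weight vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    dot = sum(x * y for x, y in zip(a, b)) / (na * nb)
    dot = max(-1.0, min(1.0, dot))        # guard the acos domain
    theta = math.acos(dot)                # angle between the two vectors
    if theta < 1e-6:                      # near-parallel: fall back to lerp
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    s = math.sin(theta)
    w1 = math.sin((1 - t) * theta) / s
    w2 = math.sin(t * theta) / s
    return [w1 * x + w2 * y for x, y in zip(a, b)]

base = [1.0, 0.0]      # stand-in for a base-model weight tensor
instruct = [0.0, 1.0]  # stand-in for the instruct-tuned counterpart
merged = slerp(base, instruct, t=0.5)
print(merged)  # midpoint on the arc: [0.7071..., 0.7071...]
```

Unlike plain averaging, SLERP preserves the norm of the interpolated weights along the arc, which is the usual motivation for using it in checkpoint merging.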
7. Limitations and Research Outlook
- Multilingual Gaps: Underexplored language domains (e.g., Baltic languages, expert clinical image QA) expose a reliance on web-dominant content in pre-training. Targeted corpus augmentation, retrieval-augmented generation, or prompt-based adaptation are required for parity in these settings (Kostiuk et al., 15 Jan 2025, Xu et al., 26 May 2025).
- Domain-Specific Alignment: For medical and technical domains, instruction-tuning on high-fidelity, annotated corpora is essential to bridge factual gaps and enhance open-ended reasoning (Xu et al., 26 May 2025).
- Scaling Laws and Efficiency: Proprietary MoE variants (Qwen2.5-Turbo, Qwen2.5-Plus) offer the promise of further cost-performance gains relative to dense baselines (Qwen et al., 2024). However, open-weight 72B models remain the principal avenue for wide reproducibility and academic investigation.
Qwen2.5-72B stands as a rigorously engineered, high-capacity transformer, defining the current state of the art among open-weight foundation models and offering a referential backbone for both generalist and highly specialized NLP and VLM research (Qwen et al., 2024, Kostiuk et al., 15 Jan 2025, Kazoom et al., 7 Apr 2025, Xu et al., 26 May 2025, Cruz-Castañeda et al., 20 May 2025, Yang et al., 2024).