
Qwen2.5-72B Transformer Model

Updated 4 February 2026
  • Qwen2.5-72B is a state-of-the-art dense transformer language model with 72 billion parameters, balanced multilingual training, and advanced multimodal capabilities.
  • It employs innovative architecture features such as Grouped-Query Attention and SwiGLU activations, enabling efficient handling of long-context tasks.
  • Pre-trained on 18 trillion tokens with supervised fine-tuning and reinforcement learning, it achieves competitive benchmarks across diverse domains.

Qwen2.5-72B is a state-of-the-art dense transformer LLM comprising 72 billion parameters, developed by Alibaba Cloud as the flagship open-weight model of the Qwen2.5 series. It serves as the core for numerous domain-specialized and multimodal variants and is engineered to deliver competitive performance across natural language understanding, reasoning, mathematical problem solving, code generation, cross-lingual transfer, and large-context tasks. Qwen2.5-72B distinguishes itself by its scale, balanced multilingual and multimodal training corpora, advanced post-training strategies, and documented efficacy in both general and domain-specific benchmarks (Qwen et al., 2024, Yang et al., 2024, Kazoom et al., 7 Apr 2025, Xu et al., 26 May 2025, Cruz-Castañeda et al., 20 May 2025).

1. Model Architecture and Key Components

Qwen2.5-72B is a dense transformer decoder-only LLM, with no mixture-of-experts in the open-weight release. The fundamental architectural features and their precise configuration are as follows (Qwen et al., 2024, Cruz-Castañeda et al., 20 May 2025):

  • Parameters and Depth: 72 billion parameters, 80 transformer decoder layers (base LM); 64 layers and 12,800 hidden size (math variant).
  • Attention Mechanisms: Implements Grouped-Query Attention (GQA) with 64 query heads and 8 key/value heads for efficient KV caching, supporting rapid inference in long-context scenarios.
  • Feed-Forward Networks: Employs SwiGLU (Swish-Gated Linear Unit) activations, with intermediate FFN dimensions set to roughly four times the hidden size.
  • Normalization and Positioning: Pre-layer normalization (RMSNorm) and rotary positional embeddings (RoPE), with bias terms on the QKV projections, supporting context lengths up to 128K tokens and facilitating extended sequence tasks.
  • Multimodal Extension: For vision–language tasks, an aligned vision encoder, cross-modal adapters, learnable projections, and multimodal positional embeddings are added (Qwen2.5-VL-72B) (Kazoom et al., 7 Apr 2025, Xu et al., 26 May 2025).
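The practical payoff of GQA is a much smaller KV cache during inference. A minimal sketch of the arithmetic, assuming the publicly reported configuration (80 layers, head dimension 128) and an fp16 cache; the dtype and the comparison baseline are illustrative assumptions, not official implementation details:

```python
# Per-token KV-cache size: two tensors (K and V) per layer,
# each of shape [kv_heads, head_dim], stored in fp16 (2 bytes).
def kv_cache_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Assumed Qwen2.5-72B-like configuration: 80 layers, head_dim 128.
gqa = kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
mha = kv_cache_bytes_per_token(layers=80, kv_heads=64, head_dim=128)

print(f"GQA cache/token: {gqa / 1024:.0f} KiB")  # 8 KV heads
print(f"MHA cache/token: {mha / 1024:.0f} KiB")  # 64 heads, for comparison
print(f"reduction: {mha // gqa}x")               # 8x smaller with GQA
```

At a 128K-token context this 8x reduction is what makes long-context serving tractable on a single node.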

The parameter composition adheres to:

P = \sum_{\ell=1}^{L}\bigl[4H^2 + 2HF + O(H)\bigr] \approx 72 \times 10^9

where H is the hidden dimension, F the feed-forward size, and L the number of layers.
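The totals can be sanity-checked in a few lines. The configuration values below (hidden size 8192, FFN size 29568, 80 layers, 8 KV heads, ~152K vocabulary) are taken from the published model card; the breakdown itself is a rough estimate that ignores norms and biases, and it uses three SwiGLU matrices (gate, up, down) rather than the generic 2HF term above:

```python
H, F, L = 8192, 29568, 80            # hidden size, FFN size, layers
kv_heads, head_dim, vocab = 8, 128, 152064

# Attention: Q and O projections are H x H; with GQA, K and V
# shrink to H x (kv_heads * head_dim).
attn = 2 * H * H + 2 * H * kv_heads * head_dim
# SwiGLU FFN uses three projections (gate, up, down), each H x F.
ffn = 3 * H * F
# Input embedding plus output head (assumed untied at this scale).
embed = 2 * vocab * H

total = L * (attn + ffn) + embed
print(f"~{total / 1e9:.1f}B parameters")  # lands near the advertised 72B
```

The estimate comes out within a few percent of 72B, which is as close as a norm-and-bias-free tally can get.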

2. Pre-training and Supervised Fine-Tuning

Qwen2.5-72B is pre-trained on 18 trillion tokens, with balanced representation across high-value domains (Qwen et al., 2024):

  • Corpus Composition: ~60% curated web text, ~15% programming/code corpora, ~10% math/scientific data, ~5% synthetic chain-of-thought reasoning for math and code, and up-sampled academic/technical resources. E-commerce and social media text are intentionally down-sampled.
  • Pre-processing and Filtering: A multi-stage filtering pipeline leverages Qwen2 Instruct model-based scoring, deduplication, and synthetic data from prior Qwen2.5 variants and reward models.
  • Objective: Autoregressive next-token prediction:

L_{\mathrm{LM}} = -\sum_{t=1}^{T} \log p(x_t \mid x_{<t})

  • Context Curriculum: Training proceeds in two phases: first with a context length of 4K, then extending up to 32K tokens (and 128K for 72B) via ABF RoPE scaling.
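ABF (Adjusted Base Frequency) extends the usable context simply by raising the RoPE base, which slows the rotation of every frequency band so that distant positions remain distinguishable. A toy illustration; the base values are typical choices in the literature, not confirmed internals of the Qwen2.5 training run:

```python
import math

# RoPE assigns each head-dimension pair a rotation frequency
# base ** (-2i / d); raising the base stretches every wavelength.
def rope_wavelengths(base, dim=128):
    return [2 * math.pi * base ** (2 * i / dim) for i in range(0, dim, 2)]

short = rope_wavelengths(base=10_000)     # common default base
long_ = rope_wavelengths(base=1_000_000)  # ABF-style raised base (illustrative)

# The slowest-rotating pair now covers a far longer span of positions,
# while the fastest pair (i = 0) is unchanged.
print(f"max wavelength @1e4 base: {short[-1]:.2e} positions")
print(f"max wavelength @1e6 base: {long_[-1]:.2e} positions")
```

Because only the base changes, the modification is compatible with weights pre-trained at shorter contexts, which is what makes the two-phase 4K-then-32K/128K curriculum cheap.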

After pre-training, Qwen2.5-72B undergoes:

  • Supervised Fine-Tuning (SFT): ~1 million high-quality, multi-domain instruction–response pairs. Categories include long-sequence generation, mathematics (with CoT), multi-language code, structured data understanding, logical reasoning, cross-lingual transfer, and robust prompting (Qwen et al., 2024, Cruz-Castañeda et al., 20 May 2025).
  • Optimization:
    • AdamW optimizer, weight decay 0.1, gradient clip 1.0.
    • 2 epochs, sequence length 32,768, learning rate decayed from 7e-6 to 7e-7.
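The reported 7e-6 → 7e-7 decay is easy to reproduce as a schedule; whether the official run used linear or cosine decay is not stated, so the cosine shape below is an assumption:

```python
import math

def cosine_lr(step, total_steps, lr_max=7e-6, lr_min=7e-7):
    """Cosine decay from lr_max to lr_min over total_steps."""
    t = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

total_steps = 10_000  # placeholder; the real step count is not published
print(cosine_lr(0, total_steps))            # 7e-06 at the start
print(cosine_lr(total_steps, total_steps))  # 7e-07 at the end
```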

3. Reinforcement Learning and Self-Improvement

Post-training incorporates advanced preference optimization to align outputs with human preferences and enhance complex reasoning (Qwen et al., 2024, Yang et al., 2024):

L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\bigl[\min\bigl(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\bigr)\bigr]

  • Mathematical Variant (Qwen2.5-Math-72B):
    • Iterative reward-model, SFT, and RLHF pipeline, integrating chain-of-thought data and tool-integrated reasoning, with reward shaping and best-of-N inference reranking for state-of-the-art math problem solving (Yang et al., 2024).
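The clipped surrogate objective above translates directly into code. This sketch operates on plain Python lists of per-token probability ratios and advantage estimates; it illustrates the clipping behavior only and is not the training implementation:

```python
def clipped_surrogate(ratios, advantages, eps=0.2):
    """PPO-style clipped objective: mean over samples of
    min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    terms = []
    for r, a in zip(ratios, advantages):
        r_clipped = max(1 - eps, min(1 + eps, r))
        terms.append(min(r * a, r_clipped * a))
    return sum(terms) / len(terms)

# When the policy ratio drifts past 1 + eps on a positive-advantage
# token, the clip caps the incentive rather than letting it grow.
print(clipped_surrogate([1.5], [2.0]))  # clipped: 1.2 * 2.0 = 2.4, not 3.0
```

The `min` with the clipped term is what keeps policy updates conservative: large ratio excursions stop contributing gradient once they leave the trust region.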

4. Multilingual, Multimodal, and Domain-Specific Specialization

Qwen2.5-72B is evaluated both as a generalist LM and as the backbone for specialized offshoots:

  • Multilingual QA: In history MCQ tasks spanning Lithuanian, Baltic, Nordic, and other languages, Qwen2.5-72B achieves 0.81 (Nordic), 0.77 (Baltic), and 0.82 (Multilingual) merged accuracies. It excels in general history (G ∼ 0.87) but lags in underrepresented Baltic-specific (LT-related) content (LT ∼ 0.71), revealing English- and web-centric corpus bias. Fine-tuning with Baltic-centric data is required to close this ∼10–15 percentage point gap (Kostiuk et al., 15 Jan 2025).
  • Portuguese Fine-Tuning (Amadeus-Verbo): SFT on 79K Brazilian Portuguese instructions enhances entailment (RTE, NLI), semantic similarity (STS), and toxicity/hate detection, with relative improvements most pronounced at the 72B scale—though latency and resource demands constrain deployment (Cruz-Castañeda et al., 20 May 2025).
  • Mathematical Reasoning: Qwen2.5-Math-72B-Instruct achieves 95.9 (GSM8K, CoT), 89.8 (MATH, RM@8), and 76.9 (GaoKao’23 En, RM@8), outperforming GPT-4o by 6.2 points on MATH, and by 17.5 points on Chinese math. RLHF-guided self-improvement and best-of-N reranking are crucial to this performance (Yang et al., 2024).

5. Multimodal and Retrieval-Augmented Model Capabilities

Extensions of Qwen2.5-72B into multimodal domains include Qwen2.5-VL-72B:

  • Vision-LLM (VLM): Incorporates a vision encoder, cross-modal adapters, and learnable projection for fused image–text reasoning in the shared transformer stack (Kazoom et al., 7 Apr 2025, Xu et al., 26 May 2025).
  • Ophthalmic Visual QA (OphthalWeChat benchmark): Qwen2.5-VL-72B-Instruct attains 0.666 (Binary_CN), 0.660 (Binary_EN), 0.617 (Single-choice_CN), and 0.597 (Single-choice_EN) accuracy, trailing Gemini 2.0 Flash and GPT-4o on holistic metrics and BLEU/BERTScore for open-ended questions. It remains competitive on Chinese closed-ended tasks but lacks sufficient domain-specific tuning for clinical-grade open-ended VQA (Xu et al., 26 May 2025).
  • Adversarial Patch Detection (VRAG framework): Without retraining, Qwen2.5-VL-72B achieves 94.2% accuracy (4-shot, 65×65 patch) on APRICOT and up to 91.5% on larger ImageNet-Patch variants. Cosine-similarity retrieval (τ ≈ 0.77) and comprehensive prompt engineering (including both patch and attacked-image exemplars) are essential to achieving robust detection, routinely outperforming smaller models, with a ∼10–15 point gain over Qwen-VL-Plus (Kazoom et al., 7 Apr 2025).
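The retrieval step in a VRAG-style pipeline reduces to embedding the query image and keeping exemplars whose cosine similarity clears the threshold. A minimal sketch with made-up 3-d embeddings standing in for real image features; the actual encoder and the τ ≈ 0.77 threshold come from the cited paper, everything else here is illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query, exemplars, tau=0.77, k=4):
    """Return up to k exemplar names whose similarity to query >= tau."""
    scored = [(cosine(query, emb), name) for name, emb in exemplars]
    scored = [s for s in scored if s[0] >= tau]
    return [name for _, name in sorted(scored, reverse=True)[:k]]

# Toy exemplar store: two patch examples and one clean image.
exemplars = [("patch_a", [1.0, 0.1, 0.0]),
             ("patch_b", [0.0, 1.0, 0.0]),
             ("clean_c", [0.9, 0.2, 0.1])]
result = retrieve([1.0, 0.2, 0.05], exemplars)
print(result)  # only the two near-duplicate embeddings pass the threshold
```

The retrieved exemplars are then placed into the few-shot prompt, which is where the 4-shot accuracy figures above come from.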

Table: Qwen2.5-72B Performance Across Selected Domains

| Domain/Task | Benchmark/Language | Qwen2.5-72B Score | Notable Comparator |
|---|---|---|---|
| Historical MCQ (Nordic) | LT+G Merged | 0.81 | GPT-4o: 0.88 |
| Portuguese NLI | assin2_rte (macro-F1) | 0.95 | Amadeus-Verbo report |
| General Math | GSM8K (CoT, Eng) | 95.9 | GPT-4o: 92.9 |
| Multimodal Adversarial | APRICOT (4-shot, 65×65) | 94.2% | UI-TARS-72B: 96% |
| Ophthalmic VQA | Binary_CN | 0.666 | Gemini 2.0: 0.687 |

6. Cost-Efficiency, Engineering Considerations, and Availability

  • Long-Context Optimization: ABF-scaled RoPE and careful memory-efficient attention enable up to 128K context handling in the base model.
  • Hardware and Throughput: Real-time throughput at 72B scale requires at least 8× H100 or equivalent GPUs; inference latency at batch_size=1 is typically hundreds of milliseconds.
  • Instruction Merging: Methods such as SLERP-based parameter interpolation (e.g., in Amadeus-Verbo) can further refine language-specific performance with post-hoc integration of base and instruct-tuned layers (Cruz-Castañeda et al., 20 May 2025).
  • Model and API Access: Open-weight models and language-specific derivatives (Amadeus-Verbo, Qwen2.5-Math) are available on HuggingFace and through Alibaba Cloud Model Studio.
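SLERP-based merging operates per-tensor on the two checkpoints' weights. A minimal sketch for a single weight vector; the per-layer interpolation schedule used in Amadeus-Verbo is not reproduced here, and the fallback tolerance is an arbitrary choice:

```python
import math

def slerp(w0, w1, t):
    """Spherical linear interpolation between two weight vectors.
    Falls back to linear interpolation when the vectors are near-parallel."""
    n0 = math.sqrt(sum(x * x for x in w0))
    n1 = math.sqrt(sum(x * x for x in w1))
    cos_omega = sum(a * b for a, b in zip(w0, w1)) / (n0 * n1)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard acos domain
    omega = math.acos(cos_omega)
    if omega < 1e-6:  # nearly parallel: plain lerp is numerically safer
        return [(1 - t) * a + t * b for a, b in zip(w0, w1)]
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(w0, w1)]

base     = [1.0, 0.0]   # stand-in for a base-model tensor
instruct = [0.0, 1.0]   # stand-in for the instruct-tuned tensor
mid = slerp(base, instruct, 0.5)
print(mid)  # midpoint on the arc between the two checkpoints
```

Unlike naive averaging, SLERP preserves the norm geometry of the interpolated weights, which is the usual motivation for choosing it in checkpoint merging.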

7. Limitations and Research Outlook

  • Multilingual Gaps: Underexplored language domains (e.g., Baltic languages, expert clinical image QA) expose a reliance on web-dominant content in pre-training. Targeted corpus augmentation, retrieval-augmented generation, or prompt-based adaptation are required for parity in these settings (Kostiuk et al., 15 Jan 2025, Xu et al., 26 May 2025).
  • Domain-Specific Alignment: For medical and technical domains, instruction-tuning on high-fidelity, annotated corpora is essential to bridge factual gaps and enhance open-ended reasoning (Xu et al., 26 May 2025).
  • Scaling Laws and Efficiency: Proprietary MoE variants (Qwen2.5-Turbo, Qwen2.5-Plus) offer the promise of further cost-performance gains relative to dense baselines (Qwen et al., 2024). However, open-weight 72B models remain the principal avenue for wide reproducibility and academic investigation.

Qwen2.5-72B stands as a rigorously engineered, high-capacity transformer, defining the current state of the art among open-weight foundation models and offering a referential backbone for both generalist and highly specialized NLP and VLM research (Qwen et al., 2024, Kostiuk et al., 15 Jan 2025, Kazoom et al., 7 Apr 2025, Xu et al., 26 May 2025, Cruz-Castañeda et al., 20 May 2025, Yang et al., 2024).
