
Qwen2.5-3B: Dense LLM by Alibaba

Updated 20 February 2026
  • Qwen2.5-3B is a dense, autoregressive language model with 3B parameters, designed for multilingual tasks and code generation.
  • It employs architectural refinements like Grouped Query Attention, SwiGLU activations, and rotary embeddings for enhanced efficiency and long context handling.
  • Pretraining on a multi-trillion token corpus combined with supervised fine-tuning and RL alignment methods yields strong reasoning and benchmark performance.

Qwen2.5-3B is a dense, autoregressive LLM in the Qwen2.5 family developed by Alibaba Group, characterized by approximately 3 billion parameters. Serving as both a general-purpose pretrained model and as the foundation for specialized variants (including Qwen2.5-Coder and instruction-tuned versions), Qwen2.5-3B balances strong language understanding and reasoning capabilities with efficient deployment and fine-tuning properties. Its open-source weights have enabled a diverse array of downstream applications, particularly in multilingual, code generation, and resource-constrained environments (Qwen et al., 2024, Hui et al., 2024, Cruz-Castañeda et al., 20 May 2025, Gupta, 22 Feb 2025, Ashraf et al., 12 Sep 2025).

1. Architectural Configuration

Qwen2.5-3B is a dense GPT-style transformer decoder, employing architectural refinements introduced in the Qwen2.5 technical report (Qwen et al., 2024). The standard configuration features:

  • Layers (L): 36 decoder blocks
  • Hidden dimension (d): 2,304 (Qwen2.5 base/inst), 2,048 for code-specific (“Coder”) variant, 4,096 in select language specializations
  • Feed-forward dimension (d_ff): 9,216 (FFN; 4×d), or 4,864 (Coder), or 16,384 (Amadeus-Verbo/PT)
  • Attention heads: 16 query and 2 KV heads (GQA) (base, coder); 16–32 heads (variant-dependent)
  • Parameter count: ~2.98B, split between decoder (~2.8B) and embedding layers (~0.15B)
  • Context window: 4,096–32,768 tokens (pretraining), up to 8,192 tokens (typical post-training generation cap), with extrapolation up to 131,072 tokens in Qwen2.5-Coder via RoPE base adjustment and YaRN scaling (Hui et al., 2024)
  • Vocabulary: 151,643 tokens (base/Coder), 64,000 (Amadeus-Verbo/PT), with multilingual byte-level BPE tokenizers
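
As a sanity check on the figures above, a back-of-the-envelope parameter count can be derived from the listed configuration (a rough sketch: biases, norm weights, and possible embedding tying are ignored, so it only approximates the stated ~2.98B):

```python
# Rough parameter-count estimate for Qwen2.5-3B from the configuration listed
# above. The per-matrix breakdown is an illustrative approximation (biases and
# norm weights omitted), not the official accounting.

L, d, d_ff = 36, 2304, 9216          # decoder blocks, hidden dim, FFN dim
n_q, n_kv, vocab = 16, 2, 151_643    # GQA query/KV heads, vocabulary size
head_dim = d // n_q                  # 144

# Attention: Q and O are d x d; K and V are d x (n_kv * head_dim) under GQA.
attn = 2 * d * d + 2 * d * (n_kv * head_dim)
# SwiGLU FFN uses three projections: gate and up (d -> d_ff), down (d_ff -> d).
ffn = 3 * d * d_ff
decoder = L * (attn + ffn)
embedding = vocab * d                # input embedding; output head may be tied

total = decoder + embedding
print(f"decoder ~{decoder/1e9:.2f}B, embedding ~{embedding/1e9:.2f}B, "
      f"total ~{total/1e9:.2f}B")
```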

Key algorithmic choices include Grouped Query Attention (GQA) to minimize KV cache and enable long context windows, SwiGLU activations in FFN, rotary position embeddings (RoPE) with QKV bias, and RMSNorm for stability. No mixture-of-experts (MoE) modules are present in the 3B variant (Qwen et al., 2024, Hui et al., 2024, Cruz-Castañeda et al., 20 May 2025).
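
The KV-cache saving from GQA can be made concrete with a small sketch (assuming bfloat16 storage and the head/layer dimensions listed above; the 16-KV-head case is a hypothetical full-MHA baseline for comparison):

```python
# Illustrative KV-cache size comparison: full multi-head attention (16 KV
# heads) vs the GQA setup above (2 KV heads). Assumes bfloat16 (2 bytes per
# value) and the layer/dimension figures listed in this section.

def kv_cache_bytes(n_kv_heads, head_dim=144, n_layers=36,
                   seq_len=32_768, bytes_per_val=2):
    # 2 tensors (K and V) per layer, each [seq_len, n_kv_heads * head_dim]
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_val

full = kv_cache_bytes(n_kv_heads=16)   # hypothetical full-MHA baseline
gqa = kv_cache_bytes(n_kv_heads=2)     # Qwen2.5-3B's GQA configuration
print(f"MHA: {full/2**30:.1f} GiB, GQA: {gqa/2**30:.1f} GiB, "
      f"reduction: {full // gqa}x")
```

The 8x reduction in cache memory is what makes 32K-token contexts practical on small GPUs.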

2. Pretraining Corpus and Methodology

Qwen2.5-3B is pretrained on a highly curated, multi-trillion-token corpus weighted toward quality and domain diversity (Qwen et al., 2024, Hui et al., 2024):

  • Total tokens: 18T (Qwen2.5), 5.5T (Qwen2.5-Coder), ~2T (Amadeus-Verbo/PT)
  • Sources: Public web crawls, Wikipedia, books, high-quality code (from GitHub and other repositories, especially for Coder), multilingual corpora (dominant in English, Chinese), mathematical data, and filtered synthetic content.
  • Filtering: Multidimensional scoring by in-house instruction models, domain rebalancing, exact+fuzzy deduplication, reward-model filtering, up-sampling science/technical data, and test-leakage exclusion through LCS and overlap constraints.

Pretraining employs standard autoregressive maximum likelihood estimation with sequence lengths scaling from 2,048 up to 32,768 (and 131,072 via architectural extrapolation for select variants) in mixed bfloat16 precision with AdamW optimization (Qwen et al., 2024, Hui et al., 2024, Cruz-Castañeda et al., 20 May 2025).
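
The objective is ordinary next-token cross-entropy; the following toy sketch (a hypothetical 4-token vocabulary, no real model involved) shows the quantity being minimized:

```python
# Minimal illustration of the autoregressive maximum-likelihood objective:
# the loss is the mean negative log-probability the model assigns to each
# next token. Toy logits and a 4-token vocabulary stand in for the real model.
import math

def causal_lm_loss(logits, targets):
    """logits: list of per-position score vectors; targets: next-token ids."""
    total = 0.0
    for scores, t in zip(logits, targets):
        z = math.log(sum(math.exp(s) for s in scores))  # log partition
        total += -(scores[t] - z)                       # -log p(target)
    return total / len(targets)

vocab = 4
uniform = [[0.0] * vocab] * 3        # a model with no preference
loss = causal_lm_loss(uniform, targets=[1, 2, 0])
print(round(loss, 4))                 # ln(4) ~ 1.3863 for uniform logits
```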

3. Post-Training: Supervised Fine-Tuning and Alignment

Multiple post-training strategies are implemented across Qwen2.5-3B variants (Qwen et al., 2024):

  1. Supervised Fine-Tuning (SFT): Performed on >1M high-quality instruction-following samples, covering chain-of-thought (CoT) reasoning, multi-language tasks, long document generation, and structured data synthesis. Training uses 2 epochs with context windows up to 32,768 tokens (for base); the learning rate is linearly annealed from 7×10⁻⁶ to 7×10⁻⁷.
  2. Preference-Based RL (DPO/GRPO): Direct Preference Optimization (DPO) is applied using ∼150k preference pairs to align outputs toward human preferences, avoiding reward model collapse. Online RL via Grouped Relative Policy Optimization (GRPO) integrates both human and synthetic preference labels (Qwen et al., 2024).
  3. Specialization via LoRA/NEFTune (Editor’s term): Intermediate weight adaptation for domain-specific fine-tunes (e.g., movie-dialogue, code generation) is performed using QLoRA adapters, 4-bit quantization, and NEFTune noise injection for regularization (Gupta, 22 Feb 2025).
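
The SFT learning-rate schedule in item 1 above can be sketched as a simple linear anneal (the step count below is an arbitrary placeholder; the report does not tie the schedule to a specific number of steps here):

```python
# Sketch of the linear learning-rate anneal described for SFT: 7e-6 down to
# 7e-7 over the course of training. total_steps is a placeholder assumption.

def linear_anneal(step, total_steps, lr_start=7e-6, lr_end=7e-7):
    frac = min(step / total_steps, 1.0)
    return lr_start + (lr_end - lr_start) * frac

total = 10_000  # placeholder step count
print(linear_anneal(0, total))       # starts at 7e-6
print(linear_anneal(total, total))   # ends at ~7e-7
```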

These post-training protocols enable tailored performance for languages (Amadeus-Verbo/PT), code (Qwen2.5-Coder-3B), and dialogue-centric tasks.
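
The DPO objective in item 2 follows the standard preference-margin formulation; a minimal sketch with toy log-probabilities (not actual model outputs) shows how the loss rewards widening the chosen-over-rejected margin relative to a frozen reference model:

```python
# Sketch of the standard DPO loss: -log sigmoid(beta * margin), where the
# margin is the policy's chosen-vs-rejected log-prob gap minus the reference
# model's gap. All log-probs here are toy numbers.
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy identical to reference: margin 0, loss = ln 2.
base = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy that prefers the chosen response more strongly than the reference.
better = dpo_loss(-8.0, -13.0, -10.0, -12.0)
print(round(base, 4), round(better, 4))
```

Because only the margin enters the loss, no separate reward model is trained, which is how DPO sidesteps the reward-model collapse mentioned above.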

4. Task Specialization and Downstream Application

Qwen2.5-3B’s open-weight foundation enables a suite of application-tuned and domain-specialized variants.

Code Generation (Qwen2.5-Coder-3B)

  • Architecture: 36 decoder layers, d=2048, embedding tying, adjusted RoPE base for context extension (Hui et al., 2024).
  • Pretraining mix: 5.5T tokens, 70% code, 20% natural text, 10% math.
  • Benchmarks (base):
    • HumanEval: 52.4% pass@1
    • MBPP: 72.2% (1-shot), 65.2% (3-shot)
    • MultiPL-E: 48.0% (eight languages)
    • CRUXEval Input-CoT: 46.5%
    • Long-context: recovers the “needle” in a 128K-token repository (needle-in-a-haystack evaluation)
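
The long-context extension via an adjusted RoPE base can be illustrated with the rotary wavelength formula (the base values below are common illustrative defaults, not the exact values used for Qwen2.5-Coder):

```python
# Illustrative RoPE frequency computation showing why raising the rotary base
# extends usable context: a larger base stretches the wavelengths of the
# low-frequency dimensions, so distant positions remain distinguishable.
import math

def rope_wavelengths(base, head_dim=144):
    # One rotary frequency per pair of dimensions: wavelength = 2*pi*base^(2i/d)
    return [2 * math.pi * base ** (2 * i / head_dim)
            for i in range(head_dim // 2)]

short = rope_wavelengths(base=10_000)     # a common default base
long_ = rope_wavelengths(base=1_000_000)  # an enlarged base for long context
print(f"longest wavelength: {short[-1]:.0f} vs {long_[-1]:.0f} positions")
```

YaRN-style scaling refines this further by interpolating only the dimensions whose wavelengths exceed the original training length.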

General Language and Reasoning

  • Instruction-tuned (English, multi-lingual):
    • MMLU (zero-shot): 65.6%
    • GSM8K (4-shot): 79.1% (base); up to 86.7% (instruction-tuned)
    • HumanEval: 42.1% (base); 74.4% (instruction-tuned)
    • MBPP: 72.7% (instruction-tuned)
    • MATH: 65.9% (instruction-tuned)

Portuguese/Multilingual Adaptation (Amadeus-Verbo-3B)

  • Fine-tuning: ~600k pt-BR instruction/data pairs
  • Benchmark results: On entailment (assin2_rte) F1 = 0.92; NLI (faquad_nli) F1 = 0.80; law (oab_exams) = 0.47 accuracy (Cruz-Castañeda et al., 20 May 2025).
  • Applications: PT chatbots, sentiment moderation, legal Q&A; matches or slightly outperforms base-instruct models in Portuguese

Dialogue Generation

  • Cornell Movie-Dialog corpus (movie dialogue generation): fine-tuned with QLoRA; DPO raises the coherence score from 0.35 (base) to 0.64 and lowers perplexity from 18.3 (base) to 13.9 (Gupta, 22 Feb 2025).

5. Quantization, Efficiency, and Deployment

Qwen2.5-3B is designed for efficient deployment in both research and production (Qwen et al., 2024, Gupta, 22 Feb 2025):

  • Quantized variants: 8-bit and 4-bit GPTQ-style quantization; 4-bit footprint ≈6 GB, enabling inference on modest GPUs (e.g., 8 GB VRAM); performance drop on HumanEval and MATH is limited to ~1–2 points vs. bfloat16.
  • LoRA/QLoRA: Adapter-based fine-tuning enables low-resource adaptation, training only a few million additional parameters while the base model remains frozen.
  • Efficiency/energy: Chain-of-thought (CoT) prompting yields minor but measurable energy savings (~0.058%) and comparable or lower memory/runtimes compared to other 3B and 7B models (Ashraf et al., 12 Sep 2025).
  • Serving: Containerized serving with modern stacks (FastAPI + FlashAttention) makes real-time use on a single 8 GB GPU practical for short-turn (≤64-token) generation (Gupta, 22 Feb 2025).
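
The few-million-parameter figure for LoRA can itself be sketched (rank 16 and four attention-projection targets per layer are assumptions; K/V projections are treated as d×d for simplicity, which slightly overcounts under GQA):

```python
# Back-of-the-envelope count of LoRA trainable parameters. Rank and target
# choices below are assumptions for illustration; the cited fine-tunes do not
# specify them here. All four projections are treated as d x d, so this
# slightly overcounts the smaller GQA K/V projections.

def lora_params(rank=16, d=2304, n_layers=36, targets_per_layer=4):
    # Each adapted d x d matrix gains low-rank factors A (d x r) and B (r x d).
    per_matrix = 2 * d * rank
    return n_layers * targets_per_layer * per_matrix

trainable = lora_params()
print(f"~{trainable/1e6:.1f}M trainable parameters vs ~2.98B frozen")
```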

6. Comparative Performance and Model Selection

Qwen2.5-3B surpasses previous-generation models of comparable size and matches or exceeds competitive open-source LLMs of similar or larger parameter count on most benchmarks, especially in code generation and multi-step reasoning (Qwen et al., 2024, Hui et al., 2024):

Metric     Qwen2.5-3B  Phi3.5-Mini (3.6B)  MiniCPM3-4B  Gemma2-2.6B
MMLU-Pro   43.7        47.5                43.0         52.2
MATH       65.9        48.5                46.6         —
GSM8K      86.7        86.2                81.1         30.3
HumanEval  74.4        72.6                74.4         19.5
MBPP       72.7        63.2                72.5         —

(— = not reported)

In specialized code completion, Qwen2.5-Coder-3B leads the open-source 3B segment (HumanEval-FIM 85.7%), and in green code generation it is notable for energy efficiency with CoT prompting (Ashraf et al., 12 Sep 2025).

7. Limitations and Deployment Recommendations

While Qwen2.5-3B delivers favorable trade-offs between quality and resource cost, it presents limitations:

  • Context window: effective performance up to 8,192 tokens for base chat and Portuguese variants, and up to 128K for code; longer sequences may degrade output quality in most settings.
  • Alignment: Instruction-tuned checkpoints improve factuality, structure, and safety; base models are unaligned, susceptible to hallucinations for unsupported domains (Cruz-Castañeda et al., 20 May 2025, Qwen et al., 2024).
  • Domain specificity: For high-stakes use (e.g., medical/legal), human-in-the-loop validation and output filters are recommended; world knowledge is limited to pretraining cutoff (2025).
  • Language/prompt sensitivity: performance varies with the prompt template; Portuguese results are best under the Amadeus-Verbo format, and code tasks benefit from dedicated prompting and candidate selection.

For deployment, open-weight and quantized models are hosted on Hugging Face, ModelScope, and Kaggle. Researchers are advised to integrate CoT prompting and monitor resource metrics for sustainability, particularly in large-volume code generation contexts (Ashraf et al., 12 Sep 2025).

