DeepSeek-R1 Distilled Models Overview
- DeepSeek-R1 distilled models are compact large language models derived via knowledge distillation from high-parameter, reasoning-optimized MoE architectures.
- They balance reasoning accuracy with efficiency, performing robustly across domains like mathematical reasoning, coding, biomedical analysis, and financial evaluation.
- Innovative distillation methods, including temperature-scaled KL divergence and hybrid SFT–RL pipelines, enable these models to retain complex chain-of-thought and self-verification capabilities with lower latency.
DeepSeek-R1 Distilled Models
DeepSeek-R1 distilled models are a family of compact LLMs derived via knowledge distillation from the DeepSeek-R1 line of high-parameter, reasoning-optimized MoE LLMs. These models aim to retain the advanced chain-of-thought (CoT) reasoning, self-verification, and stepwise inferential capabilities of their parent architectures while offering the lower latency and reduced compute requirements needed for edge and cloud inference. They are deployed across a wide range of domains, including mathematical reasoning, coding, biomedical and financial analysis, natural language evaluation, and safety-critical applications. The DeepSeek-R1 distilled model line spans several parameter scales, built primarily on Qwen2.5 (1.5B, 7B, 14B, 32B) and Llama-3.x (8B, 70B) backbones along with application-specific derivatives, each inheriting architectural conventions and pretraining/fine-tuning regimens from its respective teacher–student pair.
1. Distillation Architectures and Training Paradigms
DeepSeek-R1 distilled models are constructed from dense Transformer backbones—Qwen2.5, Llama-3.x, or vertical derivatives (e.g., for biomedical or tabular reasoning)—by distilling reasoning behaviors from much larger MoE-based DeepSeek-R1 teachers. The teacher models, exemplified by the 671B-parameter (37B-activated) R1-Zero architecture, are trained using RL (Group Relative Policy Optimization, GRPO), yielding complex emergent reasoning structures such as explicit stepwise CoT, self-verification, and dynamic multi-strategy descent (DeepSeek-AI et al., 22 Jan 2025).
The distillation methodology typically involves supervised fine-tuning (SFT) of the student on teacher-generated CoT traces, often with an additional Kullback–Leibler (KL) divergence loss between softened student and teacher logits at a specified temperature: L_KD = (1 − α)·L_CE + α·T²·KL(p_t ‖ p_s), where p_t = softmax(z_t/T) and p_s = softmax(z_s/T) are the teacher/student probabilities, T is the temperature, and α tunes the balance between the hard-label and distillation terms. Larger models (14B, 32B, 70B) reliably retain most teacher performance, especially for reasoning-centric tasks, while ultra-light variants (1.5B–8B) exhibit more substantial degradation, particularly for complex or formal reasoning (Jahin et al., 13 Mar 2025, Zhao et al., 16 Feb 2025).
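A minimal per-example sketch of this temperature-scaled distillation objective (NumPy only; the temperature and mixing weight are illustrative defaults, not values reported in the cited papers):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax; subtracting the max keeps exp() stable."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, target, T=2.0, alpha=0.5):
    """(1 - alpha) * CE(student, target) + alpha * T^2 * KL(p_t || p_s)."""
    p_s = softmax(student_logits, T)
    p_t = softmax(teacher_logits, T)
    # Hard-label cross-entropy at T = 1
    ce = -np.log(softmax(student_logits)[target] + 1e-12)
    # Softened KL divergence, scaled by T^2 so its gradient magnitude
    # stays comparable to the cross-entropy term as T varies
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)))
    return (1 - alpha) * ce + alpha * T**2 * kl
```

When the student's logits match the teacher's, the KL term vanishes and only the hard-label cross-entropy remains, which is the sanity check usually applied to such implementations.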
Advanced or domain-specific variants may further apply LoRA (Zhang et al., 25 Apr 2025), knowledge distillation with negative signal utilization (Xu et al., 30 May 2025), instruction-driven hybrid datasets (for capability preservation) (Team et al., 9 Apr 2025), multi-stage SFT–DPO–RLVR pipelines, or synthetic CoT tree generation (via MCTS) (Yin et al., 3 Mar 2025). Specialization to vertical domains such as finance (Liu et al., 20 Mar 2025), biomedicine (Zhan et al., 1 Mar 2025), table reasoning (Yang et al., 29 May 2025), and medical QA (Zhang et al., 25 Apr 2025) is achieved by distilling domain-verified CoT traces and leveraging curated, instruction-verified corpora.
2. Quantitative Reasoning, Efficiency, and Scaling Properties
Systematic benchmarking on mathematics and logic datasets (e.g., MATH, AIME, GSM8K, GPQA, LiveCodeBench, MMLU Formal Logic) reveals substantial trade-offs between model size, inference speed, and reasoning fidelity (Jahin et al., 13 Mar 2025, DeepSeek-AI et al., 22 Jan 2025, Zhao et al., 16 Feb 2025):
| Model | MATH (%) | GSM8K (%) | MMLU Formal Logic (%) | Latency (s/query) |
|---|---|---|---|---|
| DeepSeek-R1 (MoE, 671B) | 90.45 | 96.13 | 97.62 | 81.0 |
| DeepSeek-1.5B | 65.64 | 81.12 | 47.62 | 5.0 |
- Accuracy drops sharply (e.g., ΔMATH = –24.81, ΔLogic = –50.0) for the 1.5B student, particularly on tasks demanding multistep logical or symbolic reasoning, while latency improves by ≈16×.
- Middle-tier and larger students (7B, 14B, 32B, 70B) retain a high fraction of teacher performance, with empirical performance tiering observed: 32B/70B students reach B–A tier on composite benchmarks, whereas 1.5B–8B variants cluster in C or D for logical reasoning (Zhao et al., 16 Feb 2025).
- Distillation retains superior parameter efficiency over conventional SFT or RL on equivalently sized LLMs, but smaller models (≤7B) show persistent performance gaps despite optimization (Anjum, 30 Apr 2025).
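The deltas and speedup quoted above follow directly from the benchmark table:

```python
# Figures from the table above (Jahin et al., 13 Mar 2025)
teacher = {"MATH": 90.45, "GSM8K": 96.13, "logic": 97.62, "latency_s": 81.0}
student = {"MATH": 65.64, "GSM8K": 81.12, "logic": 47.62, "latency_s": 5.0}

delta_math = student["MATH"] - teacher["MATH"]         # -24.81 points
delta_logic = student["logic"] - teacher["logic"]      # -50.0 points
speedup = teacher["latency_s"] / student["latency_s"]  # 16.2x faster
```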
In application-driven benchmarks (A-Eval-2.0, biomedical, financial, table reasoning), the best distilled variants frequently outperform base models by significant margins in logical reasoning, planning, and NER/classification, demonstrating clear scaling gains with parameter count and data quality (Zhan et al., 1 Mar 2025, Liu et al., 20 Mar 2025, Yang et al., 29 May 2025). Performance, however, plateaus beyond 32B–70B, and diminishing returns are observed in general language tasks.
3. Specialization, Vertical Distillation, and Domain Transfer
DeepSeek-R1 distilled models have been adapted to domain-specific, resource-constrained, or safety-critical settings:
Domain-Specific Distillation
- Financial Reasoning: Fin-R1 uses DeepSeek-R1 for CoT generation and distills into Qwen2.5-7B via SFT and GRPO, achieving 76.0% on FinQA and 85.0% on ConvFinQA, outperforming both vanilla 7B models and standard distillations by wide margins (Liu et al., 20 Mar 2025).
- Biomedical NLP: Distilled R1 models (7B–70B) outperform Llama3-8B and Mistral-7B on event/relation extraction, NER, and text classification. F1-scores for event extraction (PHEE) often exceed 0.95 for the best variants (Zhan et al., 1 Mar 2025).
- Medical QA: Through LoRA-augmented fine-tuning and multi-term distillation objectives (cross-entropy, KL, MSE, entity-level L1), the DeepSeek-R1-Distill-7B medical model achieves >92% USMLE Step 1 accuracy while reducing memory by 64.7% and inference latency by 12.4% versus standard 7B (Zhang et al., 25 Apr 2025).
- Table Reasoning: Table-R1-SFT, a 7B-parameter Qwen or Llama-3.1 model distilled from 33k DeepSeek-R1 traces, achieves dramatic EM/accuracy gains on multiple tabular benchmarks both in-domain and OOD, demonstrating structure-aware generalization beyond SFT baselines (Yang et al., 29 May 2025).
Safety-Aligned Distillation
- RealSafe-R1: By SFT on 15k DeepSeek-R1-generated “safe reasoning” trajectories, RealSafe-R1 models (1.5B–32B) achieve large boosts in refusal rates under adversarial prompting (e.g., on XSTest unsafe prompts, Full Refusal rises from 26.5% to 81.0%, and on WildChat unsafe prompts from 49.6% to 67.8%) while preserving or modestly improving reasoning benchmark scores (Zhang et al., 14 Apr 2025).
- Chinese Contexts: Distilled R1 series show mild safety degradations compared to base models, especially for discrimination domain prompts, but targeted safety fine-tuning on 50k safety+CoT mixtures fully recovers and exceeds base performance without notable reasoning loss (Zhang et al., 18 Mar 2025).
- General Safety: Aggressive chain-of-thought distillation may erode pre-existing RLHF-aligned safety, leading to higher attack success rates and reduced refusal on unsafe prompts, as observed in distilled R1-70B (Zhou et al., 18 Feb 2025); mitigation requires large-scale, refusal-focused distillation and joint objective optimization.
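The refusal-rate figures above are simple proportions over a fixed unsafe-prompt set. A minimal evaluation harness might look like the following (the keyword-based refusal check is a toy stand-in; the cited evaluations use trained judge models):

```python
def full_refusal_rate(responses, is_refusal):
    """Fraction of responses to unsafe prompts classified as full refusals."""
    refusals = sum(1 for resp in responses if is_refusal(resp))
    return refusals / len(responses)

def keyword_refusal(resp):
    """Toy refusal detector: flags common refusal phrases."""
    return any(k in resp.lower() for k in ("i can't", "i cannot", "i won't"))

rate = full_refusal_rate(
    ["I can't help with that.", "Sure, here is how...", "I cannot assist."],
    keyword_refusal,
)
```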
4. Distillation Loss Functions, Learning Objectives, and Pipeline Innovations
Standard DeepSeek-R1 distillation recipes are centered around SFT on teacher CoT outputs with (optionally) a temperature-scaled KL-divergence loss. Variants extend this paradigm:
- Hybrid and Layer-wise Distillation: Layer-level objectives and contrastive/margin-based alignment to teacher attention maps or top-k reasoning trajectories are advanced as future enhancements (Jahin et al., 13 Mar 2025).
- Joint SFT–RL: For domains such as finance and table reasoning, reinforcement loops (GRPO with group-relative advantage normalization and accuracy/format rewards) are used post-SFT to recover self-verification abilities otherwise lost in compression (Liu et al., 20 Mar 2025, Yang et al., 29 May 2025).
- Preference Optimization: Two-stage SFT–DPO (or REDI) pipelines exploit both positive and negative CoT traces, stabilizing updates and boosting robustness (Qwen-REDI-1.5B reaches 83.1% on MATH-500 with just 131k open examples, matching proprietary-distilled peers) (Xu et al., 30 May 2025).
- Tree-Based and Length-Balanced CoT: MCTS-based tree CoT generation diversifies reasoning chains and, coupled with thoughts-length balancing and conservative DPO, reduces hallucinations and “over-thinking” in compact distilled students (Yin et al., 3 Mar 2025).
- Catastrophic Forgetting Mitigation: “Holistic capability preservation” in models such as Ring-Lite-Distill is achieved via two-stage SFT (reasoning-only, then hybrid general+reasoning), followed by DPO—a regimen that recovers and even exceeds general-skill performance otherwise lost in pure CoT distillation (Team et al., 9 Apr 2025).
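The group-relative advantage normalization used in the GRPO loops above can be sketched as follows (a simplified illustration with hypothetical reward values; GRPO normalizes each rollout's reward against its own sampled group rather than a learned critic):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each rollout's scalar reward
    against the mean and std of its group of rollouts for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# One prompt, a group of 4 sampled rollouts scored by accuracy/format rewards
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# Correct rollouts receive positive advantage, incorrect ones negative
```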
5. Empirical Performance, Domain Generalization, and Interpretability
Extensive benchmarking, ablation, and mechanistic analysis establish the following properties:
- Performance Scaling: For challenging reasoning and planning tasks, larger distilled students (32B–70B) achieve A/B-tier accuracy, with diminishing returns above 32B for most practical applications; smaller variants remain highly effective for certain discriminative use cases despite lower overall scores (Zhao et al., 16 Feb 2025, Anjum, 30 Apr 2025).
- Domain Transfer Robustness: Instruction SFT using DeepSeek-R1 traces significantly boosts OOD generalization in tabular reasoning, planning, and math (e.g., Table-R1-SFT outperforms baselines on 8/9 OOD table datasets; Fin-R1 surpasses 14B/32B distills in financial tasks) (Yang et al., 29 May 2025, Liu et al., 20 Mar 2025).
- Discriminator vs Generator Roles: Ultra-compact R1 distills (1.5B) outperform code LLMs (e.g., CodeLlama-13B) as discriminators in text-to-SQL frameworks, while underperforming as generators—signaling a distinct separation between CoT-powered judgment and sequence generation (Anjum, 30 Apr 2025).
- Reasoning Integration: Mechanistic studies confirm that explicit reasoning traces causally modulate answer generation, with answer tokens attending strongly to previous reasoning tokens and intervention on reasoning-layer activations flipping final outcomes, especially for smaller models (Zhang et al., 28 Sep 2025).
- Precision–Recall and Task-Specific Guidance: Distilled models maintain balanced precision and recall for NER/relation extraction in biomedical settings, with 32B variants striking the best balance under compute constraints (Zhan et al., 1 Mar 2025).
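The precision–recall balance noted for biomedical extraction reduces to standard entity-level F1 over predicted vs. gold spans (the spans below are illustrative, not drawn from the cited benchmarks):

```python
def span_f1(pred_spans, gold_spans):
    """Micro precision/recall/F1 over sets of (start, end, type) spans."""
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)  # exact-match true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 2 of 3 predictions are correct; 2 of 4 gold spans are found
p, r, f = span_f1(
    {(0, 2, "GENE"), (5, 7, "DRUG"), (9, 11, "DRUG")},
    {(0, 2, "GENE"), (5, 7, "DRUG"), (12, 14, "GENE"), (20, 22, "DRUG")},
)
```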
6. Limitations, Trade-Offs, and Future Directions
Distillation from massive reasoning LLMs into smaller students fundamentally faces a capability–efficiency trade-off:
- Reasoning Degradation: Aggressive compression to 1.5B–8B removes critical reasoning pathways, causing marked degradation on formal logic, competition math, and “hard” instances (ΔAccuracy up to –50 points versus teacher) (Jahin et al., 13 Mar 2025).
- Safety Regression: Standard distillation pipelines that prioritize reasoning accuracy generally reduce built-in safety alignment, leading to increased vulnerability to unsafe queries and adversarial attacks (Zhou et al., 18 Feb 2025, Zhang et al., 18 Mar 2025). Robust solutions require explicit incorporation of refusal-focused and process-level (not just output-level) safety objectives.
- Catastrophic Forgetting: SFT on reasoning-only corpora leads to loss of instruction-following and tool use skills; two-stage hybrid fine-tuning, as in Ring-Lite-Distill, can restore generality without reasoning collapse (Team et al., 9 Apr 2025).
- Parameter/Task Fit: For ultra-constrained scenarios, careful selection of application–model size pairs and possibly quantization or LoRA compression are recommended, bearing in mind non-uniform impact on reasoning vs. generation tasks (Zhao et al., 16 Feb 2025, Zhang et al., 25 Apr 2025).
- Open Problems: Effective chain-of-thought distillation with holistic supervision, scalable safety alignment, robust multi-domain generalization, and minimal performance loss in ultra-light footprints remain active research challenges. Structured data generation (e.g., MCTS, tree CoT), negative-signal utilization (REDI), and hybrid teacher–student training (with RL or contrastive alignment) are among the promising lines for further exploration (Xu et al., 30 May 2025, Yin et al., 3 Mar 2025).
7. Recommendations and Best Practices
- For maximal reasoning quality at moderate computational cost, utilize 14B–32B Qwen or 70B Llama distilled students, selecting the backbone to match deployment constraints and domain requirements.
- Employ RL-based preference optimization (e.g., GRPO, DPO, REDI), with attention to difficulty balancing and roll-out pruning, to recover teacher-like self-reflection when feasible.
- For safety-sensitive deployments, rely on safety-aligned SFT using explicit refusal-annotated reasoning traces and consider multi-objective optimization that jointly targets reasoning depth and robust refusal.
- Preserve chain-of-thought and intermediate state alignment in the loss, augmenting SFT with margin-based, layer-wise, or attention-supervised terms as model and data permit.
- For generalized or hybrid task needs, favor pipelines that blend large-scale reasoning SFT with balanced general-skills reinforcement (cf. holistic capability preservation).
- Systematically monitor performance trade-offs versus teacher scale, especially in OOD, formal logic, and adversarial settings, to guide model selection and further fine-tuning.
References:
- (Jahin et al., 13 Mar 2025)
- (DeepSeek-AI et al., 22 Jan 2025)
- (Zhao et al., 16 Feb 2025)
- (Liu et al., 20 Mar 2025)
- (Zhan et al., 1 Mar 2025)
- (Anjum, 30 Apr 2025)
- (Xu et al., 30 May 2025)
- (Zhang et al., 18 Mar 2025)
- (Zhang et al., 14 Apr 2025)
- (Yin et al., 3 Mar 2025)
- (Team et al., 9 Apr 2025)
- (Yang et al., 29 May 2025)
- (Zhang et al., 25 Apr 2025)
- (Zhang et al., 28 Sep 2025)
- (Zhang et al., 1 May 2025)