Ministral 3 Series Overview
- The Ministral 3 Series is a family of dense LLMs at 3B, 8B, and 14B parameters, each available in base, instruct, and reasoning variants.
- The models use a decoder-only Transformer architecture with grouped query attention and RoPE embeddings, and support extended context lengths up to 256K tokens.
- They achieve near-teacher performance through cascade distillation and efficient pruning, and integrate image understanding via a lightweight ViT encoder.
The Ministral 3 Series comprises a family of parameter-efficient, dense LLMs constructed for compute- and memory-constrained applications. Released in three model sizes—3B, 8B, and 14B parameters—each size is available in three variants: a base pretrained model, an instruction-finetuned checkpoint, and a reasoning model optimized for complex problem-solving. All models are derived via Cascade Distillation from the Mistral Small 3.1 24B teacher and incorporate image understanding capabilities, with weights and configurations distributed under the Apache 2.0 license (Liu et al., 13 Jan 2026).
1. Architectural Overview and Model Sizes
All models in the Ministral 3 Series adopt a decoder-only Transformer backbone featuring Grouped Query Attention, RoPE positional embeddings, SwiGLU activations, and RMSNorm. They support contexts up to 256K tokens (128K for reasoning variants) and use a 131K-token vocabulary. Embedding strategies vary by size: the 3B model ties its input and output embeddings, while the larger models keep them untied.
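The KV-cache savings from Grouped Query Attention come from letting several query heads share one key/value head. A minimal numpy sketch with the series' 32 Q / 8 KV head layout (a head_dim of 128 is an assumption, not stated in the text):

```python
import numpy as np

n_q_heads, n_kv_heads, head_dim, seq = 32, 8, 128, 16
group = n_q_heads // n_kv_heads  # 4 query heads share each KV head

q = np.random.randn(n_q_heads, seq, head_dim)
k = np.random.randn(n_kv_heads, seq, head_dim)

# Each KV head is repeated for its group of query heads; the KV cache
# stores only 8 heads instead of 32, a 4x reduction vs. full MHA.
k_expanded = np.repeat(k, group, axis=0)
scores = q @ k_expanded.transpose(0, 2, 1) / np.sqrt(head_dim)
print(scores.shape)  # (32, 16, 16): one attention map per query head
```

Only K is shown; V is shared the same way, so cache size scales with the 8 KV heads rather than the 32 query heads.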
| Model Size | Layers | Hidden Dim | Attention Heads | FFN Dim | Embedding Dim | Context Length | Embeddings |
|---|---|---|---|---|---|---|---|
| 14B | ~40 | 5120 | 32 Q / 8 KV | 16384 | 5120 | 256K | Untied |
| 8B | ~34 | 4096 | 32 Q / 8 KV | 14336 | 4096 | 256K | Untied |
| 3B | ~26 | 3072 | 32 Q / 8 KV | 9216 | 3072 | 256K | Tied (input/output) |
The 14B variant achieves >40% parameter reduction relative to its 24B teacher while retaining near-teacher performance.
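The >40% figure can be sanity-checked with a back-of-envelope parameter count from the table's dimensions. A head_dim of 128 and a teacher FFN dimension of 32768 are assumptions here, and norms and biases are ignored:

```python
# Rough transformer parameter count: attention projections, SwiGLU FFN,
# and embeddings. head_dim=128 and the teacher's FFN width (32768) are
# assumed, not taken from the table.
def approx_params(layers, hidden, ffn, n_q, n_kv, head_dim, vocab, tied):
    attn = hidden * n_q * head_dim          # Q projection
    attn += 2 * hidden * n_kv * head_dim    # K and V projections
    attn += n_q * head_dim * hidden         # output projection
    swiglu = 3 * hidden * ffn               # gate, up, down matrices
    emb = vocab * hidden * (1 if tied else 2)
    return layers * (attn + swiglu) + emb

student = approx_params(40, 5120, 16384, 32, 8, 128, 131072, tied=False)
teacher = approx_params(40, 5120, 32768, 32, 8, 128, 131072, tied=False)
reduction = 1 - student / teacher
print(f"{student/1e9:.1f}B vs {teacher/1e9:.1f}B -> {reduction:.0%} smaller")
```

Under these assumptions the student lands near 13.5B against a ~23.6B teacher, consistent with the >40% reduction claim; the FFN matrices dominate the savings.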
2. Model Variants and Training Objectives
For each size, three distinct checkpoints are released:
- Base (Pretrained): Trained for next-token prediction, using logits distilled from Mistral Small 3.1 (24B). The training set blends text-only and interleaved text+image streams, totaling 1–3T tokens. The distillation is conducted in two stages (short-context, then long-context).
- Instruct (Instruction-finetuned): Extended with supervised fine-tuning (SFT) on high-quality multimodal and text instruction datasets. The SFT stage employs fp8 quantization and logit distillation from a stronger teacher (Mistral Medium 3). Online Direct Preference Optimization (ODPO) further refines the model using a pairwise reward model and a two-sided DPO loss built on the implicit rewards $r_w = \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}$ and $r_l = \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}$ for chosen and rejected responses $y_w$ and $y_l$.
- Reasoning: Designed to improve chain-of-thought (CoT) performance through SFT on CoT data, Group Relative Policy Optimization (GRPO), and final ODPO polish. GRPO occurs in two stages: STEM RL (math/code/visual reasoning using an internal LLM judge) and General RL (broader chat/instruction with rubric-based rewards). The 3B model receives additional logit distillation from Magistral Small 1.2 to mitigate verbosity.
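For intuition, the pairwise preference objective underlying ODPO can be sketched as the standard DPO loss on sequence log-probabilities; the exact two-sided variant used for Ministral 3 may differ from this baseline form:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard pairwise DPO loss, -log sigmoid(r_w - r_l).

    r_w and r_l are the implicit rewards: beta-scaled log-prob ratios of
    the policy vs. a frozen reference, for chosen (w) and rejected (l)
    responses. Minimizing the loss widens the chosen-vs-rejected margin.
    """
    r_w = beta * (logp_w - ref_logp_w)
    r_l = beta * (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-(r_w - r_l))))

# Policy already prefers the chosen answer more than the reference does,
# so the loss drops below the log(2) it takes at zero margin.
loss = dpo_loss(logp_w=-5.0, logp_l=-12.0,
                ref_logp_w=-8.0, ref_logp_l=-9.0, beta=1.0)
```

At zero margin the loss equals log 2; it decreases monotonically as the policy's preference for the chosen response grows relative to the reference.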
3. Cascade Distillation: Pruning and Knowledge Transfer
The training of Ministral 3 relies on Cascade Distillation, which iteratively prunes and distills from the 24B teacher through progressively smaller models. The procedure is:
```
teacher ← Mistral_Small_3.1
for size in [14B, 8B, 3B]:
    student₀ ← prune(teacher, target=size)
    student₁ ← distill(student₀, teacher, ctx=16K)
    student_final ← distill(student₁, teacher, ctx=256K)
    yield student_final
    teacher ← student₁  # cascade to next size
```
- Layer Pruning: Layers are ranked by the mean ratio of output norm to input norm, and layers are removed according to this ranking until the target depth is reached.
- Hidden-dim Pruning: Principal Component Analysis (PCA) is applied across layer-norm inputs, followed by a global rotation and projection.
- FFN Pruning: Each SwiGLU hidden dimension is assigned an importance score, and only the top-$k$ scoring dimensions are retained.
- Distillation: The knowledge distillation loss per token is the forward KL between the teacher and student next-token distributions:
  $$\mathcal{L}_{\mathrm{KD}} = \mathrm{KL}(p_T \,\|\, p_S) = \sum_{v \in \mathcal{V}} p_T(v \mid x_{<t}) \log \frac{p_T(v \mid x_{<t})}{p_S(v \mid x_{<t})}$$
  The process omits a next-token cross-entropy term; the distillation signal alone drives training.
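The per-token forward KL can be computed directly from the two next-token distributions; the toy three-word vocabulary below is illustrative:

```python
import math

def forward_kl(p_teacher, p_student):
    """Forward KL(p_T || p_S) over a vocabulary, as applied per token."""
    return sum(p * math.log(p / q)
               for p, q in zip(p_teacher, p_student) if p > 0)

p_t = [0.7, 0.2, 0.1]   # toy teacher next-token distribution
p_s = [0.6, 0.3, 0.1]   # toy student distribution
kl = forward_kl(p_t, p_s)
```

Using the forward direction KL(p_T ∥ p_S) penalizes the student wherever it under-weights tokens the teacher assigns mass to, pushing the student to cover the full teacher distribution.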
This iterative pipeline allows each smaller student to inherit knowledge from the previous model in the sequence, optimizing parameter efficiency at every step.
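As a sketch of the layer-ranking step, assume per-layer input/output activation norms collected on calibration data, with the lowest-ranked layers pruned first (the direction of the criterion and all numbers here are assumptions for illustration):

```python
# Hypothetical per-layer activation norms: layer -> (mean input norm,
# mean output norm). The ranking criterion (mean output/input ratio)
# follows the text; the values are invented.
norms = {
    0: (1.0, 1.9), 1: (1.2, 1.3), 2: (1.1, 2.2), 3: (1.3, 1.35),
}
scores = {layer: out / inp for layer, (inp, out) in norms.items()}

target_depth = 2
# Keep the layers whose outputs perturb the residual stream most
# (highest ratio); prune the rest, then restore original ordering.
kept = sorted(sorted(scores, key=scores.get, reverse=True)[:target_depth])
print(kept)
```

In this toy setup layers 1 and 3, whose outputs barely differ from their inputs in norm, are the ones removed.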
4. Evaluation and Comparative Performance
Comprehensive evaluation highlights smooth scaling behavior and strong performance relative to both the teacher and competing LLMs:
- Pretrained Base vs. Teacher & Peers (5-shot):
| Model | MMLU | MATH (CoT 2-shot) |
|---|---|---|
| Mistral Small 24B | 81.0 | 55.8 |
| Ministral 3 14B | 79.4 | 67.6 |
| Ministral 3 8B | 76.1 | 62.6 |
| Ministral 3 3B | 70.7 | 60.1 |
- Comparison to Qwen 3 & Gemma 3 (Base):
- Ministral 3 14B outperforms Qwen 3 14B on TriviaQA (74.9 vs 70.3) and MATH (67.6 vs 62.0).
- Ministral 3 8B surpasses Gemma 3 12B on most tasks, underscoring parameter efficiency.
- Instruction-finetuned Evaluation (Arena Hard / WildBench / MATH maj@1 / MM-MTBench):
- Ministral 3 14B: 55.1 / 68.5 / 90.4 / 84.9
- Qwen 3 14B: 42.7 / 65.1 / 87.0 / –
- Gemma 3 12B: 43.6 / 63.2 / 85.4 / 67.0
- Reasoning Benchmarks (pass@16, LiveCodeBench@5):
- AIME 2024: Ministral 3 14B 89.8, Qwen 3 14B 83.7
- HMMT 2025: 14B 67.5, Qwen 3 14B 55.8
- LiveCodeBench: 14B 64.6, Qwen 3 14B 59.3
Performance scales smoothly with parameter count; the 14B variant approaches the performance of significantly larger teacher models.
5. Vision Capabilities and Multimodal Evaluation
All Ministral 3 variants possess image understanding abilities via a frozen 410M-parameter Vision Transformer (ViT) encoder derived from Mistral Small 3.1 Pixtral. The original projection layer of Pixtral is discarded; a new lightweight projection layer is trained per student. Training leverages interleaved image+text data in both distillation and SFT.
- Multimodal Benchmark Results (2-shot):
- MMMU: 14B achieves 59.9 (teacher 59.1)
- MathVista: 14B reaches 43.6 (teacher 51.3, attributed to domain shift)
This setup enables efficient multimodal reasoning in a parameter-constrained regime.
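The trainable projection is essentially a single linear map from ViT patch embeddings into the decoder's hidden space. In this numpy sketch the ViT width (1024) and patch count (196) are assumed for illustration; 3072 matches the 3B student's hidden size:

```python
import numpy as np

vit_dim, hidden_dim, n_patches = 1024, 3072, 196  # vit_dim/n_patches assumed

rng = np.random.default_rng(0)
patch_embeddings = rng.standard_normal((n_patches, vit_dim))  # frozen ViT output
W_proj = rng.standard_normal((vit_dim, hidden_dim)) * 0.02    # trained per student

# Projected patches enter the decoder's sequence like ordinary text tokens.
image_tokens = patch_embeddings @ W_proj
print(image_tokens.shape)  # (196, 3072)
```

Because only W_proj (and the LLM) train while the encoder stays frozen, each student needs just this small adapter to match its own hidden width.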
6. Licensing, Deployment, and Practical Considerations
The entire Ministral 3 Series (three model sizes × three variants) is offered under the Apache 2.0 license, facilitating open-weight usage, modification, and commercial deployment. All checkpoints and configurations are accessible via HuggingFace, with supplementary technical documentation provided on the Mistral AI website.
The models are optimized for practical deployment on devices with limited compute and constrained GPU memory, with native support for fp16 and int8 quantization. This parameter and FLOP efficiency broadens deployment options for sophisticated language and multimodal reasoning tasks in resource-limited environments (Liu et al., 13 Jan 2026).
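As a sketch of what int8 support implies, symmetric per-tensor quantization (a common scheme; the deployment stack's exact method is not specified here) round-trips weights with error bounded by half a quantization step:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: map [-max|w|, max|w|] to [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.linspace(-1.0, 1.0, 9, dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Round-trip error stays within half a quantization step of the original.
max_err = np.abs(w - w_hat).max()
```

Storing q plus one scale per tensor cuts weight memory to a quarter of fp32 (half of fp16), which is what makes the smaller variants practical on constrained GPUs.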