
Magistral

Published 12 Jun 2025 in cs.CL | (2506.10910v1)

Abstract: We introduce Magistral, Mistral's first reasoning model and our own scalable reinforcement learning (RL) pipeline. Instead of relying on existing implementations and RL traces distilled from prior models, we follow a ground up approach, relying solely on our own models and infrastructure. Notably, we demonstrate a stack that enabled us to explore the limits of pure RL training of LLMs, present a simple method to force the reasoning language of the model, and show that RL on text data alone maintains most of the initial checkpoint's capabilities. We find that RL on text maintains or improves multimodal understanding, instruction following and function calling. We present Magistral Medium, trained for reasoning on top of Mistral Medium 3 with RL alone, and we open-source Magistral Small (Apache 2.0) which further includes cold-start data from Magistral Medium.

Summary

  • The paper introduces a novel, scalable reinforcement learning pipeline developed from scratch to boost LLM reasoning in math and coding tasks.
  • It employs an optimized GRPO algorithm that removes the KL penalty and innovates with advanced normalization and reward shaping techniques.
  • Results indicate significant performance gains on benchmarks, including cross-domain improvements and preserved multimodal reasoning abilities.

Magistral (2506.10910) introduces Mistral's first reasoning models, Magistral Medium and Magistral Small, developed using a novel and scalable reinforcement learning (RL) pipeline. The paper details a ground-up approach to RL for LLMs focused on enhancing reasoning abilities, particularly for mathematical and coding tasks, without relying on distillation from pre-existing reasoning models.

The core of the methodology is an optimized version of the Group Relative Policy Optimization (GRPO) algorithm. Key modifications include removing the KL divergence penalty, normalizing the loss by the total token length within a generation group, normalizing advantages sequence-wise within minibatches, and relaxing the upper bound of the clipping threshold ($\varepsilon_{\text{high}}$) to encourage exploration and prevent entropy collapse. Groups with zero advantage (all generations correct or all incorrect) are filtered out to improve gradient quality.
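The modified objective described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, tensor shapes, and default clip values are assumptions; it shows the asymmetric clipping, the absence of a KL term, and normalization by the group's total token count.

```python
import torch

def grpo_loss(logprobs, old_logprobs, advantages, mask,
              eps_low=0.2, eps_high=0.3):
    """Sketch of a GRPO-style loss with the modifications described above.

    logprobs, old_logprobs: (G, T) token log-probs for a group of G generations
    advantages: (G,) sequence-level advantages (already normalized)
    mask: (G, T) 1 for valid completion tokens, 0 for padding
    eps_high > eps_low relaxes the upper clip bound to encourage exploration;
    no KL-divergence penalty term is included.
    """
    ratio = torch.exp(logprobs - old_logprobs)              # (G, T)
    adv = advantages.unsqueeze(-1)                          # broadcast per token
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * adv
    per_token = torch.minimum(unclipped, clipped) * mask
    # Normalize by the total token count of the whole group, not per sequence,
    # so long generations are not down-weighted relative to short ones.
    return -per_token.sum() / mask.sum()
```

Groups whose advantages are all zero (every generation correct, or every generation incorrect) would simply be dropped before this loss is computed.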

Reward shaping is crucial for training. The reward function evaluates generations based on four criteria:

  1. Formatting: Checking for the required `<think>`/`</think>` tags, `\boxed{}` for math, and markdown code blocks for code. Failure yields 0 reward; success grants 0.1.
  2. Correctness: For math, the final answer within \boxed{} is extracted and verified using rule-based parsing and SymPy. For code, the first markdown block is compiled and executed against 20 random test cases with timeouts and memory limits. A correct answer/passing code grants an additional 0.9 reward (total 1.0).
  3. Length Penalty: A soft penalty is applied near the maximum allowed completion length to discourage hitting the hard cutoff.
  4. Language Consistency: For a subset of translated problems (10% into French, Spanish, Italian, German, Chinese, Russian), a fastText classifier checks if the prompt, thoughts, and answer are in the same language, providing an additional 0.1 reward if consistent. This simple technique effectively enables multilingual reasoning.
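The four criteria above compose into a single scalar reward. The sketch below is illustrative only: the function name and arguments are hypothetical, and the soft length penalty's window and scale are assumptions (the paper specifies a soft penalty near the cutoff but the exact schedule is not reproduced here).

```python
def shaped_reward(format_ok, answer_correct, length, max_length,
                  cutoff_margin=512, language_consistent=None):
    """Sketch of the four-part reward described above.

    format_ok: required tags / \\boxed{} / code fences present
    answer_correct: math answer verified, or all test cases passed
    language_consistent: True/False for the multilingual subset, else None
    cutoff_margin and the 0.5 penalty scale are assumed, not from the paper.
    """
    if not format_ok:
        return 0.0                            # formatting failure: no reward
    reward = 0.1                              # formatting reward
    if answer_correct:
        reward += 0.9                         # correctness reward (total 1.0)
    # Soft length penalty: ramp down linearly inside the margin before the
    # hard completion-length cutoff.
    if length > max_length - cutoff_margin:
        overflow = length - (max_length - cutoff_margin)
        reward -= 0.5 * min(1.0, overflow / cutoff_margin)
    if language_consistent:
        reward += 0.1                         # language-consistency bonus
    return reward
```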

The system prompt provided to the model explicitly outlines the required format and language constraints, influencing the model's behavior during RL training.

A significant contribution is the scalable asynchronous RL infrastructure designed for efficiency on large GPU clusters. It consists of three types of workers: Trainers (perform gradient updates), Generators (create completions), and Verifiers (evaluate completions and assign rewards). To handle heterogeneous generation lengths and maximize throughput, generators operate continuously without waiting for trainers. Updated weights are broadcast from trainers to generators via NCCL mid-generation, with the key-value cache not being explicitly refreshed. Batches are defined by a fixed number of completions, and greedy collation of microbatches (fixed token size) within minibatches optimizes GPU utilization. The system manages the trade-off between throughput and on-policyness by controlling the number of concurrent generations relative to the batch size.
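The greedy collation step can be sketched as a simple bin-packing pass: variable-length completions are packed into microbatches whose total token count stays under a fixed budget. The function name and the longest-first heuristic are assumptions for illustration, not the paper's exact scheme.

```python
def collate_microbatches(sequences, token_budget):
    """Greedily pack variable-length sequences into microbatches so that
    each microbatch's total token count stays within token_budget,
    improving GPU utilization on heterogeneous generation lengths."""
    # Sort longest-first so long sequences seed their own microbatches
    # instead of being left over at the end.
    pending = sorted(sequences, key=len, reverse=True)
    microbatches = []
    for seq in pending:
        for mb in microbatches:
            if sum(len(s) for s in mb) + len(seq) <= token_budget:
                mb.append(seq)
                break
        else:
            microbatches.append([seq])      # no existing batch fits: open one
    return microbatches
```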

Data curation focuses on math and code problems with verifiable solutions. A two-stage difficulty filtering process was applied to the math data, first using Mistral Large 2 and then a preliminary RL-trained model, to select problems of suitable difficulty (38k samples retained from ~700k). Code data from competitive programming platforms was filtered for test-case availability and solution agreement, yielding 35k problems, some duplicated as both C++ and Python variants.
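The core of each filtering stage is keeping problems a reference model solves sometimes but not always. A minimal sketch, where `solve_rate` is an assumed callable returning the fraction of sampled attempts that pass, and the thresholds and sample count are illustrative rather than the paper's values:

```python
def filter_by_difficulty(problems, solve_rate, low=0.05, high=0.95, samples=16):
    """Keep problems of intermediate difficulty for a reference model.

    solve_rate(problem, samples) -> fraction of sampled attempts that are
    correct. Problems the model always solves (too easy) or never solves
    (too hard) carry little learning signal and are dropped.
    """
    kept = []
    for p in problems:
        rate = solve_rate(p, samples)
        if low < rate < high:
            kept.append(p)
    return kept
```

Running this once with a weaker model and again with a preliminary RL-trained model mirrors the two-stage process described above.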

The paper presents two main models:

  • Magistral Medium: Trained with pure RL on top of Mistral Medium 3. Achieved substantial performance gains (e.g., a nearly 50-point increase in AIME'24 pass@1) over the base model, showcasing the strength of the from-scratch RL stack.
  • Magistral Small: Trained with SFT on reasoning traces generated by Magistral Medium, followed by RL on Mistral Small 3. This approach yielded the best performance for the 24B model, demonstrating that RL provides significant benefits even on smaller models, complementing or exceeding the gains from SFT distillation alone.

Results on various benchmarks (AIME, MATH, LiveCodeBench, GPQA, Humanity's Last Exam, Aider Polyglot) show Magistral models achieving high performance on reasoning and coding tasks. Multilingual evaluation on translated AIME'24 problems demonstrates effective reasoning and answering in the user's language with minimal performance drop.

Analysis of the RL training process revealed that weight updates primarily occur in a low-dimensional space, with a prominent "length" direction correlated with reward and generation length. Raw reward was observed to scale logarithmically with output length. Unexpectedly, RL training on text-only data maintained or even improved multimodal reasoning capabilities on benchmarks like MMMU and MMMU-Pro, suggesting cross-modal transfer of learned reasoning processes. Function calling and instruction following abilities were also preserved or slightly enhanced.

Ablation studies investigated training choices:

  • Cross-domain transfer was observed, with training on math improving code performance and vice versa.
  • For small models, pure RL was competitive with SFT distillation, and combining SFT with RL was superior.
  • Batch and minibatch sizing in the asynchronous setup is sensitive; performance degrades if too many minibatch updates per batch ($n_{\text{batch}} / n_{\text{minibatch}} > 2$) are performed.
  • Different advantage normalization methods (minibatch, group, none) showed no significant impact on performance.
  • Unsuccessful approaches included using partial rewards for code problems (which led to slightly lower final performance) and relying on an entropy bonus term (which was unstable and dataset-dependent; tuning $\varepsilon_{\text{high}}$ was more effective for exploration).

An experimental run training Magistral Medium first with SFT on open-source traces (OpenThoughts, OpenR1) followed by RL demonstrated that RL can further boost performance significantly even on a strong SFT baseline, reaching levels comparable to DeepSeek-R1.

In conclusion, Magistral represents a significant step in applying scalable online RL to enhance LLM reasoning. The paper provides practical insights into algorithm modifications, reward design, infrastructure implementation, and data curation for effective RL training, highlighting that pure RL and RL combined with SFT bootstrapping are powerful approaches for developing reasoning capabilities, including unexpected improvements in multimodal understanding and preservation of other core abilities.
