ReasonFlux Model
- ReasonFlux Model is a framework for hierarchical reasoning that employs reusable thought templates, reinforcement learning for planning, and adaptive inference feedback.
- It improves performance in tasks such as mathematical reasoning, code generation, and process reward modeling, delivering significant accuracy gains over traditional chain-of-thought methods.
- The model uses a three-stage workflow with a thought template library, a navigator LLM, and an inference LLM, dynamically scaling reasoning based on input complexity for optimal efficiency.
ReasonFlux is a family of frameworks for hierarchical reasoning in LLMs, distinguished by the use of reusable thought templates, structured hierarchical planning via reinforcement learning (RL), and adaptive inference-time feedback mechanisms. The suite encompasses systems for mathematical reasoning, code generation, and trajectory-aware process reward modeling, each advancing the state of the art in computational efficiency and empirical accuracy relative to prior chain-of-thought (CoT), tree-of-thought (ToT), and reward modeling baselines (Yang et al., 10 Feb 2025, Wang et al., 3 Jun 2025, Zou et al., 23 Jun 2025).
1. Hierarchical Template-Driven Reasoning in LLMs
ReasonFlux-32B approaches mathematical reasoning via a three-stage hierarchical workflow:
- Thought Template Library: Approximately 500 high-level, domain-agnostic templates, each formalizing abstract solution principles (e.g., “trigonometric substitution,” “invariant principle”) through compact, metadata-rich modules.
- Navigator LLM: A 32B-parameter transformer that leverages hierarchical RL to plan a template trajectory for each input problem, selecting and sequencing optimal templates from the library.
- Inference LLM: Instantiates individual steps by applying structured templates to subproblems, with navigator-driven iterative feedback for correction and refinement.
This Problem → Template Trajectory → Template Instantiation paradigm constrains the combinatorial space explored during reasoning and has been empirically shown to significantly improve both accuracy and computational efficiency over flat CoT or ToT prompts (Yang et al., 10 Feb 2025).
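The three-stage workflow above can be sketched as a minimal pipeline. The function names, the keyword-matching planner, and the string-formatting "instantiation" are illustrative stand-ins: in ReasonFlux both the navigator and the instantiator are LLMs, not rule-based functions.

```python
# Illustrative sketch of the Problem -> Template Trajectory -> Template
# Instantiation workflow. All function bodies are toy stand-ins for LLM calls.

def plan_trajectory(problem, library):
    """Navigator stand-in: select templates whose keywords match the problem."""
    return [t for t in library if any(k in problem for k in t["keywords"])]

def instantiate_step(template, subproblem):
    """Inference-LLM stand-in: apply one template to one subproblem."""
    return f"Applied '{template['name']}' to: {subproblem}"

def reason_flux(problem, library):
    trajectory = plan_trajectory(problem, library)              # planning stage
    steps = [instantiate_step(t, problem) for t in trajectory]  # instantiation
    return trajectory, steps

library = [
    {"name": "Trigonometric Substitution",
     "keywords": ["radical", "sqrt", "irrational"]},
    {"name": "Invariant Principle",
     "keywords": ["invariant", "parity"]},
]
trajectory, steps = reason_flux("minimize sqrt(1 - x**2) + x", library)
```

The point of the structure, not the toy matching rule, is what carries over: planning happens once over compact template metadata, so the expensive per-step generation only runs on an already-pruned search space.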
2. Structure and Principles of the Thought Template Library
Each template anonymizes domain expertise via the following fields:
| Field | Example content |
|---|---|
| Template name | “Trigonometric Substitution” |
| Keywords for retrieval | {"Trigonometric Substitution", "Irrational Function Optimization"} |
| NL description & usage context | Converts irrational expressions to trigonometric form |
| Formal applicability scope | Integrals containing radical expressions |
| List of application steps | Recognize radical → substitute → back-substitute |
| Worked examples | Provided for typical use cases |
Templates are strictly designed for generality and compactness, supporting precise retrieval and cross-domain transfer. Template-augmented inference consistently yields 20–27 percentage point accuracy improvements across algebra, calculus, combinatorics, and geometry benchmarks (Yang et al., 10 Feb 2025).
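A template record with these fields, plus keyword-overlap retrieval, can be sketched as follows. The dataclass field names and the overlap-ranking rule are assumptions for illustration; the paper's exact schema and retriever may differ.

```python
from dataclasses import dataclass, field

# Minimal sketch of a thought-template record mirroring the table's fields.
# Field names and the retrieval rule are illustrative, not the paper's schema.

@dataclass
class ThoughtTemplate:
    name: str
    keywords: set        # retrieval tags
    description: str     # NL description & usage context
    scope: str           # formal applicability scope
    steps: list          # ordered application steps
    examples: list = field(default_factory=list)

def retrieve(query_tags: set, library: list, top_k: int = 1):
    """Rank templates by keyword overlap with the query tags."""
    ranked = sorted(library, key=lambda t: len(t.keywords & query_tags),
                    reverse=True)
    return ranked[:top_k]

tmpl = ThoughtTemplate(
    name="Trigonometric Substitution",
    keywords={"trigonometric substitution", "irrational function optimization"},
    description="Converts irrational expressions to trigonometric form",
    scope="Integrals containing radical expressions",
    steps=["recognize radical", "substitute", "back-substitute"],
)
best = retrieve({"irrational function optimization"}, [tmpl])
```

Keeping the record compact and metadata-rich is what makes retrieval cheap and cross-domain transfer possible: the retriever never needs to read worked examples, only tags.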
3. Hierarchical Reinforcement Learning and Preference Optimization
ReasonFlux employs hierarchical RL to optimize the selection and ordering of templates, formulated as:
- State: the pair (original problem, templates chosen so far).
- Action: the next template to append to the trajectory.
- Trajectory: an ordered sequence of template indices.
- Reward: the accuracy of applying the trajectory, averaged over a set of problems similar to the input.
The objective is the expected trajectory reward. A two-stage finetuning is performed: structure-based supervised fine-tuning (predicting template descriptions/scopes from names/tags) and preference learning using a DPO-style loss on trajectory pairs in which one trajectory attains a higher average reward than the other.
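The DPO-style preference step can be illustrated numerically. The log-probabilities and the beta coefficient below are made-up placeholders; only the loss form (negative log-sigmoid of the reference-adjusted margin between the preferred and dispreferred trajectory) reflects the method.

```python
import math

# Numeric sketch of a DPO-style preference loss on trajectory pairs
# (tau_w preferred over tau_l because it achieved higher average reward).
# All log-prob values and beta are illustrative placeholders.

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# With no deviation from the reference model, the margin is zero and the
# loss is -log(0.5); as the policy favors the winning trajectory more
# strongly than the reference does, the loss falls.
weak = dpo_loss(logp_w=-10.0, logp_l=-9.0, ref_logp_w=-10.0, ref_logp_l=-9.0)
strong = dpo_loss(logp_w=-8.0, logp_l=-12.0, ref_logp_w=-10.0, ref_logp_l=-9.0)
```

Operating at the trajectory level is the key design choice: credit is assigned to whole template plans rather than individual tokens, matching the hierarchical credit assignment described below.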
Algorithmic innovations include hierarchical credit assignment at the trajectory level, curriculum scheduling from short to long template plans, and distillation from frontier solvers into reward underpinnings (Yang et al., 10 Feb 2025).
4. Adaptive Inference Scaling and Feedback Interplay
At test time, ReasonFlux adaptively scales the number and type of templates based on estimated input complexity:
- The navigator proposes a template trajectory.
- Templates are retrieved by keyword/tag and instantiated stepwise by the inference LLM.
- An interactive feedback loop refines trajectory decisions: the navigator inspects outputs and either accepts, rejects, or re-plans residual steps.
- Both the number of templates and the number of feedback rounds are parameterized as a learned affine function of an estimated complexity score.
This dynamic interplay is critical for handling long-horizon problems and enables the system to allocate greater compute to more challenging inputs. Harder problems empirically require more templates and interaction rounds (see Figure 1 in (Yang et al., 10 Feb 2025)).
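The affine compute scaling can be sketched directly. The coefficients and clipping limits below are invented placeholders; in ReasonFlux these parameters are learned, and the complexity score is itself estimated by the navigator.

```python
# Sketch of affine-in-complexity compute budgets: harder problems get more
# templates and more feedback rounds. Coefficients and caps are placeholders;
# in the paper they are learned, not hand-set.

def adaptive_budget(complexity, a_t=4.0, b_t=1.0, a_r=2.0, b_r=1.0,
                    max_templates=8, max_rounds=5):
    """Return (n_templates, n_rounds) as clipped affine functions of
    the estimated complexity score in [0, 1]."""
    n_templates = min(max_templates, max(1, round(a_t * complexity + b_t)))
    n_rounds = min(max_rounds, max(1, round(a_r * complexity + b_r)))
    return n_templates, n_rounds

easy = adaptive_budget(0.1)   # short plan, little feedback
hard = adaptive_budget(0.9)   # longer plan, more feedback rounds
```

The clipping reflects a practical constraint rather than the learned policy: template count is bounded by library coverage, and feedback rounds by latency budget.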
5. Empirical Performance and Ablations
ReasonFlux-32B sets new empirical state-of-the-art results with modest compute (8×A100 GPUs):
| Benchmark | ReasonFlux-32B | o1-preview | DeepSeek-V3 | Gain over baseline |
|---|---|---|---|---|
| MATH | 91.2% | 85.5% | — | +6.7% (relative) over o1-preview |
| AIME 2024 | 56.7% | 44.6% | 39.2% | +27% / +45% (relative) |
| OlympiadBench | 63.3% | — | 55.4% | +7.9 points over DeepSeek-V3 |
Ablation results highlight the necessity of all major components:
- No templates: the same 32B model with flat CoT reaches only 76.6% (MATH) and 9.3% (AIME).
- Templates only: Non-RL template retrieval improves to ≈83% MATH.
- HRL only: RL-planned trajectories without inference feedback yield 88% MATH.
- Full ReasonFlux: All stages combined deliver 91.2% on MATH.
Statistical significance is confirmed over 16 repeated runs for the main comparisons (Yang et al., 10 Feb 2025).
6. ReasonFlux Extensions: Co-Evolving Code and Process Reward Models
ReasonFlux methodology generalizes to code generation (ReasonFlux-Coder) and trajectory-aware process reward modeling (ReasonFlux-PRM):
- ReasonFlux-Coder applies a co-evolutionary RL framework (CURE) to jointly optimize code and unit-test generation, without ground-truth code, via clipped PPO on both coder and tester branches. It demonstrably boosts code and test accuracy, Best-of-N scaling, and agentic coding pipelines relative to class-leading Qwen baselines (Wang et al., 3 Jun 2025).
- ReasonFlux-PRM introduces trajectory-aware process reward models enabling both step-level and template-guided trajectory-level supervision. This allows for better selection of high-quality intermediate reasoning traces, more effective RL reward shaping, and enhanced test-time Best-of-N scaling. Empirical gains of 12.1% (SFT), 4.5% (RL), and 6.3% (test-time BoN) are documented across AIME, MATH500, and GPQA-Diamond benchmarks (Zou et al., 23 Jun 2025).
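Trajectory-level Best-of-N selection, as used by ReasonFlux-PRM at test time, can be sketched as follows. Here `prm_score` is a toy stand-in (mean of hand-assigned per-step scores) for the learned trajectory-aware reward model; the candidate traces and their scores are invented for illustration.

```python
# Sketch of Best-of-N selection with a process reward model. `prm_score`
# is a toy stand-in (mean of per-step scores); ReasonFlux-PRM's scorer is
# a trained model combining step-level and trajectory-level supervision.

def prm_score(trajectory):
    """Toy trajectory-level score: mean over (step_text, step_score) pairs."""
    return sum(score for _, score in trajectory) / len(trajectory)

def best_of_n(candidates):
    """Pick the candidate reasoning trace with the highest PRM score."""
    return max(candidates, key=prm_score)

candidates = [
    [("set up equation", 0.9), ("algebra slip", 0.2), ("answer", 0.4)],
    [("set up equation", 0.9), ("correct algebra", 0.8), ("answer", 0.9)],
]
best = best_of_n(candidates)
```

The same scoring interface serves all three reported uses: filtering traces for SFT data, shaping the RL reward, and ranking candidates at test time.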
7. Practical Considerations, Limitations, and Future Directions
- Compute and Latency: Full training (structure-SFT + trajectory RL) requires about one week on 8×A100s. Inference latency for a moderate Olympiad problem is 3–5 seconds, using 4–8k tokens and multiple template steps.
- Failure modes include template mismatch (navigator picks misaligned template), incomplete library coverage (advanced disciplines unaddressed), and the need to extend template catalogues and retrain for domains outside math/code.
- Scalability: Current libraries focus on high-school/olympiad math; applications to QA, code, and science in ReasonFlux-PRM are effective when domain-specific templates and PRMs are constructed (Zou et al., 23 Jun 2025).
- Open research: adaptive template-library extension, dynamic weighting of step- and trajectory-level reward aggregation in the PRM, and self-supervised generator–PRM co-training are open directions.
ReasonFlux frameworks collectively demonstrate that hierarchical decomposition via reusable templates, combined with trajectory-level RL and adaptive scaling, can dramatically optimize reasoning search space and deliver state-of-the-art results with reduced parameter and compute budgets (Yang et al., 10 Feb 2025, Wang et al., 3 Jun 2025, Zou et al., 23 Jun 2025).