
ReasonFlux Model

Updated 21 February 2026
  • ReasonFlux Model is a framework for hierarchical reasoning that employs reusable thought templates, reinforcement learning for planning, and adaptive inference feedback.
  • It improves performance in tasks such as mathematical reasoning, code generation, and process reward modeling, delivering significant accuracy gains over traditional chain-of-thought methods.
  • The model uses a three-stage workflow with a thought template library, a navigator LLM, and an inference LLM, dynamically scaling reasoning based on input complexity for optimal efficiency.

ReasonFlux is a family of frameworks for hierarchical reasoning in LLMs, distinguished by the use of reusable thought templates, structured hierarchical planning via reinforcement learning (RL), and adaptive inference-time feedback mechanisms. The suite encompasses systems for mathematical reasoning, code generation, and trajectory-aware process reward modeling, each advancing the state of the art in computational efficiency and empirical accuracy relative to prior chain-of-thought (CoT), tree-of-thought (ToT), and reward modeling baselines (Yang et al., 10 Feb 2025, Wang et al., 3 Jun 2025, Zou et al., 23 Jun 2025).

1. Hierarchical Template-Driven Reasoning in LLMs

ReasonFlux-32B approaches mathematical reasoning via a three-stage hierarchical workflow:

  1. Thought Template Library: Approximately 500 high-level, domain-agnostic templates $\mathcal{D}_{\mathrm{temp}} = \{T_1, \ldots, T_m\}$, each formalizing abstract solution principles (e.g., “trigonometric substitution,” “invariant principle”) through compact, metadata-rich modules.
  2. Navigator LLM ($\pi_\theta$): A 32B-parameter transformer leverages hierarchical RL to plan a template trajectory for each input problem $x$, selecting and sequencing optimal templates from the library.
  3. Inference LLM ($\pi_{\mathrm{inf}}$): Instantiates individual steps by applying structured templates to subproblems, with navigator-driven iterative feedback for correction and refinement.
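
The three-stage loop above can be sketched as follows. This is a toy illustration only: the function names, the template entries, and the keyword-based navigator are hypothetical stand-ins for the 32B navigator and inference models, not the paper's implementation.

```python
# Illustrative sketch of the Problem -> Template Trajectory -> Instantiation
# workflow. All names and logic here are placeholders, not ReasonFlux's API.

TEMPLATE_LIBRARY = {
    "trig_substitution": "Substitute x = R*sin(theta) to remove sqrt(R^2 - x^2).",
    "invariant_principle": "Find a quantity preserved by every allowed move.",
}

def navigate(problem: str) -> list[str]:
    """Navigator stand-in: plan a template trajectory for the problem.

    The real navigator is a 32B RL-trained policy; this toy version
    keys off surface keywords purely for demonstration.
    """
    trajectory = []
    if "sqrt" in problem:
        trajectory.append("trig_substitution")
    if "invariant" in problem:
        trajectory.append("invariant_principle")
    return trajectory or ["invariant_principle"]

def instantiate(problem: str, template_name: str) -> str:
    """Inference-LLM stand-in: apply one template to the (sub)problem."""
    return f"[{template_name}] {TEMPLATE_LIBRARY[template_name]}"

def reason_flux(problem: str) -> list[str]:
    """Problem -> template trajectory -> stepwise instantiation."""
    return [instantiate(problem, t) for t in navigate(problem)]

steps = reason_flux("Maximize sqrt(4 - x^2) + x")
```

The key structural point is that the navigator commits to a trajectory of templates before any step-level generation happens, which is what constrains the search space relative to flat CoT.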

This Problem → Template Trajectory → Template Instantiation paradigm constrains the combinatorial space explored during reasoning and has been empirically shown to significantly improve both accuracy and computational efficiency over flat CoT or ToT prompts (Yang et al., 10 Feb 2025).

2. Structure and Principles of the Thought Template Library

Each template $T_i$ encapsulates domain expertise via the following fields:

| Field | Description | Example |
|---|---|---|
| $T_{\mathrm{nam}}$ | Template name | “$\sqrt{R^2 - x^2}$-Type Trigonometric Substitution” |
| $T_{\mathrm{tag}}$ | Keywords for retrieval | {"Trigonometric Substitution", "Irrational Function Optimization"} |
| $T_{\mathrm{des}}$ | NL description & usage context | Converts $\sqrt{R^2 - x^2}$ to trigonometric form |
| $T_{\mathrm{sco}}$ | Formal applicability scope | Integrals of the form $\int\sqrt{R^2 - x^2}\,dx$ |
| $T_a$ | List of application steps | Recognize radical → substitute $x = R\sin\theta$ → back-substitute |
| $T_{\mathrm{exa}}$ | Worked examples | Provided for typical use cases |

Templates are strictly designed for generality and compactness, supporting precise retrieval and cross-domain transfer. Template-augmented inference consistently yields 20–27 percentage point accuracy improvements across algebra, calculus, combinatorics, and geometry benchmarks (Yang et al., 10 Feb 2025).
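A minimal sketch of one library entry and tag-based retrieval is given below. The field names mirror the paper's notation (name, tags, description, scope, application steps, examples), but the dataclass layout and the intersection-based retriever are illustrative assumptions, not the paper's retrieval system.

```python
from dataclasses import dataclass, field

# Sketch of a thought-template record with the six fields described above.
# The retrieval logic (tag-set intersection) is an illustrative placeholder.

@dataclass
class ThoughtTemplate:
    nam: str                # template name
    tag: set                # keywords for retrieval
    des: str                # natural-language description / usage context
    sco: str                # formal applicability scope
    app: list               # ordered application steps
    exa: list = field(default_factory=list)  # worked examples

TRIG_SUB = ThoughtTemplate(
    nam="sqrt(R^2 - x^2)-Type Trigonometric Substitution",
    tag={"Trigonometric Substitution", "Irrational Function Optimization"},
    des="Converts sqrt(R^2 - x^2) into trigonometric form",
    sco="Integrals of the form int sqrt(R^2 - x^2) dx",
    app=["recognize radical", "substitute x = R sin(theta)", "back-substitute"],
)

def retrieve(library, query_tags):
    """Return templates whose tag set intersects the query, best match first."""
    hits = [t for t in library if t.tag & query_tags]
    return sorted(hits, key=lambda t: len(t.tag & query_tags), reverse=True)

matches = retrieve([TRIG_SUB], {"Trigonometric Substitution"})
```

Keeping templates compact and metadata-rich is what makes this kind of precise keyword/tag retrieval feasible at inference time.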

3. Hierarchical Reinforcement Learning and Preference Optimization

ReasonFlux employs hierarchical RL to optimize the selection and ordering of templates, formulated as:

  • State $s_t$: the pair $(x, \{T_1, \dots, T_{t-1}\})$, i.e., the original problem together with the templates chosen so far.
  • Action $a_t$: the next template $T_{a_t}$ to add to the trajectory.
  • Trajectory $\tau = (a_1, \dots, a_n)$: ordered sequence of template indices.
  • Reward $R(\tau)$: averaged accuracy of $\pi_{\mathrm{inf}}$ applying $\tau$ to a set $\mathcal{X}_{\mathrm{sim}}$ of similar problems.
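The trajectory reward can be sketched directly from its definition: average accuracy of the inference model over a set of similar problems. Here `solve_with_trajectory` is a hypothetical stand-in for running the inference LLM with a fixed template trajectory.

```python
# Sketch of the trajectory reward R(tau): mean accuracy of the inference
# model applying trajectory tau across similar problems X_sim.
# `solve_with_trajectory` is a placeholder, not the paper's evaluator.

def solve_with_trajectory(problem, trajectory):
    # Toy evaluator: "succeeds" iff the trajectory is non-empty.
    return len(trajectory) > 0

def trajectory_reward(trajectory, similar_problems):
    """R(tau) = (1/|X_sim|) * sum of per-problem correctness."""
    if not similar_problems:
        return 0.0
    correct = sum(solve_with_trajectory(p, trajectory) for p in similar_problems)
    return correct / len(similar_problems)
```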

The objective is the expected trajectory reward $J(\theta) = \mathbb{E}_{x,\,\tau \sim \pi_\theta(\cdot \mid x)}[R(\tau)]$. Fine-tuning proceeds in two stages: structure-based supervised fine-tuning (predicting template descriptions and scopes from names and tags), followed by preference learning with a DPO-style loss on trajectory pairs $(\tau^+, \tau^-)$ where $R(\tau^+) > R(\tau^-)$.
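The DPO-style preference step can be sketched as below. The scalar log-probabilities stand in for summed per-template log-probs of a trajectory under the policy and a frozen reference model; the loss form is standard DPO, used here as an assumed instantiation of the paper's "DPO-style" description.

```python
import math

# Sketch of a DPO-style preference loss over trajectory pairs (tau+, tau-),
# where tau+ achieved the higher reward R(tau). Inputs are trajectory
# log-probabilities under the trained policy and a fixed reference policy.

def dpo_trajectory_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """-log sigmoid(beta * [(logp+ - ref+) - (logp- - ref-)])."""
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy assigns relatively more probability to the preferred trajectory than the reference does, the margin grows and the loss falls below the indifference value of $\log 2$.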

Algorithmic innovations include hierarchical credit assignment at the trajectory level, curriculum scheduling from short to long template plans, and distillation from frontier solvers to inform the reward signal (Yang et al., 10 Feb 2025).

4. Adaptive Inference Scaling and Feedback Interplay

At test time, ReasonFlux adaptively scales the number and type of templates based on estimated input complexity:

  • The navigator $\pi_\theta$ proposes a trajectory $\mathbb{T}^* = \{T^*_1, \ldots, T^*_n\}$.
  • Templates are retrieved by keyword/tag and instantiated stepwise via $\pi_{\mathrm{inf}}$.
  • An interactive feedback loop refines trajectory decisions: the navigator inspects outputs and either accepts, rejects, or re-plans residual steps.
  • Both the number of templates ($n$) and the rounds of feedback ($R$) are scaled as a learned affine function of a complexity score $c(x)$.
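The affine complexity-to-budget mapping can be sketched as follows. The coefficients and caps here are illustrative placeholders, not the learned values; the point is only the shape of the mapping from $c(x)$ to $(n, R)$.

```python
# Sketch of adaptive inference scaling: templates n and feedback rounds R
# as affine functions of a complexity score c(x) in [0, 1], clipped to
# sensible ranges. Coefficients are illustrative, not learned values.

def scale_budget(complexity, a_n=2.0, b_n=1.0, a_r=1.0, b_r=0.0,
                 max_templates=10, max_rounds=5):
    """Map complexity score c(x) to (n templates, R feedback rounds)."""
    n = min(max_templates, max(1, round(a_n * complexity + b_n)))
    r = min(max_rounds, max(0, round(a_r * complexity + b_r)))
    return n, r
```

Under this scheme easy inputs get a single template and no feedback rounds, while harder inputs receive proportionally more compute, matching the empirical observation that harder problems need longer plans and more interaction.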

This dynamic interplay is critical for handling long-horizon problems and enables the system to allocate greater compute to more challenging inputs. Harder problems empirically require more templates and interaction rounds (see Figure 1 in (Yang et al., 10 Feb 2025)).

5. Empirical Performance and Ablations

ReasonFlux-32B sets new empirical state-of-the-art results with modest compute (8×A100 GPUs):

| Benchmark | ReasonFlux-32B | o1-preview | DeepSeek-V3 | Relative Gain |
|---|---|---|---|---|
| MATH | 91.2% | 85.5% | — | +6.7% over o1-preview |
| AIME 2024 | 56.7% | 44.6% | 39.2% | +27% / +45% |
| OlympiadBench | 63.3% | 55.4% | — | +7.9% |

Ablation results highlight the necessity of all major components:

  • No templates: a flat 32B CoT baseline reaches only 76.6% (MATH) and 9.3% (AIME).
  • Templates only: Non-RL template retrieval improves to ≈83% MATH.
  • HRL only: RL-planned trajectories without inference feedback yield 88% MATH.
  • Full ReasonFlux: All stages combined deliver 91.2% on MATH.

Statistical significance is confirmed over 16 repeated runs, with $p<0.01$ for the main comparisons (Yang et al., 10 Feb 2025).

6. ReasonFlux Extensions: Co-Evolving Code and Process Reward Models

ReasonFlux methodology generalizes to code generation (ReasonFlux-Coder) and trajectory-aware process reward modeling (ReasonFlux-PRM):

  • ReasonFlux-Coder applies a co-evolutionary RL framework (CURE) to jointly optimize code and unit-test generation, without ground-truth code, via clipped PPO on both coder and tester branches. It demonstrably boosts code and test accuracy, Best-of-N scaling, and agentic coding pipelines relative to class-leading Qwen baselines (Wang et al., 3 Jun 2025).
  • ReasonFlux-PRM introduces trajectory-aware process reward models enabling both step-level and template-guided trajectory-level supervision. This allows for better selection of high-quality intermediate reasoning traces, more effective RL reward shaping, and enhanced test-time Best-of-N scaling. Empirical gains of 12.1% (SFT), 4.5% (RL), and 6.3% (test-time BoN) are documented across AIME, MATH500, and GPQA-Diamond benchmarks (Zou et al., 23 Jun 2025).
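
Test-time Best-of-N with a trajectory-aware PRM can be sketched as below. The mean aggregation of step scores into a trajectory score is an assumed simplification; ReasonFlux-PRM's actual reward combines step-level and trajectory-level terms with learned weighting.

```python
# Sketch of Best-of-N selection with a process reward model: each candidate
# is a (final_answer, reasoning_steps) pair; step_scorer assigns a score in
# [0, 1] per step, and steps are aggregated by mean (an illustrative rule).

def best_of_n(candidates, step_scorer):
    """Return the final answer of the candidate with the best trace score."""
    def traj_score(steps):
        return sum(map(step_scorer, steps)) / len(steps) if steps else 0.0
    return max(candidates, key=lambda c: traj_score(c[1]))[0]
```

The benefit over answer-only reward models is that a candidate with a sound intermediate trace can be preferred even when several candidates reach plausible-looking final answers.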

7. Practical Considerations, Limitations, and Future Directions

  • Compute and Latency: Full training (structure-SFT + trajectory RL) requires about one week on 8×A100s. Inference latency for a moderate Olympiad problem is 3–5 seconds, using 4–8k tokens and multiple template steps.
  • Failure modes include template mismatch (navigator picks misaligned template), incomplete library coverage (advanced disciplines unaddressed), and the need to extend template catalogues and retrain for domains outside math/code.
  • Scalability: Current libraries focus on high-school/olympiad math; applications to QA, code, and science in ReasonFlux-PRM are effective when domain-specific templates and PRMs are constructed (Zou et al., 23 Jun 2025).
  • Open research: Adaptive template library extension, dynamic weighting of reward aggregation ($\alpha, \beta$ in PRM), and self-supervised generator–PRM co-training are open directions.

ReasonFlux frameworks collectively demonstrate that hierarchical decomposition via reusable templates, combined with trajectory-level RL and adaptive scaling, can dramatically optimize reasoning search space and deliver state-of-the-art results with reduced parameter and compute budgets (Yang et al., 10 Feb 2025, Wang et al., 3 Jun 2025, Zou et al., 23 Jun 2025).
