$\textbf{Re}^{2}$: Unlocking LLM Reasoning via Reinforcement Learning with Re-solving
Abstract: Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning performance of LLMs by increasing test-time compute. However, even after extensive RLVR training, such models still tend to generate unnecessary and low-quality steps in their chain-of-thought (CoT), leading to inefficient overthinking and lower answer quality. We show that when the initial direction or quality of the CoT is suboptimal, the model often fails to reach the correct answer, even after generating several times more tokens than when the CoT is well-initialized. To address this, we introduce Reinforcement Learning with Re-solving (Re$^2$), in which LLMs learn to flexibly abandon unproductive reasoning paths and restart the solution process when necessary, rather than always committing to a final answer. Re$^2$ applies pure reinforcement learning without any preliminary supervised fine-tuning, successfully amplifying the rare redo behavior in vanilla models from only 0.5% to over 30%. This leads to substantial performance gains over standard RLVR under the same training compute budget, as well as notable improvements in test-time performance as the number of samples increases.
Explain it Like I'm 14
What is this paper about?
This paper is about teaching LLMs—the kind of AI that writes and reasons with text—to think more like good problem-solvers. The authors introduce a new training method called “Re2” (short for “Reinforcement Learning with Re-solving”) that helps a model notice when it’s going down a bad path and start over, instead of stubbornly continuing a weak plan. Think of it like hitting “reset” when your approach to a tricky math problem clearly isn’t working.
What were they trying to find out?
The researchers focused on two simple questions:
- Do LLMs actually do better when they “think” longer, or can extra steps sometimes make things worse?
- Can we train an LLM to abandon a bad plan early and restart, so it’s more likely to reach the right answer?
How did they do it?
They used reinforcement learning (RL), which you can think of as a “practice with points” system:
- When the model solves a problem correctly, it gets a point (reward).
- When it gets it wrong, it gets zero.
- New in this paper: the model also gets an option to “re-solve”—to stop and start over. If it chooses to restart, it gets a reward based on how likely a fresh attempt is to succeed (estimated from other attempts).
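The three-way scoring above can be sketched as a tiny reward function. This is a hedged illustration: `est_restart_success` stands in for the paper's out-of-group estimate of from-scratch accuracy, and the names are not taken from the authors' code.

```python
def reward(action, is_correct=None, est_restart_success=None):
    """Three-way reward: 1 for a correct final answer, 0 for an
    incorrect one, and the estimated from-scratch success rate
    when the model chooses to re-solve."""
    if action == "resolve":
        # Estimated probability that a fresh attempt succeeds,
        # e.g. accuracy of out-of-group samples on this question.
        return est_restart_success
    return 1.0 if is_correct else 0.0
```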
To train this behavior, the authors:
- Gave the model many problems and had it write partial solutions (like the first 20–80% of a plan).
- From each partial plan, they asked the model to continue in different ways: keep going to a final answer or decide to re-solve.
- They grouped these continuations by the same starting partial plan, then scored them. Correct answers got 1 point. Wrong answers got 0. Re-solve decisions got a “smart” reward equal to the model’s estimated chance of succeeding if it starts from scratch.
- Over time, the model learned a simple rule of thumb: if your current path looks promising, finish it; if it looks messy or wrong, restart.
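The grouping-and-scoring step can be sketched as mean/standard-deviation normalization within each group, in the style of group-based methods like GRPO/DAPO that the paper builds on. The paper's exact advantage estimator may differ; `group_advantages` is illustrative only.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Normalize each continuation's reward against its group's
    mean and standard deviation, so updates favor continuations
    that beat the group baseline for the same partial plan."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One group: two correct finishes, one wrong, one resolve valued at 0.5
advs = group_advantages([1.0, 1.0, 0.0, 0.5])
```

Continuations that beat the group average get positive advantages (their behavior is reinforced); those below the average, including low-value resolves, get negative ones.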
Key terms in everyday language:
- Chain of Thought (CoT): the step-by-step explanation the model writes while solving. Like showing your work on a math test.
- Reinforcement Learning (RL): training by trial and error with points. Good actions get more points and become more likely next time.
- Re-solve: the model’s “reset” button—drop the current plan and try again from the beginning.
What did they discover?
The authors found several important things:
- Longer “thinking” isn’t always better. For the same question, longer step-by-step answers were often less accurate when the early steps were wrong. In other words, if the beginning is bad, more steps usually just dig a deeper hole.
- Early mistakes are sticky. If a model starts off in the wrong direction, it rarely recovers—even if you let it write a lot more steps.
- Teaching models to restart helps a lot. Using Re2, the model learned to detect unproductive paths and restart. The “redo” behavior jumped from about 0.5% (almost never) to over 30% (fairly common when needed).
- Better scores across many tests. On math and science benchmarks—from easier word problems (GSM8K) to hard competitions (AIME 2024/2025) and advanced science questions (GPQA-Diamond)—models trained with Re2 consistently beat a strong existing RL method (called DAPO), under the same training budget.
- Scales well with more tries. When you let the model try multiple times and pick the best answer (like majority voting), Re2 keeps improving as you add more samples. Standard methods tended to level off earlier.
- Training behavior looks sensible. Early in training, the model learns to re-solve more often (to avoid bad paths). As it gets better, it re-solves a bit less and gives correct answers more often.
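The majority-voting setup mentioned above can be sketched as follows (a generic illustration, not the paper's evaluation code). Under Re2, a sample that ends in a restart rather than a final answer simply carries no vote.

```python
from collections import Counter

def majority_vote(samples):
    """Pick the most common final answer across samples; samples
    that ended in a re-solve decision (None) carry no vote."""
    answers = [a for a in samples if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]

# Five samples: three agree on 42, one says 41, one restarted (None)
majority_vote([42, 42, 41, None, 42])
```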
Why does this matter?
- Smarter, more flexible problem solving: Just like a good student, the model learns when to stop and rethink instead of forcing a shaky plan to the end. This leads to higher accuracy and clearer reasoning.
- Less “overthinking” and “underthinking”: The model wastes fewer steps and avoids making up confused explanations just to produce an answer.
- Better use of extra compute: If you can afford more attempts at test time, Re2 gets more benefit from them, giving stronger final answers.
- Works across tasks and model sizes: The approach helps different kinds of models (small to medium size, base or instruction-tuned) on math and science problems.
A simple trade-off to know: when you only allow a tiny number of attempts at test time, the model might spend some of those attempts restarting, which can slightly lower accuracy compared to a method that always forces a final answer. But as soon as you allow a few more attempts, Re2 tends to pull ahead.
In short, this paper shows that teaching an AI to recognize a bad plan and start over—rather than blindly pushing forward—makes it a stronger, more reliable reasoner.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, focused list of what remains missing, uncertain, or unexplored in the paper, framed as concrete items future work can address:
- Calibration and reliability of the resolve-reward estimator: the expected “re-solve-from-scratch” accuracy is approximated using out-of-group samples within the same batch; its variance, bias, and sensitivity to n/m (prefixes/continuations), sampling temperature, and dataset difficulty are not quantified.
- Theoretical guarantees are absent: no analysis of convergence, optimality, or avoidance of degenerate policies (e.g., always resolve or never resolve) under the proposed three-way reward and group-wise advantage scheme.
- Impact of hyperparameters is unstudied: no ablations on n, m, clipping thresholds, learning rate, maximum redo rounds R, or the prefix truncation distribution (uniform up to 0.8), leaving optimal settings and robustness unclear.
- Reward-signal coupling to batch structure: using out-of-group continuations to value resolve actions may leak cross-sample information, introduce non-stationary targets, or encourage batch-dependent behaviors; scaling this estimator to larger batches or distributed training is not examined.
- Filtering “degenerate groups” (all identical outcomes) may bias learning by discarding hard or easy prefixes; the effect on stability, sample efficiency, and generalization is not analyzed.
- Compute fairness and efficiency are under-specified: comparisons mix samples and redo attempts without controlling for total tokens, latency, or wall-clock time; a token-budget–controlled evaluation and break-even analysis of accuracy vs. compute is missing.
- Evaluation protocol inconsistencies: single-sample evaluation restarts until a final answer is produced, while test-time scaling counts redo attempts as samples (reducing valid answers for majority vote); the impact of this discrepancy on reported gains is not quantified.
- Token cost and latency overheads are unreported: average tokens per solved problem, distribution of redo counts, and the tail behavior (e.g., long redo chains) are not provided.
- Termination criteria and failure modes at inference are not explored: risks of resolve loops, excessive restarts, or failure to ever produce an answer (and safeguards against them) remain open.
- Re-solving only restarts from scratch; there is no mechanism to backtrack to a good earlier prefix or salvage partial progress. How resolve compares to learned backtracking or local repair is not evaluated.
- Generality beyond verifiable-answer math is unclear: the method assumes verifiable 0/1 rewards (often integer answers). How to extend to open-ended, long-form, or partially verifiable tasks (proofs, explanations, code with flaky tests) is not addressed.
- Domain and metric coverage are limited: apart from math and one scientific QA set (GPQA-Diamond), broader domains (e.g., coding, multi-hop QA, planning) and process-level metrics (faithfulness, logical consistency, step validity) are not assessed.
- GPQA reward specification is unclear: how correctness is verified (e.g., multiple choice vs. free-form, parsing) and how redo is detected/valued on non-numeric tasks are not detailed.
- Prompt/template dependence is not quantified: the reliance on a specialized “redo” prompting template and textual markers raises questions about robustness across prompts, instruction styles, and languages; multilingual generalization is untested.
- Model-scale limits: results are on 3B–14B models; whether gains hold, diminish, or change qualitatively for larger LLMs remains unknown.
- Baseline breadth is narrow: comparisons focus on DAPO; stronger baselines (e.g., GRPO, VAPO, DLER/length-penalty RL), test-time strategies (self-consistency variants, tree search), and backtracking/critique-based methods are not included.
- Interaction with SFT preconditioning is unstudied: Re2 uses “pure RL” without SFT; whether combining with SFT (or verifier/process rewards) yields better or more stable training is open.
- Confidence and introspection signals are not analyzed: what features trigger resolve, how well confidence is calibrated, and whether the model learns reliable early-warning signals remain unexplored.
- Data contamination and deduplication are insufficiently audited: while AIME25 is post-training, the AoPS-derived training set may overlap with other evals; a thorough contamination check is not reported.
- Robustness to distribution shift and adversarial inputs is not evaluated: whether resolve helps or hurts under shifts, noisy statements, or adversarial prompts is unknown.
- Choice of R (max redo rounds) is heuristic: the effect of R on training dynamics, inference behavior, and compute/accuracy trade-offs is not characterized; the derivation of the resolve reward (with R) is deferred to the appendix without empirical validation.
- Training on prefixes truncated to ≤80% omits near-complete chains: the model may not learn when to resolve late in a solution; generalization to late-stage resolve decisions is uncertain.
- Detection/parsing of redo actions is brittle: reliance on string cues (“redo the question”) and boxed answers may fail under varied formatting or languages; the robustness of action parsing and the error rate are not quantified.
- Seed variance and statistical rigor are limited: number of runs, confidence intervals, and significance testing methodology (beyond p<0.05) are not detailed; reproducibility across seeds and hardware is unclear.
- Human-centric impacts are unmeasured: frequent restarts may degrade user experience or perceived competence; user preference, alignment with RLHF signals, and UX-informed policies for resolve are open questions.
Practical Applications
Below is a focused synthesis of practical, real-world applications enabled by the paper’s Re2 framework (reinforcement learning with re-solving), which trains LLMs to abandon unproductive chains of thought and restart when needed. Applications are grouped by deployment horizon and annotated with sectors, candidate tools/workflows, and key assumptions/dependencies.
Immediate Applications
- Software engineering — Code generation and debugging assistants
- What: Improve pass@1 and pass@k by allowing the model to abandon flawed partial implementations and “re-solve” from scratch when unit tests, static analyzers, or self-checks fail.
- Tools/Workflow: IDE plugins (VS Code/JetBrains) with Re2-trained backends; integration with test-driven synthesis pipelines; CI bots that trigger re-solve on early anti-pattern detection.
- Assumptions/Dependencies: Availability of verifiable rewards via unit tests and linters; tolerance for increased inference tokens; access to RL training pipeline and compute.
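As one hedged sketch of such an integration, generation can be wrapped in a capped re-solve loop, where `generate` and `run_tests` are placeholders for the model call and the project's test runner (both are assumptions, not part of the paper's tooling):

```python
def solve_with_resolve(generate, run_tests, max_redos=3):
    """Regenerate from scratch when the candidate fails its
    verifier, up to a fixed redo budget."""
    attempt = None
    for _ in range(max_redos + 1):
        attempt = generate()       # fresh attempt, no salvage
        if run_tests(attempt):     # verifiable reward signal
            return attempt
    return attempt                 # best effort after budget
```

The redo cap matters in practice: without it, a hard problem could trigger unbounded retries.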
- Data/analytics — SQL generation and data transformation in ETL/ELT
- What: Restart query/transform synthesis when early prefixes fail schema checks or violate data quality constraints.
- Tools/Workflow: Data platform copilots (dbt, Snowflake, BigQuery) with “resolve-aware” policies; automatic reruns gated by schema validators.
- Assumptions/Dependencies: Reliable verifiers (schema checks, sample evaluations); additional inference budget.
- Customer support and process automation — Agentic workflows with restart-aware planning
- What: For multi-step procedures (returns, refunds, troubleshooting), agents abandon plans when early steps contradict business rules or fail guardrails, then replan.
- Tools/Workflow: Agent frameworks (LangChain, AutoGen) with a “Resolve” action and critics; logging and analytics to track resolve rates.
- Assumptions/Dependencies: Clear programmatic verifiers (policy engines/rule checks); careful calibration to avoid excessive restarts.
- Education — Math tutoring and stepwise feedback
- What: Tutors that explicitly backtrack from flawed lines of reasoning and model metacognition (knowing when to restart), improving solution quality and student understanding.
- Tools/Workflow: Math/logic problem solvers (GSM8K/AIME-like); UI that visualizes restarts; configurable “re-solve” thresholds.
- Assumptions/Dependencies: Domains with checkable answers; alignment with pedagogical norms.
- Scientific and technical Q&A — Verified question answering
- What: For grad-level STEM questions where answers are checkable (symbolic/numeric), models restart if early derivations contradict constraints.
- Tools/Workflow: QA systems with CAS/SMT solvers or numeric evaluators as verifiers; majority-vote ensembles that include re-solve outputs.
- Assumptions/Dependencies: Strong verifiers or gold answers; increased test-time sampling to realize gains.
- Documentation and report generation — Consistency-first drafting
- What: When early sections create contradictions (e.g., mismatched figures, tables, or definitions), the system re-solves structure rather than patching locally.
- Tools/Workflow: Structured authoring assistants with validators (cross-reference checks, figure/table consistency); “restart outline” capability.
- Assumptions/Dependencies: Automated consistency checks; willingness to trade time for correctness.
- Cloud inference offerings — “Resolve mode” as a serving option
- What: Expose a deployment flag that enables resolve-aware sampling with budget caps, delivering higher accuracy when users can afford extra tokens.
- Tools/Workflow: Inference gateways with compute budgets (max redo rounds R); logging of resolve/correct/incorrect proportions for SLAs.
- Assumptions/Dependencies: User acceptance of higher latency; robust budget control and guardrails.
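One illustrative way a gateway could enforce such budgets, combining a redo cap with a hard token budget (the `serve` interface and parameter values are assumptions, not from the paper):

```python
def serve(generate, max_redos=2, token_budget=8192):
    """Resolve-aware sampling under hard caps: retry when the
    model resolves (returns no answer) until either the redo
    cap or the token budget is exhausted."""
    used, answer = 0, None
    for _ in range(max_redos + 1):
        answer, tokens = generate()  # (final_answer_or_None, token_count)
        used += tokens
        if answer is not None or used >= token_budget:
            break
    return answer, used
```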
- Academic research — Training/evaluation methodology for reasoning
- What: Use Re2’s groupwise advantage and resolve-reward design to train small/mid-size models on verifiable tasks; analyze early-prefix quality.
- Tools/Workflow: Adoption of the provided open-source implementation; benchmark suites (AIME, AMC, GSM8K, GPQA) with pass@k and resolve metrics.
- Assumptions/Dependencies: Reward verifiability; reproducible compute setup; long-sequence capability (8k–16k tokens).
- Governance/safety checklists — “Don’t bluff” operational rule
- What: In high-stakes or policy-constrained settings (e.g., internal policy Q&A), prefer restart/abstain over low-confidence guesses.
- Tools/Workflow: Embed resolve/abstain policies in safety layers; report “redo rate” and “abstain rate” as quality metrics.
- Assumptions/Dependencies: Institutional willingness to accept abstentions; clarity on confidence thresholds.
Long-Term Applications
- Healthcare — Clinical reasoning with structured re-analysis
- What: Decision support that restarts differential reasoning when early steps contradict labs/guidelines; avoids compounding initial errors.
- Tools/Workflow: Resolve-aware CDSS connected to evidence checkers, guideline engines; human-in-the-loop verification.
- Assumptions/Dependencies: Rigorous validation and regulatory approval; verifiers for medical logic; strong accountability and audit trails.
- Robotics and embodied agents — Replanning under failure signals
- What: Detect flawed plan prefixes from sensor/model mismatches and replan early rather than persisting.
- Tools/Workflow: Hierarchical planners where LLMs plan and RL controllers execute; resolve conditioned on state-estimation/constraint checks.
- Assumptions/Dependencies: Reliable simulators or execution monitors as verifiers; real-time compute constraints.
- Finance — Quant analysis and reporting with self-correction
- What: Restart analyses when early assumptions break (e.g., inconsistent cash flows, invalid model constraints) to reduce “confidently wrong” reports.
- Tools/Workflow: Integration with validation suites (balance checks, reconciliation); audit logs that capture resolve cycles.
- Assumptions/Dependencies: Domain-specific verifiers; governance acceptance of dynamic recomputation costs.
- Enterprise planning and forecasting — Scenario generation with restart-aware logic
- What: Re-solve demand/supply plans when interim constraints or KPIs are violated, reducing local fixes that increase systemic error.
- Tools/Workflow: Resolve-aware planning agents with constraint-programming verifiers; policy-tuned redo budgets.
- Assumptions/Dependencies: Formalized constraint sets; compute budgets; change management for planners.
- Multimodal reasoning — Vision-language and tool-use with early reset
- What: For tasks like diagram understanding or chart QA, re-run perception + reasoning when early extractions conflict with downstream checks.
- Tools/Workflow: Pipelines combining OCR/vision models with symbolic validators; resolve actions that re-extract or re-parse input.
- Assumptions/Dependencies: Cross-modal verifiers; cost of reprocessing images/videos.
- Open-ended generation with learned critics — Resolve without hard verifiers
- What: Extend re-solve to creative or less-structured tasks using critic models (learned verifiers) that estimate “expected success from scratch.”
- Tools/Workflow: Critique/rewrite loops where “resolve” triggers a full rewrite; reward shaping via preference or process rewards.
- Assumptions/Dependencies: High-quality critics; risk of critic bias; need for careful RLHF/RLVR design.
- Standards and policy — “Restart over guessing” requirements for regulated AI
- What: Sector guidelines that encourage restart/abstain behaviors instead of low-confidence answers in safety-critical uses (health, legal, aviation).
- Tools/Workflow: Conformance tests measuring resolve rates, early-prefix degradation, and compute budget adherence.
- Assumptions/Dependencies: Consensus on metrics; industry adoption; enforcement mechanisms.
- Compute orchestration — Dynamic test-time scaling at the system level
- What: Schedulers that allocate more compute to samples that trigger resolve (indicating hard cases), optimizing fleet-wide cost–accuracy tradeoffs.
- Tools/Workflow: Serving layers with per-request adaptive budgets; telemetry on resolve-driven token use vs. accuracy.
- Assumptions/Dependencies: Accurate hardness/resolve signals; cost controls; fair-use constraints.
- Small-model parity — Teaching compact models to match larger reasoning performance
- What: Use Re2 to enhance smaller models’ reasoning by avoiding bad trajectories, improving accuracy per token.
- Tools/Workflow: RL training for 1–8B parameter models with verifiable tasks (math/code); distillation of re-solve behavior.
- Assumptions/Dependencies: Quality and diversity of verifiable training data; stable RL training.
- Hybrid search–reason systems — Integrating re-solve with tree search and majority voting
- What: Combine early restart policies with MCTS/tree-of-thought search to prune bad branches aggressively, improving sample efficiency.
- Tools/Workflow: Search controllers that score prefixes and trigger re-solve; process-reward-guided exploration.
- Assumptions/Dependencies: Reliable scoring functions; engineering for long-context, multi-branch inference.
Notes on feasibility across applications:
- Verifiable rewards are the main enabler today (math, code, structured QA). For open-ended tasks, learned critics or preference models substitute but add uncertainty.
- Re2 tends to consume more tokens at inference due to redo attempts; benefits grow with test-time sampling and may be muted when strict latency/compute caps apply.
- Domain transfer beyond math/science requires domain-specific verifiers, critics, or process rewards.
- Operationally, resolve behavior must be tuned to avoid infinite retries and ensure user-facing responsiveness (e.g., cap redo rounds R and budget tokens).
Glossary
- Advantage (group-wise): In RL, a normalized estimate of how much better an action is than a baseline within a group of samples, used to weight policy updates. "we compute group-wise advantages and update the policy parameters following DAPO (Yu et al., 2025)."
- AIME (American Invitational Mathematics Examination): A challenging U.S. high-school mathematics contest used as a reasoning benchmark for LLMs. "AIME 2024 (MAA Committees) contains 30 challenging problems"
- AMC (American Mathematics Competitions): A set of math contests (e.g., AMC 10/12) providing benchmark problems of lower difficulty than AIME. "AMC 2023 (AI-MO, 2024) consists of 40 problems covering algebra, geometry, number theory, and combinatorics."
- Chain-of-thought (CoT): A prompting and reasoning format where the model generates intermediate steps before the final answer. "such models still tend to generate unnecessary and low-quality steps in their chain-of-thought (CoT),"
- Clipping thresholds: Bounds on the importance-sampling ratio in policy optimization to stabilize training updates. "The clipping thresholds $\epsilon_{\text{low}}$ and $\epsilon_{\text{high}}$ are hyperparameters used to bound the importance sampling ratio for stable optimization."
- Data contamination: Unintended overlap between training/evaluation data that inflates reported performance. "it reduces the risk of contamination from pretraining or post-training data."
- DAPO: A group-based RLVR method for training reasoning-capable LLMs via sampled trajectories and correctness rewards. "recent RLVR methods such as DAPO (Yu et al., 2025)"
- End-reward paradigm (0/1 end-reward): An RL setup where only the final outcome receives reward (1 if correct, 0 otherwise), without shaping intermediate steps. "the standard 0/1 end-reward paradigm in RLVR."
- GPQA-Diamond: The most difficult subset of the GPQA benchmark with graduate-level science questions. "In our experiments, we use the highest-quality subset, GPQA-Diamond,"
- GRPO: Group Relative Policy Optimization, a grouped RL algorithm that computes advantages relative to a group of sampled responses; related to RLVR-style training. "Recent RLVR methods such as GRPO (Shao et al., 2024) and DAPO (Yu et al., 2025)"
- GSM8K: A benchmark of grade-school math word problems for evaluating basic numerical reasoning. "GSM8K (Cobbe et al., 2021) is a curated dataset of 1,319 elementary-level math word problems."
- Importance sampling ratio: The ratio between current and behavior policies used to reweight off-policy samples in policy-gradient methods. "used to bound the importance sampling ratio for stable optimization."
- Instruction-tuned (LLM): A model finetuned on instruction–response pairs to better follow user prompts. "Qwen2.5-7B-Instruct (Yang et al., 2024a) as a representative instruction-tuned LLM"
- Majority voting: An ensemble decoding strategy that selects the final answer most frequently produced across sampled generations. "majority voting (Wang et al., 2022)"
- Out-of-group (completions): Completions sampled from other groups (prefixes) used to estimate external success probabilities (e.g., for the re-solving reward). "estimated using out-of-group CoT completions,"
- Pass@1 accuracy: The probability that the first (single-sample) attempt is correct; a standard LLM evaluation metric. "improve pass@1 accuracy by sampling multiple reasoning trajectories in parallel for each query"
- Prefix Truncation Ratio: The proportion of an original response kept when truncating to create a reasoning prefix for continuation experiments. "Prefix Truncation Ratio"
- Re-solving (Re2): A framework that lets an LLM abort an unpromising reasoning path and restart from scratch, trained via RL. "we introduce Reinforcement Learning with Re-solving (Re2),"
- Redo rounds: The maximum allowed number of times the model can restart its solution process during inference or training. "When at most R redo rounds are allowed,"
- Reinforcement learning with verifiable rewards (RLVR): RL that uses objective, automatically checkable signals (e.g., correctness) to reward model outputs. "Reinforcement learning with verifiable rewards (RLVR) has shown promise"
- Resolve action: The explicit action of abandoning the current reasoning trajectory to restart the solution process. "the reward under the resolve action,"
- Test-time compute: The computational budget spent during inference (e.g., more samples/steps) to improve performance. "through scaling test-time compute"
- Test-time scaling: Increasing inference-time computation (longer CoTs, more samples) to boost model accuracy. "Test-time scaling of DAPO and Re2"
- Tree search: A decoding strategy exploring multiple branching reasoning paths to find higher-quality solutions. "tree search (Hao et al., 2023; Zhang et al., 2024)"