Reinforcing General Reasoning without Verifiers

Published 27 May 2025 in cs.LG and cs.CL | (2505.21493v1)

Abstract: The recent paradigm shift towards training LLMs using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and mathematical reasoning. However, this methodology is limited to tasks where rule-based answer verification is possible and does not naturally extend to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics. Current practical workarounds use an additional LLM as a model-based verifier; however, this introduces issues such as reliance on a strong verifier LLM, susceptibility to reward hacking, and the practical burden of maintaining the verifier model in memory during training. To address this and extend DeepSeek-R1-Zero-style training to general reasoning domains, we propose a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. We compare VeriFree with verifier-based methods and demonstrate that, in addition to its significant practical benefits and reduced compute requirements, VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks. Moreover, we provide insights into this method from multiple perspectives: as an elegant integration of training both the policy and implicit verifier in a unified model, and as a variational optimization approach. Code is available at https://github.com/sail-sg/VeriFree.

Summary

  • The paper presents VeriFree, a novel RL framework that improves LLMs' general reasoning by directly maximizing the likelihood of reference answers without explicit verifiers.
  • It leverages reduced gradient variance and Reinforce Leave-One-Out (RLOO) to achieve faster convergence and superior performance on benchmarks like MMLU-Pro and SuperGPQA.
  • The method demonstrates transferable reasoning skills across domains, reducing reliance on verifier models and cutting computational overhead.

This paper introduces VeriFree, a novel verifier-free reinforcement learning (RL) method designed to enhance the general reasoning capabilities of LLMs without relying on explicit answer verifiers (2505.21493). The authors address the limitations of existing DeepSeek-R1-Zero-style RL, which excels in domains like math and coding where rule-based answer verification is feasible but struggles with general reasoning tasks (e.g., chemistry, law, biology) where such verification is difficult or impossible. While model-based verifiers (using another LLM) are a workaround, they introduce dependencies, potential for reward hacking, and computational overhead.

VeriFree bypasses the need for any verifier by directly maximizing the probability of generating the reference answer given a question and a model-generated reasoning trace. The core idea is to:

  1. Have the LLM (policy $\pi_\theta$) generate a reasoning trace $c$ in response to a question $q$.
  2. Concatenate this generated reasoning trace $c$ with the known reference answer $a^\star$ from the dataset.
  3. Evaluate the likelihood $\pi_\theta(a^\star | q, c)$ of the reference answer $a^\star$ conditioned on the question $q$ and the generated reasoning trace $c$.
  4. Use this likelihood $\pi_\theta(a^\star | q, c)$ as the reward signal (see the sketch below).
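
The reward computation can be illustrated with a minimal sketch, assuming a HuggingFace-style causal LM interface; the function name and tensor handling are illustrative, not taken from the paper's released code:

```python
import torch
import torch.nn.functional as F

def verifree_reward(model, tokenizer, question_ids, trace_ids, answer_text):
    """Sketch of R_VeriFree = pi_theta(a* | q, c): the probability the policy
    assigns to the reference answer a*, given the question q and a
    model-generated reasoning trace c. (Illustrative, not the paper's code.)"""
    answer_ids = tokenizer(
        answer_text, add_special_tokens=False, return_tensors="pt"
    ).input_ids.to(question_ids.device)
    # Patch the reference answer onto the question + generated trace.
    input_ids = torch.cat([question_ids, trace_ids, answer_ids], dim=-1)
    with torch.no_grad():
        logits = model(input_ids).logits          # [1, seq_len, vocab]
    ans_len = answer_ids.size(-1)
    # Logits at position t predict token t+1, so shift by one position.
    ans_logits = logits[:, -ans_len - 1:-1, :]
    log_probs = F.log_softmax(ans_logits, dim=-1)
    token_logps = log_probs.gather(-1, answer_ids.unsqueeze(-1)).squeeze(-1)
    # pi_theta(a* | q, c) is the product of per-token probabilities.
    return token_logps.sum(dim=-1).exp()          # tensor of shape [1], in [0, 1]
```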

The VeriFree objective function is $J_{\text{VeriFree}}(\theta; q, a^\star) = E_{c \sim \pi_\theta(\cdot|q)}\left[\pi_\theta(a^\star|q,c)\right]$. This is shown to be equivalent in expectation to the verifier-based objective $J_{\text{Verifier}}(\theta; q, a^\star) = E_{c \sim \pi_\theta(\cdot|q)} E_{a \sim \pi_\theta(\cdot|q,c)}\left[\mathds{1}_{\{a \equiv a^\star\}}\right]$ when there is a unique correct answer string.
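
The equivalence is a one-line marginalization: under the unique-correct-answer assumption,

$$E_{a \sim \pi_\theta(\cdot|q,c)}\big[\mathds{1}_{\{a \equiv a^\star\}}\big] = \sum_{a} \pi_\theta(a|q,c)\,\mathds{1}_{\{a \equiv a^\star\}} = \pi_\theta(a^\star|q,c),$$

so taking the outer expectation over $c \sim \pi_\theta(\cdot|q)$ recovers $J_{\text{VeriFree}}$ from $J_{\text{Verifier}}$.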

The gradient estimator for VeriFree is derived as:

$$\nabla_\theta J_{\text{VeriFree}}(\theta; q, a^\star) = E_{c \sim \pi_\theta(\cdot|q)}\bigg[R_{\text{VeriFree}}(q, a^\star, c)\big[\nabla_\theta\log\pi_\theta(c|q) + \nabla_\theta\log\pi_\theta(a^\star|q,c)\big]\bigg]$$

where $R_{\text{VeriFree}}(q, a^\star, c) = \pi_\theta(a^\star|q,c)$. The first term, $\nabla_\theta\log\pi_\theta(c|q)$, is a policy-gradient term for the reasoning trace, and the second term, $\nabla_\theta\log\pi_\theta(a^\star|q,c)$, is a reward-weighted supervised learning term for the reference answer.
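
In an autodiff framework, this estimator can be implemented as a surrogate loss whose gradient matches the expression above; a minimal sketch with illustrative names (the detached reward multiplies both log-probability terms):

```python
import torch

def verifree_surrogate_loss(logp_trace, logp_answer):
    """Per-sample surrogate loss for the VeriFree gradient estimator (sketch).

    logp_trace:  sum of log pi_theta(c | q) over the trace tokens (requires grad)
    logp_answer: sum of log pi_theta(a* | q, c) over the answer tokens (requires grad)
    """
    reward = logp_answer.exp().detach()   # R_VeriFree = pi_theta(a* | q, c)
    # Differentiating -loss gives R * [grad log pi(c|q) + grad log pi(a*|q,c)].
    loss = -(reward * logp_trace + reward * logp_answer)
    return loss, reward
```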

A key theoretical advantage highlighted is variance reduction. Theorem 1 states that the variance of the VeriFree gradient estimator is less than or equal to that of the verifier-based estimator, a result of Rao-Blackwellization by analytically marginalizing out the answer sampling step. The final on-policy gradient estimator incorporates RLOO (Reinforce Leave-One-Out) for further variance reduction:

$$\nabla_\theta J_{\text{VeriFree}}(\theta) = \frac{1}{G} \sum_{i=1}^G \left[A_i\cdot\nabla_\theta\log\pi_\theta(c_i | q) + R_i\cdot\nabla_\theta\log\pi_\theta(a^\star | q, c_i)\right]$$

where $c_i \sim \pi_\theta(\cdot|q)$, $R_i = \pi_\theta(a^\star|q, c_i)$, and $A_i = R_i - \frac{1}{G-1}\sum_{j\neq i} \pi_\theta(a^\star|q, c_j)$.
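
A batched sketch of the RLOO-baselined version, computing the leave-one-out advantage $A_i$ for $G$ traces sampled for the same question (illustrative, not the authors' implementation):

```python
import torch

def verifree_rloo_loss(logp_traces, logp_answers):
    """logp_traces, logp_answers: tensors of shape [G] with log pi_theta(c_i | q)
    and log pi_theta(a* | q, c_i) for G >= 2 sampled traces (sketch)."""
    G = logp_traces.size(0)
    rewards = logp_answers.exp().detach()            # R_i
    # Leave-one-out baseline: mean reward of the other G - 1 traces.
    baseline = (rewards.sum() - rewards) / (G - 1)
    advantages = rewards - baseline                  # A_i
    # Advantage-weighted policy-gradient term + reward-weighted SFT term.
    return -(advantages * logp_traces + rewards * logp_answers).mean()
```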

A practical implementation challenge addressed is tokenization at the "patching point" where the generated reasoning trace $c$ meets the reference answer $a^\star$. To ensure consistent tokenization, the authors define the end of $c$ at the token corresponding to "<answer" (without the closing ">"), which is equivalent to using "<answer" as a stop word during sampling.
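
A minimal sketch of this splitting logic; the answer template and helper name are assumptions on our part, since the paper only specifies that $c$ ends at the token for "<answer" (equivalently, "<answer" can be passed as a stop string to the sampler):

```python
def patch_reference_answer(generated_text: str, reference_answer: str,
                           marker: str = "<answer") -> str:
    """Cut the sampled text at the '<answer' patching point and append the known
    reference answer a*, so pi_theta(a* | q, c) is scored on a consistently
    tokenized sequence. The closing-tag template is illustrative."""
    cut = generated_text.find(marker)
    # Keep everything up to and including '<answer' as the reasoning trace c.
    trace = generated_text if cut == -1 else generated_text[: cut + len(marker)]
    return trace + ">" + reference_answer + "</answer>"
```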

Experiments and Results:

  • Models: Qwen3 base models (1.7B, 4B, 8B parameters).
  • Training Data: "WebData," a curated dataset of ~61,000 samples from WebInstruct, filtered for quality and answer length.
  • Evaluation Benchmarks:
    • General Reasoning: MMLU-Pro, SuperGPQA, GPQA.
    • Math Reasoning: MATH-500, OlympiadBench, Minerva Math, GSM8K, AMC, AIME24.
  • Baselines:
    • "Verifier": A verifier-based approach using a fine-tuned Qwen2.5-Math-1.5B model as the verifier, optimized with Dr.GRPO.
    • Qwen3 base and instruct models, and other publicly available RL-tuned models.

Key Findings:

  1. Improved General Reasoning: VeriFree significantly improved the general reasoning capabilities of base LLMs on MMLU-Pro (12%-40% average accuracy gain) and SuperGPQA, often matching or surpassing instruct models and the "Verifier" baseline.
  2. Better Learning Efficiency: VeriFree demonstrated faster convergence and higher final accuracy compared to the verifier-based baseline, attributed to reduced gradient variance. (See Figure 1 for training dynamics).
  3. Model Confidence as Proxy: A strong positive correlation ($\rho = 0.82$) was found between MMLU-Pro accuracy and the model's average confidence $\pi_\theta(a^\star|q,c)$ during training, suggesting this confidence is a good proxy for reasoning capability.
  4. Transferable Reasoning Skills: A model trained with VeriFree on non-math data showed improved performance on math benchmarks, indicating that VeriFree learns generalizable reasoning skills.
  5. Ablation Studies:
    • The proposed tokenization-aware splitting strategy for reasoning traces was crucial for stable optimization.
    • RLOO significantly contributed to performance.
    • Incorporating an equivalence class of correct answers (instead of a single reference answer) yielded slight performance improvements on math tasks, suggesting that the single-reference-answer assumption is a minor limitation and an area for future work.

Comparison to Existing Verifier-Free Approaches:

The paper distinguishes VeriFree from variational inference-based methods like JLB (Tang et al., 25 Mar 2025) and LaTRO (Chen et al., 2024). While these methods also treat the reasoning trace as a latent variable, VeriFree's objective is argued to be closer to the original verifier-based objective (under a single-correct-answer assumption). VeriFree weights the reference answer term by $\pi_\theta(a^\star|q,c)$, unlike JLB and LaTRO, which use a weight of 1, potentially avoiding reinforcement of mismatches between flawed reasoning and correct answers.
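
Concretely (our paraphrase of the distinction), the reference-answer term enters the two gradient styles as

$$\underbrace{\pi_\theta(a^\star|q,c)\,\nabla_\theta\log\pi_\theta(a^\star|q,c)}_{\text{VeriFree}} \qquad \text{vs.} \qquad \underbrace{1\cdot\nabla_\theta\log\pi_\theta(a^\star|q,c)}_{\text{JLB/LaTRO-style}},$$

so a trace under which the model itself assigns low probability to $a^\star$ contributes little to the answer-fitting update.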

Implementation Pseudocode Comparison:

| Verifier-based (R1-Zero) | VeriFree (Ours) |
| --- | --- |
| Model generates reasoning trace $c$ and answer $a$. | Model generates reasoning trace $c$. |
| Extract the answer $a$. | Patch in the correct answer $a^\star$. |
| Check the answer using a verifier. | Evaluate the probability $\pi_\theta(a^\star|q,c)$. |
| Reward $R_{\text{Verifier}} = 1$ if correct, $0$ otherwise. | Reward $R_{\text{VeriFree}} = \pi_\theta(a^\star|q,c)$. |
| Train with $\nabla_\theta J_{\text{Verifier}}$. | Train with $\nabla_\theta J_{\text{VeriFree}}$. |

Conclusion:

VeriFree offers a practical and effective method for extending R1-Zero-style RL training to general reasoning domains where verifiers are unavailable or costly. It achieves this by directly optimizing the likelihood of the reference answer given a generated reasoning trace, leading to comparable or superior performance to verifier-based methods with reduced computational requirements and improved learning efficiency due to lower variance gradients. The work provides a new perspective for LLM RL and a path towards building more general-purpose reasoners.
