Reinforcing General Reasoning without Verifiers

Published 27 May 2025 in cs.LG and cs.CL | (2505.21493v1)

Abstract: The recent paradigm shift towards training LLMs using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and mathematical reasoning. However, this methodology is limited to tasks where rule-based answer verification is possible and does not naturally extend to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics. Current practical workarounds use an additional LLM as a model-based verifier; however, this introduces issues such as reliance on a strong verifier LLM, susceptibility to reward hacking, and the practical burden of maintaining the verifier model in memory during training. To address this and extend DeepSeek-R1-Zero-style training to general reasoning domains, we propose a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. We compare VeriFree with verifier-based methods and demonstrate that, in addition to its significant practical benefits and reduced compute requirements, VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks. Moreover, we provide insights into this method from multiple perspectives: as an elegant integration of training both the policy and implicit verifier in a unified model, and as a variational optimization approach. Code is available at https://github.com/sail-sg/VeriFree.

Summary

  • The paper presents VeriFree, a novel RL framework that improves LLMs' general reasoning by directly maximizing the likelihood of reference answers without explicit verifiers.
  • It leverages reduced gradient variance and Reinforce Leave-One-Out (RLOO) to achieve faster convergence and superior performance on benchmarks like MMLU-Pro and SuperGPQA.
  • The method demonstrates transferable reasoning skills across domains, reducing reliance on verifier models and cutting computational overhead.

This paper introduces VeriFree, a novel verifier-free reinforcement learning (RL) method designed to enhance the general reasoning capabilities of LLMs without relying on explicit answer verifiers (2505.21493). The authors address the limitations of existing DeepSeek-R1-Zero-style RL, which excels in domains like math and coding where rule-based answer verification is feasible but struggles with general reasoning tasks (e.g., chemistry, law, biology) where such verification is difficult or impossible. While model-based verifiers (using another LLM) are a workaround, they introduce dependencies, potential for reward hacking, and computational overhead.

VeriFree bypasses the need for any verifier by directly maximizing the probability of generating the reference answer given a question and a model-generated reasoning trace. The core idea is to:

  1. Have the LLM (policy $\pi_\theta$) generate a reasoning trace $c$ in response to a question $q$.
  2. Concatenate this generated reasoning trace $c$ with the known reference answer $a^\star$ from the dataset.
  3. Evaluate the likelihood $\pi_\theta(a^\star | q, c)$ of the reference answer $a^\star$ conditioned on the question $q$ and the generated reasoning trace $c$.
  4. Use this likelihood $\pi_\theta(a^\star | q, c)$ as the reward signal (see the sketch below).
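
The reward computation can be illustrated with a minimal sketch, assuming a HuggingFace-style causal LM interface; the function name and tensor handling are illustrative, not taken from the paper's released code:

```python
import torch
import torch.nn.functional as F

def verifree_reward(model, tokenizer, question_ids, trace_ids, answer_text):
    """Sketch of R_VeriFree = pi_theta(a* | q, c): the probability the policy
    assigns to the reference answer a*, given the question q and a
    model-generated reasoning trace c. (Illustrative, not the paper's code.)"""
    answer_ids = tokenizer(
        answer_text, add_special_tokens=False, return_tensors="pt"
    ).input_ids.to(question_ids.device)
    # Patch the reference answer onto the question + generated trace.
    input_ids = torch.cat([question_ids, trace_ids, answer_ids], dim=-1)
    with torch.no_grad():
        logits = model(input_ids).logits          # [1, seq_len, vocab]
    ans_len = answer_ids.size(-1)
    # Logits at position t predict token t+1, so shift by one position.
    ans_logits = logits[:, -ans_len - 1:-1, :]
    log_probs = F.log_softmax(ans_logits, dim=-1)
    token_logps = log_probs.gather(-1, answer_ids.unsqueeze(-1)).squeeze(-1)
    # pi_theta(a* | q, c) is the product of per-token probabilities.
    return token_logps.sum(dim=-1).exp()          # tensor of shape [1], in [0, 1]
```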

The VeriFree objective function is $J_{\text{VeriFree}}(\theta; q, a^\star) = E_{c \sim \pi_\theta(\cdot|q)}\left[\pi_\theta(a^\star|q,c)\right]$. This is shown to be equivalent in expectation to the verifier-based objective $J_{\text{Verifier}}(\theta; q, a^\star) = E_{c \sim \pi_\theta(\cdot|q)} E_{a \sim \pi_\theta(\cdot|q,c)}\left[\mathds{1}_{\{a \equiv a^\star\}}\right]$ when there is a unique correct answer string.
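
The equivalence is a one-line marginalization: under the unique-correct-answer assumption,

$$E_{a \sim \pi_\theta(\cdot|q,c)}\big[\mathds{1}_{\{a \equiv a^\star\}}\big] = \sum_{a} \pi_\theta(a|q,c)\,\mathds{1}_{\{a \equiv a^\star\}} = \pi_\theta(a^\star|q,c),$$

so taking the outer expectation over $c \sim \pi_\theta(\cdot|q)$ recovers $J_{\text{VeriFree}}$ from $J_{\text{Verifier}}$.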

The gradient estimator for VeriFree is derived as:

$$\nabla_\theta J_{\text{VeriFree}}(\theta; q, a^\star) = E_{c \sim \pi_\theta(\cdot|q)}\bigg[R_{\text{VeriFree}}(q, a^\star, c)\big[\nabla_\theta\log\pi_\theta(c|q) + \nabla_\theta\log\pi_\theta(a^\star|q,c)\big]\bigg]$$

where $R_{\text{VeriFree}}(q, a^\star, c) = \pi_\theta(a^\star|q,c)$. The first term, $\nabla_\theta\log\pi_\theta(c|q)$, is a policy-gradient term for the reasoning trace, and the second term, $\nabla_\theta\log\pi_\theta(a^\star|q,c)$, is a reward-weighted supervised learning term for the reference answer.
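
In an autodiff framework, this estimator can be implemented as a surrogate loss whose gradient matches the expression above; a minimal sketch with illustrative names (the detached reward multiplies both log-probability terms):

```python
import torch

def verifree_surrogate_loss(logp_trace, logp_answer):
    """Per-sample surrogate loss for the VeriFree gradient estimator (sketch).

    logp_trace:  sum of log pi_theta(c | q) over the trace tokens (requires grad)
    logp_answer: sum of log pi_theta(a* | q, c) over the answer tokens (requires grad)
    """
    reward = logp_answer.exp().detach()   # R_VeriFree = pi_theta(a* | q, c)
    # Differentiating -loss gives R * [grad log pi(c|q) + grad log pi(a*|q,c)].
    loss = -(reward * logp_trace + reward * logp_answer)
    return loss, reward
```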

A key theoretical advantage highlighted is variance reduction. Theorem 1 states that the variance of the VeriFree gradient estimator is less than or equal to that of the verifier-based estimator, a result of Rao-Blackwellization by analytically marginalizing out the answer sampling step. The final on-policy gradient estimator incorporates RLOO (Reinforce Leave-One-Out) for further variance reduction:

$$\nabla_\theta J_{\text{VeriFree}}(\theta) = \frac{1}{G} \sum_{i=1}^G \left[A_i\cdot\nabla_\theta\log\pi_\theta(c_i | q) + R_i\cdot\nabla_\theta\log\pi_\theta(a^\star | q, c_i)\right]$$

where $c_i \sim \pi_\theta(\cdot|q)$, $R_i = \pi_\theta(a^\star|q, c_i)$, and $A_i = R_i - \frac{1}{G-1}\sum_{j\neq i} \pi_\theta(a^\star|q, c_j)$.
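
A batched sketch of the RLOO-baselined version, computing the leave-one-out advantage $A_i$ for $G$ traces sampled for the same question (illustrative, not the authors' implementation):

```python
import torch

def verifree_rloo_loss(logp_traces, logp_answers):
    """logp_traces, logp_answers: tensors of shape [G] with log pi_theta(c_i | q)
    and log pi_theta(a* | q, c_i) for G >= 2 sampled traces (sketch)."""
    G = logp_traces.size(0)
    rewards = logp_answers.exp().detach()            # R_i
    # Leave-one-out baseline: mean reward of the other G - 1 traces.
    baseline = (rewards.sum() - rewards) / (G - 1)
    advantages = rewards - baseline                  # A_i
    # Advantage-weighted policy-gradient term + reward-weighted SFT term.
    return -(advantages * logp_traces + rewards * logp_answers).mean()
```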

A practical implementation challenge addressed is tokenization at the "patching point" where the generated reasoning trace $c$ meets the reference answer $a^\star$. To ensure consistent tokenization, the authors define the end of $c$ at the token corresponding to "<answer" (without the closing ">"), which is equivalent to using "<answer" as a stop word during sampling.
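
A minimal sketch of this splitting logic; the answer template and helper name are assumptions on our part, since the paper only specifies that $c$ ends at the token for "<answer" (equivalently, "<answer" can be passed as a stop string to the sampler):

```python
def patch_reference_answer(generated_text: str, reference_answer: str,
                           marker: str = "<answer") -> str:
    """Cut the sampled text at the '<answer' patching point and append the known
    reference answer a*, so pi_theta(a* | q, c) is scored on a consistently
    tokenized sequence. The closing-tag template is illustrative."""
    cut = generated_text.find(marker)
    # Keep everything up to and including '<answer' as the reasoning trace c.
    trace = generated_text if cut == -1 else generated_text[: cut + len(marker)]
    return trace + ">" + reference_answer + "</answer>"
```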

Experiments and Results:

  • Models: Qwen3 base models (1.7B, 4B, 8B parameters).
  • Training Data: "WebData," a curated dataset of ~61,000 samples from WebInstruct, filtered for quality and answer length.
  • Evaluation Benchmarks:
    • General Reasoning: MMLU-Pro, SuperGPQA, GPQA.
    • Math Reasoning: MATH-500, OlympiadBench, Minerva Math, GSM8K, AMC, AIME24.
  • Baselines:
    • "Verifier": A verifier-based approach using a fine-tuned Qwen2.5-Math-1.5B model as the verifier, optimized with Dr.GRPO.
    • Qwen3 base and instruct models, and other publicly available RL-tuned models.

Key Findings:

  1. Improved General Reasoning: VeriFree significantly improved the general reasoning capabilities of base LLMs on MMLU-Pro (12%-40% average accuracy gain) and SuperGPQA, often matching or surpassing instruct models and the "Verifier" baseline.
  2. Better Learning Efficiency: VeriFree demonstrated faster convergence and higher final accuracy compared to the verifier-based baseline, attributed to reduced gradient variance. (See Figure 1 for training dynamics).
  3. Model Confidence as Proxy: A strong positive correlation ($\rho = 0.82$) was found between MMLU-Pro accuracy and the model's average confidence $\pi_\theta(a^\star|q,c)$ during training, suggesting this confidence is a good proxy for reasoning capability.
  4. Transferable Reasoning Skills: A model trained with VeriFree on non-math data showed improved performance on math benchmarks, indicating that VeriFree learns generalizable reasoning skills.
  5. Ablation Studies:
    • The proposed tokenization-aware splitting strategy for reasoning traces was crucial for stable optimization.
    • RLOO significantly contributed to performance.
    • Incorporating an equivalence class of correct answers (instead of a single reference answer) yielded slight performance improvements on math tasks, suggesting that the single-reference-answer assumption is a minor limitation and an area for future work.

Comparison to Existing Verifier-Free Approaches:

The paper distinguishes VeriFree from variational inference-based methods like JLB (Tang et al., 25 Mar 2025) and LaTRO (Chen et al., 2024). While these methods also treat the reasoning trace as a latent variable, VeriFree's objective is argued to be closer to the original verifier-based objective (under a single-correct-answer assumption). VeriFree weights the reference answer term by $\pi_\theta(a^\star|q,c)$, unlike JLB and LaTRO, which use a weight of 1, potentially avoiding reinforcement of mismatches between flawed reasoning and correct answers.
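
Concretely (our paraphrase of the distinction), the reference-answer term enters the two gradient styles as

$$\underbrace{\pi_\theta(a^\star|q,c)\,\nabla_\theta\log\pi_\theta(a^\star|q,c)}_{\text{VeriFree}} \qquad \text{vs.} \qquad \underbrace{1\cdot\nabla_\theta\log\pi_\theta(a^\star|q,c)}_{\text{JLB/LaTRO-style}},$$

so a trace under which the model itself assigns low probability to $a^\star$ contributes little to the answer-fitting update.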

Implementation Pseudocode Comparison:

| Verifier-based (R1-Zero) | VeriFree (Ours) |
| --- | --- |
| Model generates reasoning trace $c$ and answer $a$. | Model generates reasoning trace $c$. |
| Extract the answer $a$. | Patch in the correct answer $a^\star$. |
| Check the answer using a verifier. | Evaluate the probability $\pi_\theta(a^\star|q,c)$. |
| Reward $R_{\text{Verifier}} = 1$ if correct, $0$ otherwise. | Reward $R_{\text{VeriFree}} = \pi_\theta(a^\star|q,c)$. |
| Train with $\nabla_\theta J_{\text{Verifier}}$. | Train with $\nabla_\theta J_{\text{VeriFree}}$. |

Conclusion:

VeriFree offers a practical and effective method for extending R1-Zero-style RL training to general reasoning domains where verifiers are unavailable or costly. It achieves this by directly optimizing the likelihood of the reference answer given a generated reasoning trace, leading to comparable or superior performance to verifier-based methods with reduced computational requirements and improved learning efficiency due to lower variance gradients. The work provides a new perspective for LLM RL and a path towards building more general-purpose reasoners.
