ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

Published 1 May 2026 in cs.LG and cs.CL | (2605.00380v1)

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) enhances reasoning of LLMs but usually exhibits limited generation diversity due to the over-incentivization of positive rewards. Although methods like Negative Sample Reinforcement (NSR) mitigate this issue by upweighting penalty from negative samples, they may suppress the semantic distributions shared between positive and negative responses. To boost reasoning ability without losing diversity, this paper proposes negative sample projection Residual Reinforcement Learning (ResRL) that decouples similar semantic distributions among positive and negative responses. We theoretically link Lazy Likelihood Displacement (LLD) to negative-positive head-gradient interference and derive a single-forward proxy that upper-bounds representation alignment to guide conservative advantage reweighting. ResRL then projects negative-token hidden representations onto an SVD-based low-rank positive subspace and uses projection residuals to modulate negative gradients, improving reasoning while preserving diversity and outperforming strong baselines on average across twelve benchmarks spanning Mathematics, Code, Agent Tasks, and Function Calling. Notably, ResRL surpasses NSR on mathematical reasoning by 9.4% in Avg@16 and 7.0% in Pass@128. Code is available at https://github.com/1229095296/ResRL.git.


Authors (9)

Summary

The paper introduces ResRL, which leverages projection residuals to decouple negative gradients from positive semantic structures in RLVR.
It employs SVD/PCA-based estimation of positive subspaces for token-wise adjustment, enhancing performance across various benchmarks.
Empirical results show significant improvements in mathematical reasoning, code generation, and agentic tasks, validating robust multi-step planning.

Negative Sample Projection Residual Reinforcement Learning: Decoupling Gradient Interference in RLVR

Motivation and Theoretical Foundations

Reinforcement Learning with Verifiable Rewards (RLVR) is a widely used paradigm for post-training LLMs to enhance multi-step reasoning, exemplified by Group Relative Policy Optimization (GRPO), the algorithm used to train DeepSeek-R1. However, standard RLVR suffers from reduced output diversity, primarily because it over-incentivizes positive-reward trajectories, which leads to mode collapse and diminished Pass@k performance. Negative Sample Reinforcement (NSR) was introduced to combat this by upweighting negative gradients, but in practice it introduces gradient interference because positive and negative samples share much of their semantic distribution: their tokens overlap in syntactic elements and intermediate reasoning steps. NSR can therefore penalize valid token distributions that are integral to correct completions.

ResRL is motivated by the need to disentangle positive and negative response optimization without sacrificing semantic diversity. The theoretical framework links Lazy Likelihood Displacement (LLD), the failure to increase the log-likelihood of correct trajectories in RLVR, to gradient interference between positive and negative samples at the output head. Using a geometric decomposition, the paper shows that output-head gradient inner products factor into logit-space and representation-space components, with penultimate-layer hidden states serving as proxies for semantic structure.
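To make the factorization concrete, the following is a short worked derivation for a plain softmax output head. The notation is an assumption of this sketch rather than the paper's own: head weights $W$, hidden state $h$, logits $z = Wh$, probabilities $p = \mathrm{softmax}(z)$, and one-hot vector $e_y$ for token $y$. Superscripts $+$ and $-$ mark a positive-sample and a negative-sample token.

```latex
% Gradient of the token log-likelihood with respect to the head weights
% is a rank-one outer product:
\nabla_W \log \pi(y \mid h) = (e_y - p)\, h^\top

% The Frobenius inner product of two rank-one matrices factors into
% a product of two scalar inner products:
\big\langle \nabla_W \log \pi(y^+ \mid h^+),\;
            \nabla_W \log \pi(y^- \mid h^-) \big\rangle_F
  = \underbrace{(e_{y^+} - p^+)^\top (e_{y^-} - p^-)}_{\text{logit-space term}}
    \;\cdot\;
    \underbrace{(h^+)^\top h^-}_{\text{representation-space term}}
```

Under this simplified setting, the interference between a positive and a negative token vanishes whenever their hidden states are orthogonal, whatever the logit-space term is, which is why alignment between hidden representations is a natural quantity to control.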

Methodological Advances: Projection Residuals for Gradient Decoupling

The main methodological innovation is the introduction of negative-sample projection residuals, computed by projecting negative-token representations onto a low-rank positive subspace (estimated via SVD/PCA from LayerNorm-centered positive hidden states). The orthogonal-complement energy (projection residual) quantifies deviation from positive semantic structure and forms the basis for token-wise gradient reweighting. Negative gradients are modulated to penalize only orthogonal, error-specific directions, while protecting tokens aligned with the positive subspace from suppression. This provides a theoretically motivated conservative upper bound for representation alignment, mitigating deleterious effects of LLD.
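The projection step described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the parameter-free LayerNorm, the toy data, and the choice to report the residual as a fraction of total energy are all hypothetical.

```python
import numpy as np

def layer_norm(h, eps=1e-6):
    """Parameter-free LayerNorm over the feature axis (assumption of this sketch)."""
    mu = h.mean(axis=-1, keepdims=True)
    sigma = h.std(axis=-1, keepdims=True)
    return (h - mu) / (sigma + eps)

def positive_subspace(pos_hidden, k):
    """Estimate a rank-k basis from positive-token hidden states of shape (n_pos, d)."""
    z = layer_norm(pos_hidden)
    mean = z.mean(axis=0)
    # Rows of vt are right singular vectors, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(z - mean, full_matrices=False)
    return mean, vt[:k].T  # basis: (d, k) with orthonormal columns

def projection_residual(neg_hidden, mean, basis):
    """Fraction of each negative token's energy lying outside the positive subspace."""
    x = layer_norm(neg_hidden) - mean
    coeffs = x @ basis                 # coordinates inside the subspace
    resid = x - coeffs @ basis.T       # orthogonal-complement component
    total = np.sum(x * x, axis=1)
    return np.sum(resid * resid, axis=1) / np.maximum(total, 1e-12)

# Toy data: positive tokens live in a 2-D subspace of an 8-D hidden space.
rng = np.random.default_rng(0)
directions = rng.normal(size=(2, 8))
pos = rng.normal(size=(50, 2)) @ directions
mean, basis = positive_subspace(pos, k=2)

aligned_neg = rng.normal(size=(5, 2)) @ directions  # shares positive structure
off_neg = rng.normal(size=(5, 8))                   # generic directions
r_aligned = projection_residual(aligned_neg, mean, basis)
r_off = projection_residual(off_neg, mean, basis)
```

In this toy setup, negative tokens built from the same directions as the positives get a residual near zero, while generic negatives get a clearly larger one; that contrast is the signal used for reweighting.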

Algorithmically, ResRL employs:

Positive subspace estimation via truncated SVD/PCA using sampled positive tokens, capturing dominant semantic directions.
Token-wise projection residual computation for negatives, mapped to normalized quantile scores for robust gating.
Dynamic adjustment of NSR weights so that destructive negative overlap is attenuated, offering a precision-diversity trade-off and reducing variance in long-horizon optimization.
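The last two steps above, quantile scoring and weight adjustment, might look roughly like the sketch below. The rank-based quantile, the linear gate, and the `floor` parameter are illustrative assumptions; the source does not give the exact gating function.

```python
import numpy as np

def quantile_scores(residuals):
    """Map residuals to [0, 1] by their rank within the batch."""
    r = np.asarray(residuals, dtype=float)
    n = r.size
    if n == 1:
        return np.ones(1)
    # argsort of argsort yields each element's rank (0 = smallest residual).
    ranks = np.argsort(np.argsort(r, kind="stable"), kind="stable")
    return ranks / (n - 1)

def reweight_negative_advantages(advantages, residuals, floor=0.1):
    """Scale negative-token advantages by a gate that grows with the residual.

    `floor` (hypothetical) keeps a small penalty even for tokens that lie
    almost entirely inside the positive subspace.
    """
    adv = np.asarray(advantages, dtype=float)
    gate = floor + (1.0 - floor) * quantile_scores(residuals)
    return gate * adv

# Three negative tokens with equal raw advantage but different residuals.
scaled = reweight_negative_advantages([-1.0, -1.0, -1.0], [0.05, 0.9, 0.4])
# scaled is approximately [-0.1, -1.0, -0.55]
```

Using ranks rather than raw residual values makes the gate insensitive to the residual scale, which can vary between batches and layers.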

The use of length-scaled reward mechanisms ensures stability against verbosity exploitation, while group-relative quantile normalization maintains consistency across variable-length rollouts.

Empirical Results and Ablations

ResRL was evaluated on twelve benchmarks across mathematics, code generation, agent tasks, and function calling, using several Qwen backbone variants (1.7B–32B). Key results include:

Mathematical Reasoning: On Qwen3-4B, ResRL outperforms NSR by 9.4% on Avg@16 and by 7.0% on Pass@128. On Qwen3-8B, ResRL achieves the best overall accuracy, with the largest improvements on the challenging AIME datasets (up to a 27.8% boost over FlowRL).
Code Generation: ResRL sets a new SOTA on CodeForces with a rating of 1469.5, surpassing NSR by 9.6%, and improves percentile ranking by 13.9%. HumanEval+ is saturated, but ResRL retains a slight lead.
Agentic Tasks: On ALFWorld using Qwen2.5-7B-Instruct, ResRL reaches 86.7% success rate, outperforming PPO and EMPG by significant margins. WebShop success rate is also improved.
Function Calling: ResRL achieves the highest Multi-Turn OA on BFCL, outperforming ResT and NSR, especially on error-sensitive subsets (Miss Func and Miss Param), which is presented as evidence of robust multi-step tool-use planning.
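For readers unfamiliar with the metrics above, the sketch below shows how Avg@n and Pass@k are commonly computed. It assumes Avg@n is the mean accuracy over n sampled responses and that Pass@k uses the standard unbiased combinatorial estimator; the paper's exact evaluation protocol is not stated here and may differ.

```python
from math import comb

def avg_at_n(correct_flags):
    """Mean accuracy over n sampled responses for one problem."""
    return sum(correct_flags) / len(correct_flags)

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples, c of which are correct."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# One problem, 16 samples, 4 correct.
avg16 = avg_at_n([1] * 4 + [0] * 12)   # 0.25
p1 = pass_at_k(16, 4, 1)               # 0.25
p2 = pass_at_k(4, 1, 2)                # 0.5
```

Avg@n rewards reliability on each attempt, while Pass@k rewards having at least one correct answer among k attempts, so the second is more sensitive to output diversity.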

Comprehensive ablations indicate that intermediate-rank positive subspaces (e.g., $k=64$) give the best balance between discrimination and protection, and that penultimate-layer representations best capture robust semantic signals. LayerNorm and centering are indispensable for gradient stability. Removing the KL penalty does not destabilize ResRL, which is attributed to the intrinsic regularization of projection-based gating.

Practical and Theoretical Implications

ResRL directly addresses the long-standing challenge of destructive negative-positive overlap in policy optimization for RLVR. It establishes a rigorous mechanism for semantic decoupling, minimizing cross-sign interference while preserving generalization and output diversity. The projection-residual reweighting enables consistent performance gains in both reliability-centric (Avg@16, Pass@1) and diversity-centric (Pass@k) metrics across all tested domains.

Practically, ResRL’s modular design, low-rank efficiency, and independence from explicit KL penalty make it scalable and computationally tractable for long-horizon, high-budget RLVR regimes. Theoretical advances in conservative gradient bounding suggest future refinements in compositional reasoning and multi-task learning, as well as improved robustness against reward hacking and mode collapse.

Further developments may entail adaptive rank selection, integration with uncertainty-modulated exploration signals, and extension to hierarchical multi-turn agent paradigms. The underlying geometric principles could inform advanced architectures for disentangled skill acquisition and continual learning.

Conclusion

ResRL represents a principled token-level reinforcement paradigm that, via projection residuals and representation-space decoupling, delivers robust reasoning improvements under RLVR, without compromising generation diversity. Empirical superiority over GRPO, NSR, and FlowRL is achieved across mathematics, code, agentic, and function calling benchmarks, validating its efficacy. The approach suggests avenues for scalable RLVR, stable long-horizon optimization, and more reliable semantic control in LLM post-training (2605.00380).
