- The paper introduces ResRL, which leverages projection residuals to decouple negative gradients from positive semantic structures in RLVR.
- It employs SVD/PCA-based estimation of positive subspaces for token-wise adjustment, enhancing performance across various benchmarks.
- Empirical results show significant improvements in mathematical reasoning, code generation, and agentic tasks, validating robust multi-step planning.
Negative Sample Projection Residual Reinforcement Learning: Decoupling Gradient Interference in RLVR
Motivation and Theoretical Foundations
Reinforcement Learning with Verifiable Rewards (RLVR) is the established paradigm for post-training LLMs to enhance multi-step reasoning, exemplified by Group-Relative Policy Optimization (GRPO) and extensions such as DeepSeek-R1. However, standard RLVR suffers from reduced output diversity, primarily due to over-incentivization of positive reward trajectories, leading to mode collapse and diminished Pass@k performance. Negative Sample Reinforcement (NSR) was introduced to combat this by upweighting negative gradients, but in practice it introduces gradient interference because positive and negative samples share substantial semantic content: their tokens overlap across syntactic elements and intermediate reasoning steps. NSR can therefore penalize valid token distributions that are integral to correct completions.
ResRL is motivated by the need to robustly disentangle positive and negative response optimization without sacrificing semantic diversity. The core theoretical framework establishes a rigorous link between Lazy Likelihood Displacement (LLD), the failure to increase the log-likelihood of correct trajectories in RLVR, and gradient interference between positive and negative head outputs. Using geometric decomposition, ResRL demonstrates that output-head gradient inner products factor into logit-space and representation-space components, with hidden states in the penultimate layer serving as proxies for semantic structure.
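To see why such a factorization is plausible, consider a linear output head with logits z = W h and no bias. This setup and the notation (p, e_y, and the superscripts + and -) are assumptions made here for illustration; the paper's own derivation may differ in detail. Under that assumption, the standard softmax gradient identity gives:

```latex
% Assumed setup: logits z = W h, p = softmax(z), e_y the one-hot vector for token y.
\nabla_W \log \pi(y \mid h) = (e_y - p)\, h^{\top}

% Frobenius inner product between the head gradients of a positive token (y^+, h^+)
% and a negative token (y^-, h^-):
\left\langle \nabla_W \log \pi(y^{+} \mid h^{+}),\;
             \nabla_W \log \pi(y^{-} \mid h^{-}) \right\rangle_F
  = \underbrace{\left\langle e_{y^{+}} - p^{+},\; e_{y^{-}} - p^{-} \right\rangle}_{\text{logit-space term}}
    \cdot
    \underbrace{\left\langle h^{+},\; h^{-} \right\rangle}_{\text{representation-space term}}
```

The second factor depends only on the hidden states, which is consistent with using penultimate-layer representations as a handle on interference.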
Methodological Advances: Projection Residuals for Gradient Decoupling
The main methodological innovation is the introduction of negative-sample projection residuals, computed by projecting negative-token representations onto a low-rank positive subspace (estimated via SVD/PCA from LayerNorm-centered positive hidden states). The orthogonal-complement energy (projection residual) quantifies deviation from positive semantic structure and forms the basis for token-wise gradient reweighting. Negative gradients are modulated to penalize only orthogonal, error-specific directions, while protecting tokens aligned with the positive subspace from suppression. This provides a theoretically motivated conservative upper bound for representation alignment, mitigating deleterious effects of LLD.
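The residual computation described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the normalization is a parameter-free LayerNorm followed by subtracting the mean positive token, and the residual is reported as a fraction of each token's energy.

```python
import numpy as np

def layer_norm_center(h, eps=1e-6):
    """Parameter-free LayerNorm over the feature axis (assumed form)."""
    h = h - h.mean(axis=-1, keepdims=True)
    return h / np.sqrt(h.var(axis=-1, keepdims=True) + eps)

def positive_subspace(pos_hidden, rank):
    """Estimate a rank-`rank` positive subspace from positive-token hidden states.

    pos_hidden: array of shape (n_tokens, d).
    Returns (mean, basis), where basis has shape (d, rank) and orthonormal columns.
    """
    x = layer_norm_center(pos_hidden)
    mean = x.mean(axis=0, keepdims=True)
    # Right singular vectors of the centered data are the PCA directions.
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:rank].T

def projection_residual(neg_hidden, mean, basis):
    """Fraction of each negative token's energy outside the positive subspace."""
    x = layer_norm_center(neg_hidden) - mean
    residual = x - x @ basis @ basis.T  # component in the orthogonal complement
    energy = (x ** 2).sum(axis=1) + 1e-12
    return (residual ** 2).sum(axis=1) / energy
```

A residual near 0 means the negative token lies almost entirely inside the estimated positive subspace, so it would be shielded from suppression; a residual near 1 marks a direction specific to the error.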
Algorithmically, ResRL employs:
- Positive subspace estimation via truncated SVD/PCA using sampled positive tokens, capturing dominant semantic directions.
- Token-wise projection residual computation for negatives, mapped to normalized quantile scores for robust gating.
- Dynamic adjustment of NSR weights so that destructive negative overlap is attenuated, offering precision-diversity tradeoff and reducing variance in long-horizon optimization.
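As a rough illustration of the gating step in the list above, the sketch below maps per-token residuals to rank-based quantile scores and uses them to scale a negative-sample weight. The linear gate, the `floor` parameter, and the function names are hypothetical; the source does not specify the exact mapping from quantile scores to weights.

```python
import numpy as np

def quantile_scores(residuals):
    """Map residuals to (0, 1] by their rank within the group.

    The smallest residual gets 1/n and the largest gets 1. Ties are broken
    by position, which is a simplification.
    """
    residuals = np.asarray(residuals, dtype=float)
    ranks = np.argsort(np.argsort(residuals))  # 0 = smallest residual
    return (ranks + 1) / residuals.size

def gated_negative_weights(residuals, base_weight=1.0, floor=0.0):
    """Scale the negative-sample weight per token (hypothetical linear gate).

    Tokens with small residuals (aligned with the positive subspace) receive
    weights near `base_weight * floor`; tokens with large residuals receive
    weights near `base_weight`.
    """
    q = quantile_scores(residuals)
    return base_weight * (floor + (1.0 - floor) * q)
```

Rank-based scores depend only on the ordering of residuals within a group, so a few extreme residual values cannot dominate the weights.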
The use of length-scaled reward mechanisms ensures stability against verbosity exploitation, while group-relative quantile normalization maintains consistency across variable-length rollouts.
Empirical Results and Ablations
ResRL was evaluated on twelve benchmarks across mathematics, code generation, agent tasks, and function calling, using several Qwen backbone variants (1.7B to 32B). Key results include:
- Mathematical Reasoning: On Qwen3-4B, ResRL outperforms NSR on Avg@16 by 9.4%, and Pass@128 by 7.0%. For Qwen3-8B, ResRL achieves best overall accuracy, with critical improvements concentrated on challenging AIME datasets (up to 27.8% boost over FlowRL).
- Code Generation: Sets a new SOTA on CodeForces: ResRL achieves a rating of 1469.5, surpassing NSR by 9.6%, and improves percentile ranking by 13.9%. HumanEval+ is saturated, but ResRL remains slightly dominant.
- Agentic Tasks: On ALFWorld using Qwen2.5-7B-Instruct, ResRL reaches 86.7% success rate, outperforming PPO and EMPG by significant margins. WebShop success rate is also improved.
- Function Calling: ResRL demonstrates highest Multi-Turn OA on BFCL, outperforming ResT and NSR, especially on error-sensitive subsets (Miss Func and Miss Param), confirming robust multi-step tool-use planning.
Comprehensive ablations validate that intermediate-rank positive subspaces (e.g., k=64) produce optimal balance between discrimination and protection, and that penultimate-layer representations best capture robust semantic signals. LayerNorm and centering are indispensable for gradient stability. Removing KL penalty does not destabilize ResRL due to its intrinsic regularization from projection-based gating.
Practical and Theoretical Implications
ResRL directly addresses the long-standing challenge of destructive negative-positive overlap in policy optimization for RLVR. It establishes a rigorous mechanism for semantic decoupling, minimizing cross-sign interference while preserving generalization and output diversity. The projection-residual reweighting enables consistent performance gains in both reliability-centric (Avg@16, Pass@1) and diversity-centric (Pass@k) metrics across all tested domains.
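For readers less familiar with the two metric families, the sketch below shows one common way to compute them from n sampled completions per problem. It uses the widely used unbiased Pass@k estimator, 1 - C(n-c, k)/C(n, k). Whether the paper uses exactly this estimator is an assumption, and the function names are illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples, of which c are correct.

    Equals the probability that a random size-k subset of the n samples
    contains at least one correct completion.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)

def avg_at_n(correct_flags):
    """Mean accuracy over n samples (e.g. Avg@16 when n = 16)."""
    return sum(correct_flags) / len(correct_flags)
```

Avg@n rewards being right on every sample, while Pass@k only needs one success among k, which is why the latter is sensitive to output diversity.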
Practically, ResRL's modular design, low-rank efficiency, and independence from explicit KL penalty make it scalable and computationally tractable for long-horizon, high-budget RLVR regimes. Theoretical advances in conservative gradient bounding suggest future refinements in compositional reasoning and multi-task learning, as well as improved robustness against reward hacking and mode collapse.
Further developments may entail adaptive rank selection, integration with uncertainty-modulated exploration signals, and extension to hierarchical multi-turn agent paradigms. The underlying geometric principles could inform advanced architectures for disentangled skill acquisition and continual learning.
Conclusion
ResRL represents a principled token-level reinforcement paradigm that, via projection residuals and representation-space decoupling, delivers robust reasoning improvements under RLVR, without compromising generation diversity. Empirical superiority over GRPO, NSR, and FlowRL is achieved across mathematics, code, agentic, and function calling benchmarks, validating its efficacy. The approach suggests avenues for scalable RLVR, stable long-horizon optimization, and more reliable semantic control in LLM post-training (2605.00380).