
On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral

Published 3 Dec 2025 in cs.CL | (2512.04220v1)

Abstract: Tool-integrated (TI) reinforcement learning (RL) enables LLMs to perform multi-step reasoning by interacting with external tools such as search engines and retrievers. Group Relative Policy Optimization (GRPO), exemplified by the recent Search-R1, offers fast convergence and a value-free formulation that makes it appealing for this setting, yet consistently suffers from training collapse. We identify Lazy Likelihood Displacement (LLD), a systematic reduction or stagnation in the likelihood of both correct and incorrect responses, as the core mechanism driving this failure. LLD emerges early and triggers a self-reinforcing LLD Death Spiral, where declining likelihood leads to low-confidence responses, inflating gradients, and ultimately causing collapse. We empirically characterize this process across models on a Search-R1-style, search-integrated question answering task, revealing a consistent three-phase trajectory: early stagnation, steady decay, and accelerated collapse. To address this, we propose a lightweight likelihood-preserving regularization LLDS for GRPO that activates only when a trajectory's likelihood decreases, and regularizes only the tokens responsible. This fine-grained structure mitigates LLD with minimal interference to optimization. Across seven open-domain and multi-hop QA benchmarks, our method stabilizes training, prevents gradient explosion, and yields substantial performance improvements, including +37.8% gains on Qwen2.5-3B and +32.0% gains on Qwen2.5-7B. Our results establish LLD as a fundamental bottleneck in GRPO-based TIRL and provide a practical path toward stable, scalable training of tool-integrated LLMs.

Summary

  • The paper identifies Lazy Likelihood Displacement (LLD) as the central driver of training collapse in GRPO-based tool-integrated RL.
  • It details a three-phase collapse—early stagnation, steady decay, and accelerated collapse—highlighting the feedback loop in multi-turn RL.
  • The paper introduces LLDS, a targeted regularizer that counters negative gradients and improves stability, yielding substantial EM gains across QA tasks.

Analysis of GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral

Problem Context: Instability in GRPO-Based Tool-Integrated RL

Tool-integrated reinforcement learning (RL) frameworks are driving progress in LLM architectures by enabling models to interact with external resources such as search engines and APIs. Recent systems like Search-R1 leverage Group Relative Policy Optimization (GRPO) to achieve fast convergence and avoid reliance on value-function estimation. However, empirical evidence shows that GRPO, when applied to tool-integrated RL (TIRL), is uniquely prone to catastrophic training failures—characterized by reward collapse and divergence, especially in multi-turn settings typical of real-world agentic LLMs.

The root causes of this instability have been largely unaddressed. Prior hypotheses focused on low-likelihood errors and instability from incorrect responses, but lacked a satisfying structural account. The present work systematically investigates and isolates the mechanism underlying these failures, providing a precise, actionable diagnosis and mitigation strategy.

Core Finding: Lazy Likelihood Displacement and the Death Spiral

The central discovery is the identification and empirical characterization of Lazy Likelihood Displacement (LLD) as the fundamental driver of collapse in GRPO-based TIRL. LLD is defined as a regime in which the likelihood of both correct and incorrect responses stagnates or decays throughout policy optimization—contradicting the intended reward-driven improvement. In particular, LLD for correct responses (ε ≤ 0) initiates a self-reinforcing feedback loop: as correct response likelihood decreases, gradients become inflated, gradient norms explode, and training rapidly collapses.

Three phases are consistently observed across models and tasks:

  • Phase I (Early Stagnation): Log-likelihoods of correct responses remain nearly static despite early reward increase, marking the silent onset of LLD.
  • Phase II (Steady Decay): A gradual, monotonic decrease in correct-response likelihood unfolds, even as reward appears to plateau.
  • Phase III (Accelerated Collapse): Likelihoods drop sharply, gradient norms spike, entropy climbs, and the model enters an unrecoverable state.

This process is labeled the LLD Death Spiral: training is stable until “silent” LLD accumulates, leading to a tipping point where instability is rapidly amplified.
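As a rough illustration of how this trajectory could be flagged in practice, the sketch below classifies a window of mean correct-response log-likelihoods into the three phases. This is not the paper's diagnostic code; the function name and the threshold values are illustrative assumptions.

```python
def classify_lld_phase(loglik_history, stagnation_tol=0.01, collapse_rate=0.05):
    """Heuristic phase label for the LLD death spiral, given a window of
    mean correct-response log-likelihoods (oldest first, most recent last).
    Thresholds are illustrative, not taken from the paper."""
    if len(loglik_history) < 2:
        return "insufficient data"
    # Average per-step change in log-likelihood over the window.
    deltas = [b - a for a, b in zip(loglik_history, loglik_history[1:])]
    mean_delta = sum(deltas) / len(deltas)
    if abs(mean_delta) <= stagnation_tol:
        return "Phase I: early stagnation"      # likelihood nearly static
    if -collapse_rate < mean_delta < -stagnation_tol:
        return "Phase II: steady decay"         # slow monotonic decline
    if mean_delta <= -collapse_rate:
        return "Phase III: accelerated collapse"
    return "improving"                          # likelihood rising: no LLD

# Example: a steady decline of ~0.02 nats/step falls in Phase II.
print(classify_lld_phase([-1.00, -1.02, -1.04, -1.06]))  # → Phase II: steady decay
```

A monitor like this only reacts once decay is visible; the paper's point is that Phase I is "silent", so likelihood must be tracked from the start of training rather than inferred from reward curves.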

Mechanistic Insight: Source of LLD in Tool-Integrated GRPO

The work provides both informal and formal arguments for why LLD arises in tool-integrated GRPO settings. Notably, two structural phenomena are implicated:

  1. Low-Likelihood Incorrect Responses: Incorrect trajectories with very small predicted likelihoods receive disproportionately large weighting in the gradient sum, amplifying the negative component of the gradient that suppresses correct response probabilities.
  2. Embedding Similarity: In tool-integrated tasks, correct response actions often appear as subcomponents of incorrect responses—their embeddings and token contexts overlap substantially. This amplifies the effect of negative gradients since structurally similar wrong actions penalize correct actions more heavily under group-relative update schemes like GRPO.

Additionally, the paper finds that correct actions embedded within otherwise incorrect responses are particularly susceptible to penalization, deepening the LLD effect and destabilizing training. The presence of external tool feedback (i.e., out-of-distribution tokens) exacerbates this problem by increasing likelihood uncertainty and variance in reward attribution.
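A toy calculation (my illustration, not the paper's derivation) shows why very low-likelihood incorrect trajectories dominate the update. For a negatively advantaged sequence, a policy-gradient loss term of the form -A·log π(y) has probability-space sensitivity |A|/π(y), so driving down an already-unlikely wrong trajectory produces a far larger gradient contribution than driving down a confident one:

```python
def grad_magnitude_wrt_prob(advantage, seq_prob):
    """|d/dπ (-A · log π)| = |A| / π: the probability-space gradient
    magnitude of one trajectory's policy-gradient term (toy model)."""
    return abs(advantage) / seq_prob

# Two incorrect trajectories with the same negative advantage:
confident_wrong = grad_magnitude_wrt_prob(-1.0, 0.5)      # π = 0.5
unlikely_wrong = grad_magnitude_wrt_prob(-1.0, 2 ** -10)  # π ≈ 0.001

print(confident_wrong)  # → 2.0
print(unlikely_wrong)   # → 1024.0, a 512x larger contribution
```

When such an inflated negative gradient lands on tokens whose embeddings overlap with those of correct responses, the correct responses are suppressed along with the wrong ones, which is precisely the displacement mechanism described above.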

Method: Likelihood-Preserving Regularization (LLDS)

To address LLD, the authors introduce LLDS, a highly targeted regularizer that penalizes reductions in trajectory likelihood only when they occur, and only on the tokens responsible. The regularization operates at two levels:

  • Token-Level Selectivity: Penalization is imposed only on tokens whose likelihood has decreased.
  • Response-Level Gating: Regularization is applied exclusively to trajectories whose total likelihood has declined.

A further variant, LLDS-MA, excludes answer tokens from regularization to incentivize deeper tool-use by not constraining final answer generation. The overall objective augments the standard GRPO loss with a weighted LLDS term.

This approach contrasts with prior, coarser forms of likelihood regularization: it blocks only those updates that genuinely harm model likelihood, leaving the rest of optimization unconstrained while stably deterring the LLD spiral.
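The two-level gating described above can be sketched in plain Python over per-token log-probabilities. This is a minimal sketch, not the paper's exact loss: the function name, the squared-drop penalty, and the mask convention are my assumptions; only the response-level gate, token-level selectivity, and the LLDS-MA answer-token exclusion follow the description.

```python
def llds_penalty(old_token_logps, new_token_logps, answer_mask=None):
    """Sketch of LLDS. Fires only if (a) the trajectory's total
    log-likelihood declined (response-level gating) and, within it,
    (b) only on tokens whose log-prob decreased (token-level selectivity).
    With answer_mask set (LLDS-MA variant), answer tokens are excluded."""
    # Response-level gate: no penalty if total likelihood did not decline.
    if sum(new_token_logps) >= sum(old_token_logps):
        return 0.0
    penalty = 0.0
    for i, (old_lp, new_lp) in enumerate(zip(old_token_logps, new_token_logps)):
        if answer_mask is not None and answer_mask[i]:
            continue                  # LLDS-MA: leave answer tokens free
        drop = old_lp - new_lp
        if drop > 0:                  # token-level selectivity
            penalty += drop ** 2      # squared drop: illustrative choice
    return penalty

# Trajectory whose total log-likelihood fell from -3.5 to -4.5:
old = [-1.0, -2.0, -0.5]
new = [-1.5, -2.0, -1.0]
print(llds_penalty(old, new))                         # → 0.5 (two dropped tokens)
print(llds_penalty(old, new, answer_mask=[0, 0, 1]))  # → 0.25 (answer token skipped)
```

In training, a weighted version of this term would be added to the standard GRPO loss, consistent with the augmented objective the paper describes.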

Empirical Validation: Stability and Substantial Performance Gains

Comprehensive experimentation across seven open-domain and multi-hop QA datasets, using Qwen2.5-3B and Qwen2.5-7B model families (Base and Instruct variants), substantiates both the prevalence of the LLD failure mode and the effectiveness of the LLDS regularizer.

Key numerical results:

  • Qwen2.5-3B-Base (NQ+Hotpot setting): Applying LLDS-MA yields a +37.8% relative EM gain over vanilla GRPO.
  • Qwen2.5-7B-Base (NQ+Hotpot setting): LLDS raises average EM from 0.350 to 0.462 (+32.0% improvement).
  • Strong performance gains are consistent across both general (single-hop) and complex multi-hop QA tasks, outstripping comparably sized PPO-trained baselines as well as other retrieval-augmented and supervised, instruction-tuned approaches.

The presence and prevention of LLD were validated by direct measurements: likelihood and entropy dynamics, gradient norms, reward curves, and analysis of per-sample likelihood changes during policy evolution. LLDS effectively quashes gradient explosion, maintains monotonic reward improvement, and encourages deeper multi-step reasoning when configured to do so.

Theoretical and Practical Implications

The identification of LLD as a primary bottleneck in GRPO-driven TIRL is theoretically significant. It motivates the need for direct control over token-level likelihoods—not just reward improvement or value estimation—especially in contexts involving multi-turn tool use and highly entangled trajectories. The findings suggest that value-free, group-relative RL methods require explicit stabilization techniques when deployed in realistic, agentic LLM settings.

Practically, LLDS is simple to implement, model-agnostic, and introduces minimal interference with standard optimization. The results indicate that RL-based training for tool-integrated LLMs is broadly viable and can yield robust, efficient multi-step reasoning when stabilized by trajectory-level likelihood monitoring and control.

Future Directions

This analysis opens several lines of research:

  • Generalizing LLDS across other value-free RL frameworks and with more complex, hierarchical tool-use agents.
  • Adaptive regularization scaling and structurally-aware penalty design for diverse reasoning tasks (e.g., program synthesis, multimodal agent environments).
  • Theoretical convergence analysis of likelihood-regularized RL, especially in the high-gradient, low-likelihood regime characteristic of agentic LLMs.
  • Integrating likelihood-monitoring diagnostics into standard RL benchmarks to uncover LLD-like instabilities in other domains.

Conclusion

The study provides a rigorous account of training instability in tool-integrated GRPO-based RL, attributing the principal failure mode to Lazy Likelihood Displacement and the resultant death spiral. The proposed LLDS regularization robustly mitigates this instability, yielding substantial empirical gains in open-domain and multi-hop question answering tasks. Control of trajectory-level likelihood emerges as a key ingredient for stable and scalable RL optimization in agentic, tool-augmented LLMs, pointing toward a new direction for robust and reliable reinforcement learning grounded in likelihood dynamics.
