- The paper proposes CPGD, a new algorithm that uses a clipped policy gradient loss and a policy drift regularizer to stabilize reinforcement learning for language models.
- It demonstrates significant improvements in training stability and performance compared to methods like GRPO and REINFORCE++ on multimodal benchmarks.
- Theoretical convergence proofs and dynamic weighting of advantages offer actionable insights for integrating stable RL into complex language model training.
CPGD: Toward Stable Rule-Based Reinforcement Learning for LLMs
Introduction
"CPGD: Toward Stable Rule-based Reinforcement Learning for LLMs" addresses instability in rule-based reinforcement learning (RL) for large language models (LLMs), focusing on problems observed in current RL methods such as GRPO, REINFORCE++, and RLOO. The paper presents Clipped Policy Gradient Optimization with Policy Drift (CPGD), an algorithm designed to stabilize policy learning by combining a policy drift constraint with a clip mechanism that prevents excessive policy updates.
Methodology
CPGD Algorithm
CPGD aims to stabilize policy updates during the training of LMs. The key components of CPGD include:
- Policy Gradient Loss with Clipping: Unlike the standard policy gradient approaches involving a direct policy ratio, CPGD utilizes a clipped policy gradient loss. The clip mechanism ensures that policy updates are constrained:
$$\Phi_\theta(x,y)=\min\!\left(\ln\frac{\pi_\theta(y\mid x)}{\pi_{\theta_{\text{old}}}(y\mid x)}\cdot A^{\text{CPGD}}(x,y),\ \operatorname{clip}_{\ln(1-\epsilon)}^{\ln(1+\epsilon)}\!\left(\ln\frac{\pi_\theta(y\mid x)}{\pi_{\theta_{\text{old}}}(y\mid x)}\right)\cdot A^{\text{CPGD}}(x,y)\right)$$
- Policy Drift Regularizer: This involves using the KL divergence between the new and old policies to create a penalty term that prevents large deviations.
$$D_{\mathrm{KL}}(\pi_{\theta_{\text{old}}},\pi_\theta\mid x)=\mathbb{E}_{y\sim\pi_{\theta_{\text{old}}}(\cdot\mid x)}\!\left[\ln\frac{\pi_{\theta_{\text{old}}}(y\mid x)}{\pi_\theta(y\mid x)}\right]$$
- Weighted Advantages: CPGD incorporates a dynamic weighting factor for the advantages, improving model performance by giving more importance to more informative samples.
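The three components above can be sketched for a single sample in plain Python. This is a minimal illustration, not the paper's implementation: the function name `cpgd_loss`, the default values for `eps` and `beta`, and the single-sample KL estimate are all assumptions for demonstration; in practice these quantities would be computed per token over a batch, and the advantage passed in would already carry the dynamic weighting.

```python
import math

def cpgd_loss(logp_new, logp_old, advantage, eps=0.2, beta=0.1):
    """Sketch of a CPGD-style objective for one sample (hypothetical API).

    logp_new / logp_old: log pi_theta(y|x) and log pi_theta_old(y|x).
    advantage: the (already weighted) advantage A^CPGD(x, y).
    eps: clip range for the log-ratio; beta: policy-drift weight.
    Returns (clipped_pg_term, drift_penalty).
    """
    # Log-ratio ln(pi_theta / pi_theta_old); zero when the policies agree.
    log_ratio = logp_new - logp_old
    # Clip the log-ratio into [ln(1 - eps), ln(1 + eps)].
    clipped = min(max(log_ratio, math.log(1 - eps)), math.log(1 + eps))
    # Pessimistic min over the unclipped and clipped terms, as in Phi above.
    pg = min(log_ratio * advantage, clipped * advantage)
    # One-sample estimate of KL(pi_old || pi_new) = E_{pi_old}[ln pi_old - ln pi_new].
    drift = logp_old - logp_new
    return pg, beta * drift
```

Using the log of the ratio rather than the raw ratio keeps the clipped term bounded and symmetric around zero, which is the source of CPGD's added stability relative to ratio-based clipping.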
Theoretical Foundations
The paper provides a theoretical convergence proof for CPGD. It guarantees that the sequence of policies generated by CPGD will converge, adding to the algorithm's stability and reliability.
Experimental Evaluation
The experiments demonstrate CPGD's effectiveness across multiple multimodal reasoning benchmarks, including MathVista, MathVerse, MathVision, and MMK12. CPGD consistently outperformed existing RL methods such as GRPO and REINFORCE++, achieving marked improvements in both training stability and final performance, and providing a robust alternative to existing methods.
Implementation Details
CPGD can be efficiently integrated into various training frameworks. Here are some key implementation considerations: