- The paper proposes CPGD, a new algorithm that uses a clipped policy gradient loss and a policy drift regularizer to stabilize reinforcement learning for language models.
- It demonstrates significant improvements in training stability and performance compared to methods like GRPO and REINFORCE++ on multimodal benchmarks.
- Theoretical convergence proofs and dynamic weighting of advantages offer actionable insights for integrating stable RL into complex language model training.
CPGD: Toward Stable Rule-Based Reinforcement Learning for LLMs
Introduction
"CPGD: Toward Stable Rule-based Reinforcement Learning for LLMs" addresses instability in rule-based reinforcement learning (RL) for large language models (LLMs), focusing on problems observed in current RL methods such as GRPO, REINFORCE++, and RLOO. The paper presents Clipped Policy Gradient Optimization with Policy Drift (CPGD), an algorithm designed to stabilize policy learning by combining a policy drift constraint with a clip mechanism that prevents excessive policy updates.
Methodology
CPGD Algorithm
CPGD aims to stabilize policy updates during the training of LMs. The key components of CPGD include:
- Policy Gradient Loss with Clipping: Unlike the standard policy gradient approaches involving a direct policy ratio, CPGD utilizes a clipped policy gradient loss. The clip mechanism ensures that policy updates are constrained:
$$\Phi_\theta(x,y)=\min\!\left(\ln\frac{\pi_\theta(y\mid x)}{\pi_{\theta_{\text{old}}}(y\mid x)}\cdot A^{\text{CPGD}}(x,y),\ \operatorname{clip}_{\ln(1-\epsilon)}^{\ln(1+\epsilon)}\!\left(\ln\frac{\pi_\theta(y\mid x)}{\pi_{\theta_{\text{old}}}(y\mid x)}\right)\cdot A^{\text{CPGD}}(x,y)\right)$$
- Policy Drift Regularizer: This involves using the KL divergence between the new and old policies to create a penalty term that prevents large deviations.
$$D_{\mathrm{KL}}(\pi_{\theta_{\text{old}}},\pi_\theta\mid x)=\mathbb{E}_{y\sim\pi_{\theta_{\text{old}}}(\cdot\mid x)}\!\left[\ln\frac{\pi_{\theta_{\text{old}}}(y\mid x)}{\pi_\theta(y\mid x)}\right]$$
- Weighted Advantages: CPGD incorporates a dynamic weighting factor for the advantages, improving model performance by giving more importance to more informative samples.
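The three components above can be sketched for a single sample in plain Python. This is a minimal illustration, not the paper's implementation: the function name `cpgd_loss`, the default values for `eps` and `beta`, and the single-sample KL estimate are all assumptions for demonstration; in practice these quantities would be computed per token over a batch, and the advantage passed in would already carry the dynamic weighting.

```python
import math

def cpgd_loss(logp_new, logp_old, advantage, eps=0.2, beta=0.1):
    """Sketch of a CPGD-style objective for one sample (hypothetical API).

    logp_new / logp_old: log pi_theta(y|x) and log pi_theta_old(y|x).
    advantage: the (already weighted) advantage A^CPGD(x, y).
    eps: clip range for the log-ratio; beta: policy-drift weight.
    Returns (clipped_pg_term, drift_penalty).
    """
    # Log-ratio ln(pi_theta / pi_theta_old); zero when the policies agree.
    log_ratio = logp_new - logp_old
    # Clip the log-ratio into [ln(1 - eps), ln(1 + eps)].
    clipped = min(max(log_ratio, math.log(1 - eps)), math.log(1 + eps))
    # Pessimistic min over the unclipped and clipped terms, as in Phi above.
    pg = min(log_ratio * advantage, clipped * advantage)
    # One-sample estimate of KL(pi_old || pi_new) = E_{pi_old}[ln pi_old - ln pi_new].
    drift = logp_old - logp_new
    return pg, beta * drift
```

Using the log of the ratio rather than the raw ratio keeps the clipped term bounded and symmetric around zero, which is the source of CPGD's added stability relative to ratio-based clipping.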
Theoretical Foundations
The paper provides a theoretical convergence proof for CPGD. It guarantees that the sequence of policies generated by CPGD will converge, adding to the algorithm's stability and reliability.
Experimental Evaluation
The experiments demonstrate CPGD's effectiveness across multiple multimodal reasoning benchmarks, including MathVista, MathVerse, MathVision, and MMK12. CPGD consistently outperformed existing RL methods such as GRPO and REINFORCE++, achieving marked improvements in both training stability and final performance, and providing a robust alternative to existing methods.
Implementation Details
CPGD can be efficiently integrated into various training frameworks. Here are some key implementation considerations: