On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning

Published 23 May 2025 in cs.LG, cs.AI, and cs.CL | (2505.17508v2)

Abstract: Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of LLMs. KL regularization is ubiquitous, yet the design surface, choice of KL direction (forward vs. reverse), normalization (normalized vs. unnormalized), and estimator ($k_1/k_2/k_3$), is scattered across the literature and often intertwined with off-policy estimation. We ask a focused question: under the off-policy setting, what weighting is required for each KL variant so that the surrogate we optimize yields the exact gradient of the intended KL-regularized objective? We answer this with a compact, unified derivation we call the Regularized Policy Gradient (RPG) view. RPG (i) unifies normalized and unnormalized KL variants and shows that the widely-used $k_3$ penalty is exactly the unnormalized KL; (ii) specifies conditions under which REINFORCE-style losses with stop-gradient are gradient-equivalent to fully differentiable surrogates; (iii) identifies and corrects an off-policy importance-weighting mismatch in GRPO's KL term; and (iv) introduces RPG-Style Clip, a truncated-importance-sampling step within RPG-REINFORCE that enables stable, off-policy policy-gradient training at scale. On mathematical reasoning benchmarks (AIME24, AIME25), RPG-REINFORCE with RPG-Style Clip improves accuracy by up to $+6$ absolute percentage points over DAPO. Notably, RPG is a stable and scalable RL algorithm for LLM reasoning, realized via (a) a KL-correct objective, (b) truncated importance sampling, and (c) an iterative reference-policy update scheme.


Summary

The paper introduces the Regularized Policy Gradient (RPG) framework, integrating KL divergences to stabilize LLM training.
It derives fully differentiable and REINFORCE-style loss functions for the off-policy setting.
Extensive experiments demonstrate RPG's enhanced performance on mathematical reasoning benchmarks compared to existing methods.


The paper "On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning" studies how Kullback-Leibler (KL) regularization should be built into policy gradient algorithms for LLM reasoning. It compares forward and reverse KL formulations, in normalized and unnormalized form, and asks how each must be weighted in the off-policy setting so that the optimized surrogate yields the gradient of the intended objective. This essay reviews the paper's methodological contributions, implementation insights, and empirical evaluations.

Regularized Policy Gradient Framework

The paper introduces the Regularized Policy Gradient (RPG) framework, which systematically derives and analyzes policy gradient methods regularized by KL divergences. The framework covers both forward and reverse KL, each in a normalized and an unnormalized variant, so a single derivation yields a family of losses suited to online reinforcement learning with off-policy samples.
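The abstract's statement that the $k_3$ penalty is exactly the unnormalized KL can be checked numerically on a small discrete example. The sketch below assumes the common definitions $k_1 = -\log r$, $k_2 = \tfrac{1}{2}(\log r)^2$, $k_3 = (r - 1) - \log r$ with $r = p(x)/q(x)$ and $x \sim q$, and the unnormalized KL $\mathrm{UKL}(q \,\|\, p) = \sum_x \left[ q \log (q/p) - q + p \right]$; the paper's own notation may differ, and the distributions `q`, `p`, and `p_unnorm` are made up for illustration.

```python
import numpy as np

def kl(q, p):
    # Normalized KL(q || p) as an exact sum over a discrete support.
    return float(np.sum(q * np.log(q / p)))

def ukl(q, p):
    # Unnormalized KL: adds the mass-correction term (sum p - sum q).
    return float(np.sum(q * np.log(q / p) - q + p))

def k1(r):
    return -np.log(r)

def k2(r):
    # Low-variance but biased approximation; shown for completeness.
    return 0.5 * np.log(r) ** 2

def k3(r):
    return (r - 1.0) - np.log(r)

# Illustrative (hypothetical) distributions on three outcomes.
q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])
p_unnorm = 0.9 * p  # total mass 0.9, so KL and UKL no longer coincide

# Exact expectations under q, with r = p / q.
e_k1 = float(np.sum(q * k1(p / q)))
e_k3 = float(np.sum(q * k3(p / q)))
e_k3_unnorm = float(np.sum(q * k3(p_unnorm / q)))

print("E[k1] vs KL :", e_k1, kl(q, p))
print("E[k3] vs UKL:", e_k3, ukl(q, p))
print("unnormalized p, E[k3] vs UKL vs KL:",
      e_k3_unnorm, ukl(q, p_unnorm), kl(q, p_unnorm))
```

When both arguments are normalized, the mass-correction term vanishes and the two divergences agree; the difference only appears once one of them does not sum to one.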

Figure 1: An overview of RPG's iterative training framework focusing on KL-regularized objectives for LLM reasoning tasks.

Derivation and Implementation Approaches

RPG constructs surrogate loss functions whose gradients correspond to KL-regularized objectives, covering both the normalized (KL) and unnormalized (UKL) formulations. The derivations produce two kinds of losses:

Fully Differentiable Losses: These surrogates are built so that differentiating them directly yields the gradient of the original KL-regularized objective, which makes them usable with standard gradient-based optimizers. For instance, the forward-KL surrogate maximizes expected reward while penalizing divergence from the reference policy.
REINFORCE-Style Estimators: These losses handle off-policy data through importance sampling and apply a stop-gradient operator to the per-sample weight, so that only the log-probability term is differentiated. The abstract notes that, under stated conditions, they are gradient-equivalent to the fully differentiable surrogates.
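The relationship between the two loss families can be illustrated on a toy softmax policy. The following is a minimal sketch, not the paper's exact losses: it assumes a reverse-KL-regularized objective $\mathbb{E}_{\pi_\theta}[R] - \beta\,\mathrm{KL}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$, uses exact sums in place of sampling, and emulates the stop-gradient by evaluating the weight at a frozen copy of the parameters. All names (`rewards`, `pi_old`, `pi_ref`, `beta`, `theta0`) are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def finite_diff_grad(f, theta, eps=1e-6):
    # Central finite differences; adequate for a four-parameter toy problem.
    g = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return g

# Hypothetical toy setup: four actions, fixed behaviour and reference policies.
rewards = np.array([1.0, 0.0, 0.5, -0.5])
pi_old = softmax(np.array([0.2, -0.1, 0.0, 0.3]))
pi_ref = softmax(np.array([0.0, 0.1, -0.2, 0.1]))
beta = 0.1
theta0 = np.array([0.5, -0.3, 0.1, 0.0])

def differentiable_loss(theta):
    # Expectation under pi_old of w * (R - beta * log(pi/pi_ref)), negated.
    pi = softmax(theta)
    w = pi / pi_old
    return -np.sum(pi_old * w * (rewards - beta * np.log(pi / pi_ref)))

def reinforce_loss(theta, theta_sg):
    # theta_sg plays the role of stop-gradient: the coefficient is constant
    # with respect to theta, and only log pi_theta is differentiated.
    pi_sg = softmax(theta_sg)
    w_sg = pi_sg / pi_old
    # The +1 comes from differentiating log pi_theta inside the KL term.
    coeff = w_sg * (rewards - beta * (np.log(pi_sg / pi_ref) + 1.0))
    return -np.sum(pi_old * coeff * np.log(softmax(theta)))

g_diff = finite_diff_grad(differentiable_loss, theta0)
g_reinf = finite_diff_grad(lambda t: reinforce_loss(t, theta0), theta0)

print("fully differentiable gradient:", g_diff)
print("REINFORCE-style gradient:     ", g_reinf)
```

The two losses take different values, but their gradients at `theta0` agree up to finite-difference error, which is the sense of "gradient-equivalent" used here. In an autodiff framework the frozen copy would be replaced by that framework's stop-gradient or detach operation.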

Figure 2: Comparative results highlighting RPG variants' performance in efficient learning against established baselines.

Theoretical Insights and Corrections

One significant theoretical contribution is identifying and correcting an inconsistency in how the Group Relative Policy Optimization (GRPO) algorithm estimates the gradient of its KL term. In the off-policy setting, that term is not importance-weighted consistently with the intended objective; the paper proposes a corrected estimator that applies the importance weight properly.
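A small numerical example suggests why the weighting matters. The sketch below works at the level of expected values rather than gradients, so it illustrates the general issue and is not the paper's corrected estimator. It assumes the penalty is the $k_3$ form $(r - 1) - \log r$ with $r = \pi_{\mathrm{ref}}/\pi_\theta$, and that samples come from an older policy $\pi_{\mathrm{old}}$; the three distributions are made up.

```python
import numpy as np

def k3(r):
    return (r - 1.0) - np.log(r)

# Hypothetical policies over three outcomes.
pi_theta = np.array([0.6, 0.3, 0.1])    # current policy
pi_old = np.array([0.3, 0.4, 0.3])      # policy that generated the samples
pi_ref = np.array([0.5, 0.25, 0.25])    # reference policy

# Target quantity: reverse KL(pi_theta || pi_ref).
exact_kl = float(np.sum(pi_theta * np.log(pi_theta / pi_ref)))

# Per-outcome k3 penalty.
penalty = k3(pi_ref / pi_theta)

# Averaging over pi_old samples WITHOUT importance weights.
unweighted = float(np.sum(pi_old * penalty))

# Averaging over pi_old samples WITH weights pi_theta / pi_old.
weights = pi_theta / pi_old
weighted = float(np.sum(pi_old * weights * penalty))

print("exact KL  :", exact_kl)
print("unweighted:", unweighted)
print("weighted  :", weighted)
```

Only the weighted average recovers the intended divergence; the unweighted one measures a different quantity whenever `pi_old` differs from `pi_theta`.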

Computational Considerations

RPG does not need to keep two models in memory at once. It reuses probabilities precomputed under a previous policy iteration rather than holding a second model alongside the one being trained, which reduces the memory burden typical of reference-based regularization methods.
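In practice this means the importance ratio can be formed from a cached array of log-probabilities instead of a second forward pass. The sketch below shows one way this might look, combined with generic truncated importance sampling (capping the ratio from above). The function name, the `cap` value, and the sample numbers are assumptions for illustration; the exact rule used by RPG-Style Clip is not given in this summary and may differ.

```python
import numpy as np

def truncated_importance_weights(logp_new, logp_old_cached, cap=2.0):
    """Importance ratios pi_new / pi_old, truncated from above at `cap`.

    `logp_old_cached` is assumed to have been stored when the samples were
    generated, so the older policy itself does not need to stay in memory.
    """
    log_ratio = np.asarray(logp_new) - np.asarray(logp_old_cached)
    return np.minimum(np.exp(log_ratio), cap)

# Hypothetical per-sample log-probabilities.
logp_old_cached = np.log(np.array([0.2, 0.5, 0.1]))  # saved at sampling time
logp_new = np.log(np.array([0.3, 0.4, 0.4]))         # from the current policy

weights = truncated_importance_weights(logp_new, logp_old_cached)
print(weights)
```

Working in log space and exponentiating the difference avoids dividing two very small probabilities, which matters for long sequences.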

Empirical Evaluation

Experiments on LLM reasoning tasks support the effectiveness of the RPG framework. Compared against baselines such as GRPO and REINFORCE++, the RPG methods exhibit improved training stability and competitive performance. On mathematical reasoning benchmarks in particular, RPG sustains high performance, with its KL regularization strategies balancing exploration and exploitation.

Figure 3: Visualization of reward metrics and training dynamics demonstrating RPG's stability on LLM reasoning tasks.

Performance Metrics and Results

The experiments show strong performance of RPG methods across tasks, particularly on AMC23 and AIME24; the abstract reports gains of up to 6 absolute percentage points over DAPO on AIME24 and AIME25 for RPG-REINFORCE with RPG-Style Clip. These results underscore RPG's ability to maintain robust training dynamics, as shown by sustained reward metrics and control over policy entropy.

Conclusion

The RPG framework offers a principled approach to optimizing LLMs for complex reasoning tasks by building KL regularization into policy gradient methods in a gradient-consistent way. It combines improved empirical performance with theoretical insights and implementation strategies that use memory efficiently. Future work might extend these methods to broader applications involving sparse or complex reward structures, as are typical in natural language processing and decision-making.