Unbiased sequence-level reverse KL gradient estimation in off-policy RL

Develop and analyze an unbiased estimator of the gradient of the sequence-level reverse Kullback–Leibler divergence D_KL(πθ(·|x) || πref(·|x)) in off-policy reinforcement learning for large language model post-training, where samples are generated by a delayed sampling policy πθ_old. The estimator should be provably unbiased at the sequence level under such off-policy data.

Background

The paper studies how different KL divergence estimators, and whether they are placed in the reward or added directly to the loss, affect the gradients used in reinforcement learning post-training of LLMs. In on-policy settings, certain configurations yield unbiased gradients, while others introduce bias and training instability.
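To make the estimator choices concrete, here is a minimal sketch assuming the familiar k1/k2/k3 per-sample KL estimators from Schulman's "Approximating KL Divergence" note; the toy categorical policies and all variable names are illustrative assumptions, not code from the paper. Taking exact expectations over a small vocabulary (rather than Monte Carlo sampling) makes the bias properties visible directly:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
p = softmax(rng.normal(size=8))    # toy "policy" pi_theta over 8 tokens
q = softmax(rng.normal(size=8))    # toy reference policy pi_ref

true_kl = float(np.sum(p * np.log(p / q)))   # D_KL(p || q)

# Per-sample estimators evaluated at each token a (samples a ~ p),
# with the likelihood ratio r = q(a) / p(a):
r = q / p
k1 = -np.log(r)                 # unbiased, but can be negative; high variance
k2 = 0.5 * np.log(r) ** 2       # low variance, but biased
k3 = r - 1.0 - np.log(r)        # unbiased and nonnegative

# Exact expectations under p expose which estimators are unbiased:
assert np.isclose(np.sum(p * k1), true_kl)   # E_p[k1] = KL exactly
assert np.isclose(np.sum(p * k3), true_kl)   # E_p[k3] = KL exactly
print(np.sum(p * k2) - true_kl)              # E_p[k2] - KL: nonzero bias
```

The bias of k2 is what makes the choice of estimator, and whether its gradient is taken, consequential once it is placed in the loss rather than the reward.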

In off-policy (asynchronous) settings, the authors note that none of the four commonly used configurations produces an unbiased sequence-level reverse KL gradient. Although prior work suggests that token-level unbiasedness can be achieved via importance weighting, a correct, implemented sequence-level unbiased gradient estimator for the reverse KL under off-policy sampling is currently lacking, which motivates this explicit open problem.

References

Note that none of the four KL configurations listed above would give an unbiased sequence level reverse KL gradient estimate in off-policy settings. … Zhang (2025) note that an unbiased gradient estimate of the token-level reverse KL gradient can be induced by multiplying a token-level importance sampling ratio ω_t with the KL estimate, when added to the loss directly. We leave the implementation and analysis of an unbiased sequence level reverse KL gradient estimate in off-policy settings for future work.

A Comedy of Estimators: On KL Regularization in RL Training of LLMs  (2512.21852 - Shah et al., 26 Dec 2025) in Section 3, Subsection “Inspecting KL estimators”