Unbiased sequence-level reverse KL gradient estimation in off-policy RL
Develop and analyze an unbiased estimator for the gradient of the sequence-level reverse Kullback–Leibler divergence D_KL(πθ(·|x) || πref(·|x)) in off-policy reinforcement learning for large language model post-training. Samples are generated from a delayed sampling policy πθ_old rather than from the current policy πθ, and the estimator must be provably unbiased at the sequence level under such off-policy data.
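One candidate form, sketched here as an assumption rather than taken from the source: with the sequence-level importance ratio ω(y) = πθ(y|x)/πθ_old(y|x), differentiating the importance-weighted surrogate ω(y)·log(πθ(y|x)/πref(y|x)) under samples from πθ_old recovers the exact gradient, since the ω factor cancels the sampling density in expectation:

```latex
% Candidate surrogate (an assumption, not a result from the source).
% Let \omega(y) = \pi_\theta(y\mid x)/\pi_{\theta_{\mathrm{old}}}(y\mid x). Then
\nabla_\theta D_{\mathrm{KL}}\!\left(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\right)
 = \mathbb{E}_{y \sim \pi_{\theta_{\mathrm{old}}}(\cdot\mid x)}\!\left[
     \nabla_\theta\!\left(\omega(y)\,
       \log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}\right)\right],
% because
% \sum_y \pi_{\theta_{\mathrm{old}}}(y\mid x)\,
%   \nabla_\theta\!\bigl(\omega(y)\,\log r(y)\bigr)
% = \nabla_\theta \sum_y \pi_\theta(y\mid x)\,
%   \log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}.
```

Variance of this estimator under a stale πθ_old is a separate question; the identity above only addresses bias.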
References
Note that none of the four KL configurations listed above would give an unbiased sequence-level reverse KL gradient estimate in off-policy settings. … Zhang (2025) note that an unbiased gradient estimate of the token-level reverse KL can be obtained by multiplying the KL estimate by a token-level importance sampling ratio ω_t when it is added directly to the loss. We leave the implementation and analysis of an unbiased sequence-level reverse KL gradient estimator in off-policy settings to future work.
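As an illustration only (not the implementation left to future work), a toy categorical model, in which each discrete outcome stands in for a full sequence y, can check numerically that a sequence-level importance-weighted surrogate yields an unbiased gradient. All logit values below are made up, and the expectation over the stale sampling policy is computed exactly rather than by sampling:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Toy setup: 4 discrete outcomes stand in for whole sequences y.
theta     = [0.2, -1.0, 0.5, 0.0]   # current-policy logits (assumed values)
theta_old = [0.0, -0.8, 0.7, 0.1]   # stale sampling-policy logits
ref       = [0.3,  0.3, -0.2, 0.4]  # reference-policy logits

p     = softmax(theta)      # pi_theta
p_old = softmax(theta_old)  # pi_theta_old
q     = softmax(ref)        # pi_ref

log_r = [math.log(p[i] / q[i]) for i in range(4)]
D = sum(p[i] * log_r[i] for i in range(4))   # sequence-level reverse KL

# Closed-form gradient of D wrt theta_k via softmax calculus:
#   dD/dtheta_k = p_k * (log_r_k - D)
grad_true = [p[k] * (log_r[k] - D) for k in range(4)]

# Off-policy estimator: for y ~ pi_theta_old, differentiate the surrogate
#   omega(y) * log_r(y),  with omega(y) = p[y] / p_old[y].
# Its per-sample gradient is omega(y) * (e_y - p) * (log_r(y) + 1);
# taking the exact expectation over p_old checks unbiasedness.
grad_est = [0.0] * 4
for y in range(4):
    omega = p[y] / p_old[y]
    for k in range(4):
        score = (1.0 if k == y else 0.0) - p[k]   # d log p_y / d theta_k
        grad_est[k] += p_old[y] * omega * score * (log_r[y] + 1.0)
```

Here `grad_est` matches `grad_true` to floating-point precision, confirming the sequence-level ratio removes the off-policy bias in this toy case; the token-level analysis in the quoted passage is the corresponding per-token statement.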