Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards

Published 25 Jun 2025 in cs.LG and cs.CL | (2506.20520v1)

Abstract: Reinforcement learning (RL) is increasingly used to align LLMs. Off-policy methods offer greater implementation simplicity and data efficiency than on-policy techniques, but often result in suboptimal performance. In this work, we study the intermediate range of algorithms between off-policy RL and supervised fine-tuning by analyzing a simple off-policy REINFORCE algorithm, where the advantage is defined as $A=r-V$, with $r$ a reward and $V$ some tunable baseline. Intuitively, lowering $V$ emphasizes high-reward samples, while raising it penalizes low-reward ones more heavily. We first provide a theoretical analysis of this off-policy REINFORCE algorithm, showing that when the baseline $V$ lower-bounds the expected reward, the algorithm enjoys a policy improvement guarantee. Our analysis reveals that while on-policy updates can safely leverage both positive and negative signals, off-policy updates benefit from focusing more on positive rewards than on negative ones. We validate our findings experimentally in a controlled stochastic bandit setting and through fine-tuning state-of-the-art LLMs on reasoning tasks.

Summary

  • The paper shows that selecting a baseline below the expected reward yields stable, monotonic policy improvement while preventing premature support collapse.
  • It rigorously analyzes the impact of baseline choice, validated by bandit experiments and large-scale LLM fine-tuning tests that reveal a critical phase transition.
  • The study provides actionable guidance for off-policy reinforcement learning, recommending conservative baseline settings to effectively balance positive and negative rewards.

Asymmetric REINFORCE for Off-Policy Reinforcement Learning: Balancing Positive and Negative Rewards

The paper introduces and analyzes Asymmetric REINFORCE (AsymRE), a simple yet theoretically grounded off-policy reinforcement learning (RL) algorithm, with a particular focus on its application to LLM fine-tuning. The central insight is that, in off-policy RL, the choice of baseline in the REINFORCE objective fundamentally alters both the training dynamics and the asymptotic behavior of the learned policy. The authors provide a rigorous theoretical analysis, empirical validation in bandit settings, and large-scale experiments with LLMs, demonstrating the practical implications of their findings.

Theoretical Contributions

The AsymRE algorithm is defined by the objective:

$J(\pi) = \mathbb{E}_{y \sim \mu} \left[ \log \pi(y)\,(r(y) - V) \right]$

where $\mu$ is the behavior policy, $\pi$ is the current policy, $r(y)$ is the reward, and $V$ is a tunable baseline. Unlike on-policy REINFORCE, where the baseline serves only to reduce variance, in the off-policy setting the baseline $V$ introduces a bias that can be leveraged to control the emphasis on positive versus negative rewards.
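
To make the objective concrete, here is a minimal sketch of a Monte-Carlo estimate of the AsymRE loss for a categorical policy in PyTorch. The function name, tensor shapes, and categorical setup are illustrative assumptions rather than the authors' implementation; the key point is that the advantage $A = r - V$ multiplies $\log \pi(y)$ for samples drawn from the behavior policy $\mu$.

```python
import torch

def asymre_loss(logits, actions, rewards, baseline_v):
    """Monte-Carlo estimate of the (negated) AsymRE objective.

    logits:     unnormalized log-probabilities of the current policy pi,
                shape (batch, num_actions)
    actions:    samples y drawn from the behavior policy mu, shape (batch,)
    rewards:    r(y) for each sample, shape (batch,)
    baseline_v: the tunable scalar baseline V
    """
    log_pi = torch.log_softmax(logits, dim=-1)
    log_pi_y = log_pi.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # log pi(y)
    advantage = rewards - baseline_v                                  # A = r - V
    # Maximizing E_{y ~ mu}[log pi(y) (r(y) - V)] = minimizing its negation.
    return -(log_pi_y * advantage).mean()
```

Lowering `baseline_v` increases the weight on high-reward samples, while raising it makes low-reward samples contribute larger negative gradients; this is precisely the asymmetry the analysis studies.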

The authors provide a detailed analysis in the tabular setting, showing that:

  • If $V < V^\mu$ (the expected reward under $\mu$), AsymRE converges to a policy that improves upon $\mu$ and maintains broad support.
  • If $V \geq V^\mu$, a phase transition occurs: the policy's support collapses, often to a singleton, leading to premature convergence and loss of diversity.
  • Iterative application of AsymRE with $V < V^\mu$ yields monotonic policy improvement, with the mass of the policy concentrating exponentially fast on the optimal set.

This analysis reveals a critical asymmetry: off-policy updates benefit from focusing on positive rewards (i.e., using a lower baseline), whereas negative rewards from off-policy data are less informative and can be detrimental if overemphasized.

Empirical Validation

Bandit Experiments

In a controlled multi-armed bandit setting, the authors demonstrate that:

  • As $V$ approaches $V^\mu$ from below, the expected reward of the learned policy increases, but the policy's support shrinks.
  • Crossing the threshold $V = V^\mu$ leads to a sudden collapse in support and diversity, confirming the theoretical phase transition.
  • Policy improvement schemes with $V < V^\mu$ yield consistent improvement, while $V \geq V^\mu$ results in suboptimal, deterministic policies.
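
The qualitative picture can be reproduced with a few lines of code. The sketch below uses an assumed toy setup (uniform behavior policy over four arms, Gaussian rewards, plain stochastic-gradient updates), not the authors' exact experimental configuration; it simply illustrates how sweeping $V$ around $V^\mu$ changes the learned policy's reward and entropy.

```python
import numpy as np

def run_asymre_bandit(V, steps=20_000, lr=0.1, seed=0):
    """Toy K-armed bandit trained with the AsymRE update on data from a
    fixed uniform behavior policy mu, for a given scalar baseline V."""
    rng = np.random.default_rng(seed)
    arm_means = np.array([0.2, 0.4, 0.6, 0.8])        # expected reward per arm
    theta = np.zeros(len(arm_means))                   # softmax policy parameters

    for _ in range(steps):
        y = rng.integers(len(arm_means))               # y ~ mu (uniform)
        r = rng.normal(arm_means[y], 0.1)              # noisy reward r(y)
        pi = np.exp(theta - theta.max()); pi /= pi.sum()
        grad_log_pi = -pi; grad_log_pi[y] += 1.0       # gradient of log pi(y)
        theta += lr * (r - V) * grad_log_pi            # AsymRE update, A = r - V

    pi = np.exp(theta - theta.max()); pi /= pi.sum()
    return pi

v_mu = 0.5  # expected reward under the uniform behavior policy
for V in (v_mu - 0.2, v_mu - 0.05, v_mu + 0.05):
    pi = run_asymre_bandit(V)
    reward = float(pi @ np.array([0.2, 0.4, 0.6, 0.8]))
    entropy = float(-(pi * np.log(pi + 1e-12)).sum())
    print(f"V = {V:+.2f}  expected reward = {reward:.3f}  policy entropy = {entropy:.3f}")
```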

LLM Fine-Tuning

The AsymRE objective is adapted for LLMs by using a context-corrected baseline:

$\mathbb{E}_{x \sim \mathcal{D},\, y \sim \mu(\cdot|x)} \left[ \log \pi(y|x)\,\big(r(y, x) - V^{\mu(\cdot|x)} - \delta V\big) \right]$

where $x$ is a prompt and $\delta V$ is a small conservative correction.
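
A minimal sketch of how this loss could be computed is given below, assuming the per-prompt baseline $V^{\mu(\cdot|x)}$ is estimated as the empirical mean reward of the completions sampled from $\mu$ for that prompt; the tensor shapes, reward convention, and estimation scheme are assumptions for illustration, not the paper's exact implementation.

```python
import torch

def asymre_llm_loss(logprobs, rewards, delta_v=-0.1):
    """Per-prompt AsymRE loss sketch for LLM fine-tuning.

    logprobs: sum of token log-probs log pi(y|x) under the current policy,
              shape (num_prompts, samples_per_prompt)
    rewards:  r(y, x) for the same completions (e.g. 1.0 for a correct
              final answer, 0.0 otherwise), same shape
    delta_v:  small conservative correction; the experiments favour a
              slightly negative value such as -0.1
    """
    # Estimate V^{mu(.|x)} per prompt from the completions sampled by mu.
    v_mu_x = rewards.mean(dim=1, keepdim=True)
    advantage = rewards - (v_mu_x + delta_v)   # r(y, x) - V^{mu(.|x)} - delta_V
    return -(logprobs * advantage).mean()
```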

Key findings from experiments with Llama 8B and Qwen 3B on the MATH dataset:

  • Training is stable and performance improves as $\delta V$ approaches $0$ from below.
  • When $\delta V \geq 0$, both training and test accuracy collapse catastrophically, and the entropy of the policy drops, indicating loss of diversity.
  • A small negative $\delta V$ (e.g., $-0.1$) consistently prevents collapse and yields more robust training.

Implications and Discussion

The results have several important implications for both theory and practice:

  • Off-Policy RL for LLMs: The findings provide a principled approach to off-policy RL in LLM fine-tuning, where strict on-policy data collection is often infeasible due to computational and engineering constraints.
  • Baseline Selection: The baseline in off-policy REINFORCE is not merely a variance reduction tool but a critical hyperparameter that governs the trade-off between learning from positive and negative examples. Conservative (lower) baselines are preferable in off-policy settings.
  • Policy Diversity: Maintaining policy diversity is essential in high-dimensional, multi-task settings such as LLMs. Overemphasis on negative rewards (high baseline) can lead to overfitting and poor generalization.
  • Practical Guidance: For practitioners, the recommendation is to set the baseline slightly below the empirical average reward of the behavior policy, ensuring stable and effective off-policy learning.

Future Directions

The paper suggests several avenues for further research:

  • Extending the analysis to more sophisticated objectives incorporating importance sampling or KL regularization.
  • Quantifying the computational and sample efficiency gains from reusing off-policy data in large-scale LLM training.
  • Investigating the interplay between baseline selection and other regularization techniques in RLHF and related alignment methods.

Conclusion

AsymRE offers a theoretically sound and practically effective method for off-policy RL, particularly in the context of LLM alignment and fine-tuning. The work clarifies the nuanced role of the baseline in off-policy policy gradient methods and provides actionable insights for stable and efficient RL-based training of large models. The phase transition phenomenon identified here is especially relevant for practitioners seeking to balance performance and diversity in real-world RL applications.
