- The paper introduces REINFORCE++, an RLHF algorithm that improves convergence and stability by removing the critic network and combining a token-level KL penalty with a PPO-style clipping mechanism.
- The method achieves significant computational efficiency and reduced training time compared to PPO and GRPO through mini-batch processing and reward normalization.
- Experimental results support the algorithm's robustness in mitigating reward hacking and in aligning large language models with human feedback.
REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models
Introduction
The paper "REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models" presents an advance in Reinforcement Learning from Human Feedback (RLHF) tailored to large language models (LLMs). The proposed REINFORCE++ algorithm enhances the classical REINFORCE algorithm with elements of Proximal Policy Optimization (PPO) while removing the need for a critic network. This alignment-focused approach aims for simplicity, improved training stability, and economical use of computational resources. In empirical comparisons, REINFORCE++ exhibits notable robustness and efficiency relative to existing methods such as PPO and Group Relative Policy Optimization (GRPO).
RLHF Challenges and the Role of REINFORCE++
In RLHF, feedback integration is pivotal for aligning model outputs with human preferences. Traditional methods such as PPO demand substantial computational resources, largely because of the critic (value) network, which can also introduce training instability and scalability issues. REINFORCE++ addresses these challenges with a streamlined, critic-free architecture that reduces computational demands while improving training stability.
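In sketch form, both approaches estimate the same policy gradient; the difference lies in how the per-token advantage $A_t$ is obtained. PPO trains a value network to supply a baseline, whereas REINFORCE++ computes the advantage directly from the sequence-level reward $r(x, y)$ and an accumulated token-level KL penalty with coefficient $\beta$ against the frozen SFT reference policy. The notation below is a paraphrase in standard RLHF shorthand, consistent with the paper's construction rather than quoted from it:

$$
\nabla_\theta J(\theta) = \mathbb{E}\!\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t\right],
\qquad
A_t = r(x, y) - \beta \sum_{i=t}^{T} \mathrm{KL}(i),
$$

where $\mathrm{KL}(i) = \log \dfrac{\pi_{\theta_{\text{old}}}(a_i \mid s_i)}{\pi_{\text{SFT}}(a_i \mid s_i)}$, and the advantages are subsequently normalized across the batch.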
Enhancements in REINFORCE++
Algorithmic Innovations
REINFORCE++ incorporates several strategic modifications:
- Token-Level KL Penalty: The KL divergence penalty against the reference (SFT) policy is applied per token rather than per sequence, giving finer control over how far the policy drifts during alignment.
- PPO-Clip Mechanism: A PPO-style clipped surrogate objective bounds each policy update, keeping training stable and within a reliable trust region.
- Mini-Batch Processing: Updating on mini-batches of each rollout batch speeds convergence and uses compute efficiently, aiding scalability to large datasets.
- Reward and Advantage Normalization: Normalizing rewards and advantages improves numerical robustness and guards against instability from high-variance returns.
Together, these modifications reduce the variance of the gradient estimate, the classical weakness of vanilla REINFORCE; the sketch below shows how they compose into a single update.
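The following minimal PyTorch-style sketch illustrates a critic-free update combining the token-level KL penalty, global advantage normalization, and the PPO-clip objective. The function name, tensor layout, and hyperparameter values are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of a REINFORCE++-style loss: illustrative, not the paper's code.
import torch

def reinforce_pp_loss(logprobs,      # (B, T) log pi_theta(a_t|s_t) for sampled tokens
                      old_logprobs,  # (B, T) log-probs from the rollout (behavior) policy
                      ref_logprobs,  # (B, T) log-probs from the frozen SFT/reference policy
                      seq_rewards,   # (B,)   scalar reward r(x, y) per response
                      mask,          # (B, T) 1 for response tokens, 0 for padding
                      beta=0.01, clip_eps=0.2):
    # Token-level KL penalty against the reference policy (per-token sample estimate).
    kl = (old_logprobs - ref_logprobs) * mask                     # (B, T)

    # Critic-free advantage: the sequence reward minus the KL penalty
    # accumulated from token t onward (reverse cumulative sum).
    future_kl = torch.flip(torch.cumsum(torch.flip(kl, [1]), 1), [1])
    adv = seq_rewards.unsqueeze(1) - beta * future_kl             # (B, T)

    # Global advantage normalization over valid tokens for numerical stability.
    flat = adv[mask.bool()]
    adv = (adv - flat.mean()) / (flat.std() + 1e-8)

    # PPO-style clipped surrogate objective; note there is no value network.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    loss = -(torch.min(unclipped, clipped) * mask).sum() / mask.sum()
    return loss
```

In training, this loss would be minimized over several mini-batches drawn from each rollout batch, which is where the mini-batch processing noted above enters.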
Experimental Setup and Results
Empirical evaluations compared REINFORCE++ against PPO and GRPO across diverse datasets, measuring both alignment quality and resource efficiency. The benchmarks spanned general-domain scenarios using Bradley-Terry reward models as well as specialized mathematical domains, testing the algorithm's adaptability across contexts.
Key observations from the experiments include:
- Training Stability: In general-domain scenarios, REINFORCE++ was more resistant to reward hacking than the baselines. In mathematical settings, it achieved substantial reward improvements, consistent with its variance-reduction design.
- Computational Efficiency: Relative to PPO, REINFORCE++ showed a marked reduction in both memory usage and training time, underscoring its suitability for cost-effective, large-scale deployment on hardware such as the NVIDIA H100.
Conclusion
REINFORCE++ represents a pragmatic advance in RLHF methodology, balancing alignment performance with computational economy. Its design, derived from a classical algorithm and selectively augmented with modern stabilization techniques, may inform future RLHF systems for increasingly complex alignment problems. By reducing overhead while improving stability, REINFORCE++ stands as a credible alternative to established methods, paving the way for further research and practical deployment in AI alignment.