
Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference

Published 25 Sep 2024 in cs.LG and stat.ML (arXiv:2409.17401v2)

Abstract: Reward inference (learning a reward model from human preferences) is a critical intermediate step in the Reinforcement Learning from Human Feedback (RLHF) pipeline for fine-tuning LLMs. In practice, RLHF faces fundamental challenges such as distribution shift, reward model overfitting, and problem misspecification. An alternative approach is direct policy optimization without reward inference, such as Direct Preference Optimization (DPO), which provides a much simpler pipeline and has shown empirical success in LLM applications. However, DPO utilizes the closed-form expression between the optimal policy and the reward function, which is only suitable under the bandit setting or deterministic MDPs. This paper develops two RLHF algorithms without reward inference for general RL problems beyond bandits and deterministic MDPs, and general preference models beyond the Bradley-Terry model. The key idea is to estimate the local value function difference from human preferences and then approximate the policy gradient with a zeroth-order gradient approximator. For both algorithms, we establish polynomial convergence rates in terms of the number of policy gradient iterations, the number of trajectory samples, and human preference queries per iteration. Numerical experiments in stochastic environments validate the performance of our proposed algorithms, outperforming popular RLHF baselines such as DPO and PPO. Our paper shows there exist provably efficient methods to solve general RLHF problems without reward inference.


Summary

  • The paper introduces a direct policy optimization method that avoids traditional reward inference by using zeroth-order gradient estimation.
  • ZPG perturbs policy parameters and ZBCPG updates parameter blocks in parallel to efficiently integrate human feedback.
  • Both algorithms demonstrate provable convergence with polynomial sample complexity, simplifying and enhancing the RLHF pipeline.


In the domain of Reinforcement Learning from Human Feedback (RLHF), a key challenge is to refine policies for LLMs without the conventional reward inference stage. This paper introduces methodologies that circumvent that stage altogether: it proposes two algorithms, Zeroth-Order Policy Gradient (ZPG) and Zeroth-Order Block-Coordinate Policy Gradient (ZBCPG), both of which incorporate human feedback directly into the reinforcement learning loop.

Key Contributions

  • Reward Inference Challenges: Conventional RLHF pipelines first infer a reward function from human feedback and then train against it, which raises several difficulties: reward misspecification, the lack of ground truth for evaluating the learned reward model, and distribution shift that leads to overfitting.
  • Direct Policy Optimization: An alternative strategy that operates without explicit reward modeling, leveraging human preferences directly. Direct Preference Optimization (DPO) is an existing method of this kind, but its reliance on a closed-form relation between the optimal policy and the reward function restricts it to bandit settings and deterministic Markov Decision Processes (MDPs).
  • Zeroth-Order Gradient Estimation: The paper applies zeroth-order optimization to RLHF, using human preference queries to estimate local value-function differences and, from them, the directional gradients needed for policy updates. This extends beyond the bandit setting and the deterministic-MDP constraint.
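The primitive underlying this approach is estimating a value difference between two policies from binary preference feedback. A minimal sketch, assuming a Bradley-Terry-style link between value gaps and preference probabilities (the function names and this exact estimator are illustrative, not the paper's):

```python
import math
import random

def estimate_value_gap(pref_oracle, n_queries):
    """Estimate V(pi_1) - V(pi_0) from binary preference feedback,
    assuming P(pi_1 preferred) = sigmoid(V(pi_1) - V(pi_0)).
    pref_oracle() returns 1 if pi_1's trajectory is preferred, else 0."""
    wins = sum(pref_oracle() for _ in range(n_queries))
    p = min(max(wins / n_queries, 1e-6), 1 - 1e-6)  # clip away from {0, 1}
    return math.log(p / (1 - p))  # logit: inverse of the sigmoid link

# Simulated "human" whose preferences follow the model with a true gap of 0.8.
rng = random.Random(0)
true_gap = 0.8
oracle = lambda: rng.random() < 1.0 / (1.0 + math.exp(-true_gap))
est = estimate_value_gap(oracle, 20000)
```

With enough queries the empirical win rate concentrates around the link probability, so inverting the link recovers the value gap up to sampling noise.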

Algorithms Proposed

  1. ZPG (Zeroth-Order Policy Gradient)
    • Perturbs the policy parameters along a random direction and collects human feedback comparing trajectories from the perturbed and unperturbed policies to estimate the potential improvement.
    • The resulting empirical value-difference estimates are plugged into a zeroth-order gradient ascent scheme that improves the policy iteratively.
  2. ZBCPG (Zeroth-Order Block Coordinate Policy Gradient)
    • Computes gradients by sampling multiple coordinates and assessing their impact simultaneously; a form of parallel policy refinement.
    • Offers computational advantages by updating selected blocks of parameters rather than the whole parameter space, enabling efficient handling of high-dimensional spaces.
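The two update rules above can be sketched as follows. This is an illustrative toy, not the paper's implementation: the function names are assumptions, and the preference-based value-gap estimator is replaced by an exact toy value function so the loop is self-contained.

```python
import random

def zpg_step(theta, value_gap, delta, lr, rng):
    """One ZPG-style iteration: perturb all parameters along a random
    Gaussian direction u, estimate V(theta + delta*u) - V(theta) (in the
    paper this comes from human preference queries), and take a
    zeroth-order gradient ascent step."""
    u = [rng.gauss(0.0, 1.0) for _ in theta]
    perturbed = [t + delta * ui for t, ui in zip(theta, u)]
    d = value_gap(perturbed, theta)          # estimated value difference
    return [t + lr * (d / delta) * ui for t, ui in zip(theta, u)]

def zbcpg_step(theta, value_gap, delta, lr, block, rng):
    """ZBCPG-style iteration: same idea, but perturb and update only a
    selected block of coordinates, which is cheaper in high dimensions."""
    u = [rng.gauss(0.0, 1.0) if i in block else 0.0 for i in range(len(theta))]
    perturbed = [t + delta * ui for t, ui in zip(theta, u)]
    d = value_gap(perturbed, theta)
    return [t + lr * (d / delta) * ui for t, ui in zip(theta, u)]

# Toy stand-in for the preference-based estimator: the "value" is a known
# concave function, so the gap is computed exactly instead of from queries.
def toy_value(theta):
    return -sum((t - 1.0) ** 2 for t in theta)

gap = lambda new, old: toy_value(new) - toy_value(old)

rng = random.Random(1)
theta = [0.0] * 4
for _ in range(300):
    theta = zpg_step(theta, gap, delta=0.05, lr=0.02, rng=rng)
# One block-coordinate step on the first two coordinates, for illustration.
theta = zbcpg_step(theta, gap, delta=0.05, lr=0.02, block={0, 1}, rng=rng)
# theta should now be close to the maximizer [1, 1, 1, 1]
```

The block-coordinate variant trades per-step progress for per-step cost: each iteration touches only a subset of parameters, which is the source of the computational advantage noted above.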

Theoretical Insights

Both algorithms provably converge to stationary policies under the assumptions of their framework, demonstrating how human feedback can be leveraged efficiently:

  • The convergence rate and sample complexity are rigorously analyzed: both methods converge polynomially in the number of policy gradient iterations, the number of trajectory samples, and the number of human preference queries per iteration.
  • The resulting rates depend on factors including the planning horizon, the number of feedback queries, and the parameter dimension.
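For context, the estimator both algorithms build on is the classical two-point zeroth-order gradient approximation from gradient-free optimization (the paper's exact smoothing scheme and constants may differ):

```latex
\hat{g} \;=\; \frac{V(\theta + \delta u) - V(\theta)}{\delta}\, u,
\qquad u \sim \mathcal{N}(0, I_d),
\qquad \mathbb{E}[\hat{g}] = \nabla V_\delta(\theta),
```

where $V_\delta$ is a $\delta$-smoothed version of the value function $V$. In the RLHF setting, the difference $V(\theta+\delta u)-V(\theta)$ is not observed directly but estimated from human preference queries, which is where the query complexity enters the rates.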

Implications and Future Directions

Practical Implications:

  • These methods simplify the RLHF pipeline, mitigating the complexities inherent in reward model specification and training.
  • They enhance scalability and have the potential for application in real-world scenarios where quick iterations over policy updates are valuable.

Theoretical Implications:

  • The work generates new lines of inquiry into the intersection of gradient-free optimization techniques and reinforcement learning, opening up avenues for wider applicability in non-conventional MDPs.
  • It lays the groundwork for the development of reinforcement learning algorithms that engage more directly with intuitive human feedback rather than inferred reward schemas.

Speculations:

  • Future research may explore the integration of these methods into more complex environments and tasks, potentially involving partial observability or adversarial settings.
  • Expanding the boundary conditions and assumptions under which these algorithms operate can offer insights into robust RLHF methodologies for more varied operational contexts.

In synthesizing these components, the paper outlines substantial refinements in the implementation and theoretical framing of RLHF paradigms, facilitating more direct engagement with human evaluators and opening pathways to more adaptable and efficient policy-learning models.
