MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions

Published 3 Oct 2024 in cs.CL (arXiv:2410.02743v2)

Abstract: Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning LLMs with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences, where delayed rewards make it challenging for the model to discern which actions contributed to preferred outcomes. This hinders learning efficiency and slows convergence. In this paper, we propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions -- sequences of tokens or higher-level language constructs -- into the learning process. By operating at a higher level of abstraction, our approach reduces the temporal distance between actions and rewards, facilitating faster and more accurate credit assignment. This results in more stable policy gradient estimates and enhances learning efficiency within each episode, all without increasing computational complexity during training or inference. We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis. Our method achieves substantial performance improvements over standard RLHF, with performance gains of up to 30% in text summarization and code generation, 18% in dialogue, and 8% in question answering tasks. Notably, our approach reaches parity with vanilla RLHF 1.7 ~ 2 times faster in terms of training time and continues to outperform it with further training. We make our code and data publicly available at https://github.com/ernie-research/MA-RLHF.

Summary

  • The paper introduces MA-RLHF, a framework that uses macro actions to solve token-level credit assignment issues in long sequence reinforcement learning.
  • The approach achieves up to 30% improvements in tasks like text summarization and code generation, and reaches parity with vanilla RLHF 1.7x to 2x faster in training time.
  • MA-RLHF employs fixed n-gram, parsing-based, and perplexity-based termination strategies to enhance policy gradient estimates and scalability.

This paper presents a novel framework called MA-RLHF, which integrates macro actions into the reinforcement learning from human feedback (RLHF) paradigm, aiming to address the shortcomings of token-level RLHF in LLMs. Specifically, token-level RLHF often struggles with the credit assignment problem across long sequences due to delayed rewards, hindering learning efficiency. By incorporating macro actions, MA-RLHF introduces a higher level of abstraction, enhancing credit assignment and learning efficiency without increasing computational demands.

Key Contributions and Results

The core innovation of MA-RLHF lies in its use of macro actions—sequences of tokens or higher-level language constructs—instead of individual tokens. This abstraction reduces the temporal distance between actions and rewards, thereby facilitating more accurate credit assignment and providing more stable policy gradient estimates. The approach is experimentally validated across various tasks, including text summarization, dialogue generation, question answering, and program synthesis. The reported performance improvements are notable, with gains of up to 30% in text summarization and code generation tasks.

MA-RLHF reaches parity with standard token-level RLHF significantly faster—1.7x to 2x quicker in training time—while continuing to outperform it with further training. This efficiency, coupled with an absence of increased computational complexity, underscores the practical benefits of the macro action approach.

Macro Action Framework

The paper advances the concept of macro actions by proposing three primary termination strategies: fixed n-gram-based, parsing-based, and perplexity-based. These strategies construct macro actions by grouping sequences of tokens, which are optimized using Proximal Policy Optimization (PPO) at the macro action level.

  1. Fixed n-gram-based termination: Offers simplicity by grouping tokens into fixed-length n-grams, improving learning efficiency and scalability.
  2. Parsing-based termination: Utilizes syntactic structures to align macro actions with grammatical constructs, capturing linguistic dependencies more effectively.
  3. Perplexity-based termination: Leverages LLM perplexity to dynamically form macro actions by identifying sequences that contribute to decreasing perplexity.
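To make the termination strategies concrete, here is a minimal sketch of how the fixed n-gram and perplexity-based groupings might be implemented. This is an illustrative reconstruction, not the paper's code: the function names and the log-probability threshold are assumptions, and the perplexity-based rule is approximated by starting a new macro action whenever a token's log-probability falls below a threshold (a proxy for a local rise in perplexity).

```python
def fixed_ngram_macro_actions(tokens, n=5):
    """Group a token sequence into fixed-length n-gram macro actions.

    The final macro action may be shorter than n if the sequence
    length is not a multiple of n.
    """
    return [tokens[i:i + n] for i in range(0, len(tokens), n)]


def perplexity_macro_actions(tokens, token_logprobs, threshold=-2.0):
    """Start a new macro action when a token's log-probability drops
    below a threshold -- a simple proxy for a spike in perplexity.

    tokens and token_logprobs are parallel lists; the threshold is an
    illustrative hyperparameter, not a value from the paper.
    """
    macros, current = [], []
    for tok, lp in zip(tokens, token_logprobs):
        # A surprising (low-probability) token closes the current macro
        # action and opens a new one.
        if current and lp < threshold:
            macros.append(current)
            current = []
        current.append(tok)
    if current:
        macros.append(current)
    return macros
```

The parsing-based variant would replace the threshold test with boundaries taken from a syntactic parse (e.g., closing a macro action at constituent boundaries), which is harder to sketch without committing to a specific parser.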

Through these strategies, MA-RLHF adapts and extends classical policy optimization approaches, demonstrating robustness and enhanced performance across multiple dimensions.
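The shift from token-level to macro-level PPO can be illustrated with a short sketch. Under the standard chain-rule assumption, a macro action's log-probability is the sum of its constituent tokens' log-probabilities, and the PPO ratio and clipped surrogate are then computed per macro action rather than per token. All names here are illustrative, and the clipping value is PPO's common default rather than a value reported in the paper.

```python
import math


def macro_logprobs(token_logprobs, boundaries):
    """Aggregate token log-probs into macro-action log-probs.

    boundaries: list of (start, end) index pairs, end exclusive, one
    pair per macro action. Summing log-probs corresponds to taking the
    product of the token probabilities (chain rule).
    """
    return [sum(token_logprobs[s:e]) for s, e in boundaries]


def ppo_ratios(new_lps, old_lps):
    """Importance ratios pi_new / pi_old, one per macro action."""
    return [math.exp(n - o) for n, o in zip(new_lps, old_lps)]


def clipped_surrogate(ratios, advantages, eps=0.2):
    """Mean clipped PPO objective over macro actions.

    With fewer, higher-level actions per episode, each advantage sits
    closer to the reward it explains -- the credit-assignment benefit
    the paper describes.
    """
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped = max(min(r, 1.0 + eps), 1.0 - eps)
        total += min(r * a, clipped * a)
    return total / len(ratios)
```

For example, a 10-token response segmented into two 5-token macro actions yields two ratios and two advantages instead of ten, shrinking the temporal distance between each action and the (delayed) sequence-level reward.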

Evaluation and Implications

Evaluation using a combination of reward model scores, GPT-4 pairwise comparison, and human pairwise evaluation indicates that MA-RLHF consistently outperforms the baseline methods across tasks. Notably, it maintains scalability across varying model sizes, achieving robust generalization capabilities.

The implications of MA-RLHF are significant for both practical and theoretical aspects of AI development. Practically, the approach offers a more efficient method for aligning LLMs with human preferences, reducing computational overheads and speeding up the training process. Theoretically, it highlights the utility of macro actions in overcoming credit assignment challenges, potentially influencing future research in hierarchical reinforcement learning and policy optimization in LLMs.

Future Directions

Potential future developments could involve exploring more sophisticated or learnable strategies for macro action formation, enhancing adaptability and precision in diverse environments. Extending the framework to other models and datasets could further validate its effectiveness and versatility.

In summary, MA-RLHF represents a significant advancement in RLHF methodologies, demonstrating strong performance improvements through the innovative use of macro actions. Its contributions offer valuable insights into efficient LLM alignment, with broad implications for future research and application in AI.
