LAPP: Large Language Model Feedback for Preference-Driven Reinforcement Learning

Published 21 Apr 2025 in cs.RO, cs.LG, cs.SY, and eess.SY | (2504.15472v1)

Abstract: We introduce LLM-Assisted Preference Prediction (LAPP), a novel framework for robot learning that enables efficient, customizable, and expressive behavior acquisition with minimum human effort. Unlike prior approaches that rely heavily on reward engineering, human demonstrations, motion capture, or expensive pairwise preference labels, LAPP leverages LLMs to automatically generate preference labels from raw state-action trajectories collected during reinforcement learning (RL). These labels are used to train an online preference predictor, which in turn guides the policy optimization process toward satisfying high-level behavioral specifications provided by humans. Our key technical contribution is the integration of LLMs into the RL feedback loop through trajectory-level preference prediction, enabling robots to acquire complex skills including subtle control over gait patterns and rhythmic timing. We evaluate LAPP on a diverse set of quadruped locomotion and dexterous manipulation tasks and show that it achieves efficient learning, higher final performance, faster adaptation, and precise control of high-level behaviors. Notably, LAPP enables robots to master highly dynamic and expressive tasks such as quadruped backflips, which remain out of reach for standard LLM-generated or handcrafted rewards. Our results highlight LAPP as a promising direction for scalable preference-driven robot learning.

Summary

  • The paper introduces LAPP, a novel framework that uses large language models (LLMs) to generate automatic preference feedback for training robot reinforcement learning policies.
  • Evaluation shows LAPP achieves faster training convergence and higher final performance, enabling robots to perform complex tasks, such as quadruped backflips, that standard methods cannot achieve.
  • LAPP offers a scalable approach for training preference-aligned robot behaviors without extensive manual reward engineering, suggesting a new direction for robot learning.

Overview of LAPP: LLM Feedback for Preference-Driven Reinforcement Learning

The paper "LAPP: LLM Feedback for Preference-Driven Reinforcement Learning" introduces a novel framework, LLM-Assisted Preference Prediction (LAPP), aimed at enhancing robot learning by utilizing the capabilities of LLMs. This approach integrates automatic preference feedback derived from LLMs into the reinforcement learning (RL) process, facilitating efficient policy optimization with minimal human input.

Problem Context

In reinforcement learning, designing effective reward functions remains a significant challenge, as they are often hand-crafted and must align closely with desired objectives and constraints. Traditional methods, such as inverse reinforcement learning and vision-language integrations, attempt to address reward design but often fall short in specifying complex behavioral qualities. LAPP proposes an innovative solution by leveraging LLMs to provide preference judgments over state-action trajectories, creating a scalable mechanism for preference-driven robot learning.

Technical Contributions

The key contribution of LAPP is a framework that allows robots to autonomously learn expressive behaviors from high-level human language specifications, without extensive manual reward shaping. This is achieved by:

  1. Behavior Instruction: Utilizing LLMs to generate preferences from state-action trajectories based on high-level language instructions regarding desired robot behaviors.
  2. Preference Predictor Training: Employing transformer-based models to predict preference rewards from these LLM-generated labels, maintaining trajectory-informed guidance in the RL loop.
  3. Preference-Driven Reinforcement Learning: Combining the predicted preference rewards with standard environment rewards to optimize robot policies through an iterative refinement process, with preference criteria updated as training goals evolve.
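The loop described above can be sketched in miniature. This is an illustrative simplification, not the paper's implementation: LAPP trains a transformer-based predictor over state-action trajectories, whereas here a linear scorer over hand-picked trajectory features stands in; the two-feature trajectories, the `bt_grad_step` helper, and the 0.5 preference weight are all assumptions made for the demo. The training signal is the standard Bradley-Terry cross-entropy on pairwise preference labels, with the LLM playing the role of the labeler.

```python
import math
import random

def score(weights, traj_features):
    """Linear preference score over summary features of a trajectory.
    (A transformer encoder in the paper; linear keeps the sketch tiny.)"""
    return sum(w * x for w, x in zip(weights, traj_features))

def bt_grad_step(weights, traj_a, traj_b, llm_prefers_a, lr=0.1):
    """One SGD step on the Bradley-Terry cross-entropy loss.
    llm_prefers_a is 1.0 if the LLM labeled trajectory A as preferred,
    0.0 if it preferred trajectory B."""
    logit = score(weights, traj_a) - score(weights, traj_b)
    p_a = 1.0 / (1.0 + math.exp(-logit))   # model's P(A preferred)
    err = p_a - llm_prefers_a              # d(loss)/d(logit)
    return [w - lr * err * (xa - xb)
            for w, xa, xb in zip(weights, traj_a, traj_b)]

def combined_reward(env_reward, pref_reward, pref_weight=0.5):
    """Reward the policy actually optimizes: environment reward plus a
    weighted preference reward (the weight here is illustrative)."""
    return env_reward + pref_weight * pref_reward

# Toy demo: feature 0 encodes, say, "gait smoothness". A stand-in
# labeler always prefers the smoother trajectory, and the predictor
# learns a positive weight on that feature.
random.seed(0)
w = [0.0, 0.0]
for _ in range(200):
    a = [random.random(), random.random()]
    b = [random.random(), random.random()]
    w = bt_grad_step(w, a, b, 1.0 if a[0] > b[0] else 0.0)
assert w[0] > 0  # predictor learned to favor the preferred feature
```

In the full framework, the learned score is queried online during RL to produce `pref_reward` for fresh rollouts, and the predictor is periodically retrained as the LLM labels new trajectory pairs.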

Evaluation and Results

The framework was tested across a variety of quadruped locomotion and dexterous manipulation tasks. The results showed that LAPP:

  • Achieved faster training convergence and higher final performance compared to state-of-the-art baselines.
  • Enabled precise control of behaviors such as gait patterns and cadence adjustments through language inputs.
  • Successfully trained robots to perform complex tasks such as quadruped backflips, which standard RL methods could not accomplish.

Implications and Future Directions

The practical implications of LAPP are substantial, offering a methodology for training preference-aligned robot behaviors autonomously, without extensive reward engineering. This suggests a potential shift in RL paradigms, prioritizing scalable, preference-driven learning over static reward design.

Future work could explore several directions:

  • Further refinement in LLM preference generation to reduce computational overhead.
  • Extending LAPP to tasks involving visual state trajectories, broadening its applicability.
  • Investigating automated selection mechanisms of state variables for improved preference accuracy at reduced costs.

In summary, LAPP enhances robot RL by aligning learning processes with high-level language specifications, highlighting a promising direction in AI research for scalable, preference-aware autonomous systems.
