Sample Efficient Preference Alignment in LLMs via Active Exploration

Published 1 Dec 2023 in cs.LG, cs.AI, and stat.ML (arXiv:2312.00267v3)

Abstract: Preference-based feedback is important for many applications in machine learning where evaluation of a reward function is not feasible. Notable recent examples arise in preference alignment for LLMs, including in reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). For many applications of preference alignment, the cost of acquiring human feedback can be substantial. In this work, we take advantage of the fact that one can often choose contexts at which to obtain human feedback to most efficiently identify a good policy, and formalize the setting as an active contextual dueling bandit problem. We propose an active exploration algorithm to efficiently select the data and provide theoretical proof that it has a polynomial worst-case regret bound. We extend the setting and methodology for practical use in preference alignment of LLMs. We provide two extensions, an online and an offline approach. Our method outperforms the baselines with limited samples of human preferences on several LLMs and four real-world datasets including two new datasets that we contribute to the literature.

Citations (17)
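
The abstract does not spell out implementation details, but the core idea it describes, actively choosing the contexts at which to request human preference feedback rather than sampling them at random, can be illustrated with a minimal sketch. Everything below (the Bradley-Terry uncertainty criterion, the array shapes, and the function and variable names) is an illustrative assumption for exposition, not the paper's actual algorithm or regret-minimizing selection rule.

# Hypothetical sketch: query feedback on the context whose two candidate
# responses the current reward estimate is least able to distinguish.
# Names and the selection rule are assumptions, not the paper's method.
import numpy as np

rng = np.random.default_rng(0)

def preference_prob(r_a, r_b):
    """Bradley-Terry probability that response A is preferred over response B."""
    return 1.0 / (1.0 + np.exp(-(r_a - r_b)))

def select_context(reward_scores):
    """Pick the context index where the predicted preference probability
    between the two candidate responses is closest to 0.5 (most uncertain)."""
    probs = preference_prob(reward_scores[:, 0], reward_scores[:, 1])
    return int(np.argmin(np.abs(probs - 0.5)))

# Toy setup: 5 contexts, each with estimated rewards for two candidate responses.
reward_scores = rng.normal(size=(5, 2))
chosen = select_context(reward_scores)
print(f"Query human preference feedback on context {chosen}")

In this toy loop, the label obtained for the selected context would be used to update the reward or preference model before the next selection; the paper's contribution is a principled version of this selection step with a polynomial worst-case regret guarantee and online/offline variants for LLM preference alignment.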