Anyprefer: An Agentic Framework for Preference Data Synthesis
The paper, "Anyprefer: An Agentic Framework for Preference Data Synthesis," addresses the challenge of generating high-quality preference data for aligning foundation models with human values. Its main contribution is the Anyprefer framework, which automatically synthesizes preference data by modeling the synthesis process as a cooperative two-player Markov Game between a target model and a judge model, with external tools supplying evidence so that rewarding and ranking remain free of self-preference bias.
Methodology
Anyprefer frames preference data synthesis as a cooperative game between a target model and a judge model. The target model generates candidate responses to input prompts, while the judge model evaluates these responses using information aggregated from external tools. This setup mitigates the self-preference bias that arises when the target model's own reward signal is used to judge its responses. The framework also introduces a feedback mechanism that optimizes the input prompts of both models, improving their collaboration and the quality of the generated data.
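The target-judge interaction can be sketched as follows. All interfaces here (the callables, the `RankedPair` container, `synthesize_pair`) are illustrative assumptions for exposition; the paper does not specify concrete APIs.

```python
# Minimal sketch of one round of target/judge preference-pair synthesis.
# Names and signatures are hypothetical, not taken from the paper.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RankedPair:
    chosen: str      # judge-preferred response
    rejected: str    # judge-dispreferred response

def synthesize_pair(
    target: Callable[[str], List[str]],        # target model: prompt -> candidates
    judge: Callable[[str, str, str], float],   # judge: (prompt, evidence, response) -> score
    tools: List[Callable[[str], str]],         # external tools: prompt -> evidence
    prompt: str,
) -> RankedPair:
    # Response sampling: the target model proposes candidate responses.
    candidates = target(prompt)
    # External tools supply evidence so the judge does not rely on the
    # target model's own (potentially self-biased) signal.
    evidence = "\n".join(tool(prompt) for tool in tools)
    # Response rewarding: the judge scores each candidate against the
    # evidence; the best and worst candidates form a preference pair.
    scored = sorted(candidates,
                    key=lambda r: judge(prompt, evidence, r),
                    reverse=True)
    return RankedPair(chosen=scored[0], rejected=scored[-1])
```

With stub models, a longer (here, better-scored) response is selected as `chosen` and the lowest-scored as `rejected`.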
A series of steps are followed in the Anyprefer framework:
- Response Sampling: The target model generates multiple responses to given prompts.
- Response Rewarding: The judge model ranks these responses using knowledge aggregated from external tools.
- Data Quality Evaluation: A reward model assesses the quality of ranked preference pairs, providing surrogate rewards that guide iterative refinements.
- Feedback Mechanism: Prompt optimization for the models allows them to produce higher-quality responses in subsequent iterations.
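The four steps above can be combined into a single iterative loop. The acceptance threshold, iteration cap, and function names below are hypothetical placeholders, not details taken from the paper:

```python
# Hedged sketch of the iterative synthesis loop (steps 1-4 above).
# The gating threshold and prompt-refinement signature are assumptions.
def synthesis_loop(sample, rank, reward_model, refine_prompts,
                   prompt, judge_prompt, max_iters=3, threshold=0.8):
    """Iterate until the reward model accepts the ranked pair."""
    for _ in range(max_iters):
        candidates = sample(prompt)                        # step 1: response sampling
        chosen, rejected = rank(judge_prompt, candidates)  # step 2: response rewarding
        score = reward_model(prompt, chosen, rejected)     # step 3: data quality evaluation
        if score >= threshold:
            return (chosen, rejected)                      # accept the preference pair
        # Step 4: feedback mechanism -- the surrogate reward guides
        # prompt optimization for both the target and judge models.
        prompt, judge_prompt = refine_prompts(prompt, judge_prompt, score)
    return None  # discard the pair if quality never meets the bar
```

In this sketch, a low surrogate reward triggers another round with refined prompts, which is how the feedback mechanism drives higher-quality pairs in later iterations.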
This process yields Anyprefer-V1, a dataset of 58K high-quality preference pairs validated across multiple application domains.
Experimental Outcomes
The effectiveness of Anyprefer is substantiated through extensive experiments across four major application areas:
- Natural Language Generation: Anyprefer achieves a significant 18.55% average improvement across five datasets, demonstrating enhanced capabilities in generating coherent and context-appropriate responses.
- Vision-Language Understanding: Improvements averaged 3.66% across nine benchmarks, emphasizing the framework's effectiveness in multimodal domains.
- Medical Image Analysis: Notable enhancements of 30.05% highlight Anyprefer's strength in niche, domain-specific tasks.
- Visuo-Motor Control: Success rates increased by 16.00% in four robotics tasks, showcasing the framework's adaptability to complex control environments.
Implications and Future Directions
The findings suggest that Anyprefer's tool-augmented rewards and feedback mechanisms could substantially reduce reliance on manual data annotation, offering a scalable, adaptable method for aligning models with human values. The iterative preference fine-tuning process documented in the paper further underscores the framework's potential for self-improvement and adaptive learning.
In terms of future research, Anyprefer paves the way for broader applications in AI, particularly in fields demanding nuanced understanding and complex reasoning, such as legal and ethical AI systems, autonomous driving, and personalized medicine.
The insights provided by this paper offer substantial contributions to both theoretical and practical aspects of AI alignment, demonstrating a viable pathway for future model development and enhancement.