Language Model Preference Evaluation with Multiple Weak Evaluators

Published 14 Oct 2024 in cs.CL, cs.AI, and cs.LG | (2410.12869v3)

Abstract: Despite the remarkable success of LLMs, evaluating the quality of their outputs with respect to preference remains a critical challenge. Existing works usually employ an LLM as a judge for pairwise comparison of model outputs, yet such model-based evaluators are weak: they produce conflicting preferences (e.g., output A is judged better than B, B better than C, yet C better than A), leading to contradictory evaluation results. To address this, we introduce GED (Preference Graph Ensemble and Denoise), a novel approach that leverages multiple model-based evaluators to construct preference graphs, then ensembles and denoises these graphs to obtain better, non-contradictory evaluation results. In particular, our method consists of two primary stages: aggregating evaluations into a unified graph and applying a denoising process to eliminate cyclic inconsistencies, ensuring a directed acyclic graph (DAG) structure. We provide theoretical guarantees for our framework, demonstrating its efficacy in recovering the ground-truth preference structure. Extensive experiments on ten benchmarks demonstrate GED's superiority in three applications: model ranking, response selection, and model alignment. Notably, GED combines small LLM evaluators (e.g., Llama3-8B, Mistral-7B, Qwen2-7B) to outperform stronger ones (e.g., Qwen2-72B), showcasing its effectiveness in enhancing evaluation reliability and improving model performance.

Summary

  • The paper introduces the Preference Graph Ensemble and Denoise (GED) framework to convert cyclic preference graphs into acyclic ones for reliable LLM evaluation.
  • It demonstrates improved performance over baselines in tasks like model ranking and response selection across benchmarks such as HumanEval and MATH.
  • The study highlights that combining multiple weak evaluators can effectively capture true preference structures while reducing computational costs.

Language Model Preference Evaluation with Multiple Weak Evaluators

The paper "Language Model Preference Evaluation with Multiple Weak Evaluators" presents a novel approach to enhance the reliability of preference evaluation for LLMs. The authors address a prominent issue in LLM evaluation: even strong judge models, like GPT-4, assess model outputs inconsistently, producing cyclic preference patterns. These cycles degrade the reliability of such evaluations, highlighting the need for new methodologies.

Methodology

The proposed framework, termed Preference Graph Ensemble and Denoise (GED), leverages multiple weak evaluators to construct preference graphs. The process comprises two phases: aggregating the evaluations into a unified preference graph, then denoising that graph so it becomes acyclic. The denoising step removes cyclic inconsistencies, yielding a Directed Acyclic Graph (DAG) that encodes a consistent evaluation.
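The aggregation stage can be sketched as follows. The function name and the vote format are illustrative assumptions, not the authors' implementation: each weak evaluator contributes pairwise votes, which are summed so that only the net-preferred direction survives for each pair.

```python
from collections import defaultdict

def aggregate_preferences(evaluator_votes):
    """Merge pairwise votes from several evaluators into one weighted graph.

    evaluator_votes: list of dicts mapping (winner, loser) -> vote count,
    one dict per evaluator. Returns a dict mapping (u, v) -> net weight,
    keeping only the majority direction for each unordered pair; ties
    cancel out and produce no edge.
    """
    combined = defaultdict(int)
    for votes in evaluator_votes:
        for (winner, loser), count in votes.items():
            combined[(winner, loser)] += count

    graph = {}
    for (u, v), w in combined.items():
        if (u, v) in graph or (v, u) in graph:
            continue  # this pair was already resolved
        opposite = combined.get((v, u), 0)
        if w > opposite:
            graph[(u, v)] = w - opposite
        elif opposite > w:
            graph[(v, u)] = opposite - w
    return graph
```

For example, if two evaluators prefer A over B and one prefers B over A, the merged graph keeps a single A→B edge with net weight 1, while a B/C tie produces no edge at all.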

The framework hinges on the construction of preference graphs, where vertices represent responses and directed edges indicate pairwise preferences. Noise manifests as cycles within these graphs. Eliminating this noise amounts to finding a minimum Feedback Arc Set (FAS), i.e., a smallest set of edges whose removal leaves the graph acyclic, and deleting those edges to obtain a DAG.
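One simple way to approximate FAS removal (a hypothetical sketch, not the paper's exact algorithm, since minimum FAS is NP-hard in general) is greedy insertion: add the heaviest, i.e. most agreed-upon, edges first and drop any edge that would close a cycle, which guarantees the result is a DAG.

```python
from collections import defaultdict

def denoise_to_dag(graph):
    """Greedy feedback-arc-set approximation.

    graph: dict mapping (u, v) -> weight. Edges are inserted in order of
    decreasing weight; an edge u->v is kept only if v cannot already
    reach u, so no cycle is ever formed and the result is a DAG.
    """
    adj = defaultdict(set)

    def reachable(src, dst):
        # Iterative DFS: is there a directed path src -> dst?
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(adj[node])
        return False

    kept = {}
    for (u, v), w in sorted(graph.items(), key=lambda e: -e[1]):
        if not reachable(v, u):  # adding u->v is safe iff v cannot reach u
            adj[u].add(v)
            kept[(u, v)] = w
    return kept
```

On the cycle A→B (weight 3), B→C (weight 2), C→A (weight 1), the weakest edge C→A is the one discarded, breaking the cycle at the least-supported preference.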

Theoretical Insights

The authors provide theoretical guarantees for their framework's capability to recover the ground truth preferences. By modeling each preference graph as a random perturbation of an underlying DAG, they prove that their ensemble and denoising strategy can effectively approximate the true preference structure with high probability.
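As an illustrative back-of-the-envelope argument (not the paper's exact theorem): if each evaluator independently flips a true edge with probability $p < 1/2$, then a majority vote over $k$ evaluators mislabels that edge with probability bounded by Hoeffding's inequality,

```latex
\Pr[\text{majority wrong}] \le \exp\!\left(-2k\left(\tfrac{1}{2} - p\right)^2\right),
```

so the failure probability decays exponentially in the number of evaluators, which is the intuition behind ensembling many weak judges.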

Experimental Results

The efficacy of the GED approach is demonstrated through extensive experiments across multiple benchmark datasets, including HumanEval, MATH, and AlpacaEval. The results consistently show that GED outperforms baseline methods in tasks such as model ranking, response selection, and model alignment. Notably, combining weaker evaluators like Llama3-8B, Mistral-7B, and Qwen2-7B even surpasses stronger models like Qwen2-72B.

In response selection tasks, GED showed an average improvement of 4.51% over baseline methods. In model ranking tasks, GED produced rankings that aligned more closely with the ground truth, validating its robustness.
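Once a denoised DAG is in hand, a ranking consistent with every kept preference can be read off with a topological sort. The sketch below uses Kahn's algorithm; this is a generic illustration, not necessarily the ranking rule used in the paper.

```python
from collections import defaultdict, deque

def rank_from_dag(edges):
    """Derive a ranking from a denoised preference DAG.

    edges: iterable of (u, v) pairs where u is preferred over v.
    Returns the nodes in topological order (most-preferred first),
    breaking ties alphabetically for determinism.
    """
    nodes = {n for e in edges for n in e}
    indeg = {n: 0 for n in nodes}
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1

    queue = deque(sorted(n for n in nodes if indeg[n] == 0))
    ranking = []
    while queue:
        n = queue.popleft()
        ranking.append(n)
        for m in adj[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return ranking
```

Because the denoised graph is acyclic, such an ordering always exists, which is precisely why the denoising step matters: a cyclic preference graph admits no consistent ranking at all.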

Implications and Future Directions

The implications of this research are significant for both practical and theoretical advancements in AI evaluation. Practically, it offers a method to improve model evaluations without solely relying on strong models, which often entail higher computational costs. Theoretically, the work extends the understanding of ensemble methods in preference evaluation, suggesting that aggregation of weak evaluators can enhance reliability.

Future developments could explore broader applications of the ensemble and denoising methodology in other AI decision-making tasks or extend its applicability to environments with more complex evaluation criteria. Moreover, investigating alternative denoising algorithms beyond FAS minimization could yield further enhancements in evaluation accuracy.

Conclusion

In summary, "Language Model Preference Evaluation with Multiple Weak Evaluators" introduces a framework that promises more consistent and reliable evaluation of LLM performance through its preference graph ensemble and denoise technique. The theoretical and empirical contributions of this paper provide a robust foundation for future improvements in AI evaluation methodologies.
