Self-Preference Bias in LLM-as-a-Judge

Published 29 Oct 2024 in cs.CL | (2410.21819v2)

Abstract: Automated evaluation leveraging LLMs, commonly referred to as LLM evaluators or LLM-as-a-judge, has been widely used in measuring the performance of dialogue systems. However, the self-preference bias in LLMs has posed significant risks, including promoting specific styles or policies intrinsic to the LLMs. Despite the importance of this issue, there is a lack of established methods to measure the self-preference bias quantitatively, and its underlying causes are poorly understood. In this paper, we introduce a novel quantitative metric to measure the self-preference bias. Our experimental results demonstrate that GPT-4 exhibits a significant degree of self-preference bias. To explore the causes, we hypothesize that LLMs may favor outputs that are more familiar to them, as indicated by lower perplexity. We analyze the relationship between LLM evaluations and the perplexities of outputs. Our findings reveal that LLMs assign significantly higher evaluations to outputs with lower perplexity than human evaluators, regardless of whether the outputs were self-generated. This suggests that the essence of the bias lies in perplexity and that the self-preference bias exists because LLMs prefer texts more familiar to them.

Abstract PDF HTML Upgrade to Chat

Citations (2)

View on Semantic Scholar

Summary

The paper introduces a novel metric to quantify self-preference bias by comparing LLM evaluators' tendencies to favor their own outputs over others.
It demonstrates that lower output perplexity significantly increases bias, with models such as GPT-4 exhibiting pronounced self-favoritism compared to human judgments.
The study suggests that ensemble evaluations could mitigate these biases, paving the way for fairer and more balanced automated assessment systems.

Self-Preference Bias in LLM-as-a-Judge

Introduction

The study of automated evaluation mechanisms using LLMs, known as LLM evaluators or LLM-as-a-judge, has been gaining traction due to their ability to provide scalable and consistent assessments in dialogue systems. However, a significant limitation of this approach is the self-preference bias, where LLMs tend to favor their own outputs over those generated by others. This paper introduces a novel quantitative metric to assess this bias and investigate its causes, emphasizing the pivotal role of output perplexity in this phenomenon.

Quantifying Self-Preference Bias

The study proposes a metric grounded in algorithmic fairness concepts to measure self-preference bias in LLMs during pairwise evaluations. The metric compares the likelihood that an LLM gives preference to its own output over others, aligning this with human preferences. Specifically, the bias is represented as the difference in the probability of an evaluator favoring its own response versus another response, when both are rated favorably by humans.

Figure 1: Confusion matrix for each LLM evaluator's assessment of pairs including its own output. It suggests that some LLMs, including GPT-4, have relatively high true positive rates compared to true negative rates, indicating self-preference bias.

The experimental setup involves using a dataset from Chatbot Arena with LLMs such as GPT-3.5-Turbo, GPT-4, Vicuna-13b, and others. The primary task is to evaluate two responses to a user query, analyzing evaluation scores to determine the degree of bias.

Results

Empirical analysis reveals that GPT-4 exhibits a pronounced self-preference bias. When evaluated against human judgments, GPT-4 favored responses it generated significantly more than human evaluators did. A notable finding is that other models like Vicuna-13b and Koala-13b also display this bias, whereas some models like oasst-pythia-12b and dolly-v2-12b exhibit reverse bias, underestimating their own outputs.

Impact of Perplexity on Bias

A major hypothesis explored in this research is the impact of text familiarity, measured through perplexity, on self-preference bias. LLM evaluators are found to assign higher scores to responses with lower perplexity, which are more familiar or predictable to them. This behavior was consistent across most LLMs evaluated, suggesting a systemic bias rooted in training data familiarity rather than response quality alone.

Figure 2: LLMs vs human conditioned on perplexity. Winning judgment rates by LLMs conditioned on perplexity with human winning judgment rates are plotted. All models except dolly-v2-12b and stablelm-tuned-alpha-7b demonstrated a clear tendency to assign higher evaluations to responses with lower perplexity.

The perplexity-based bias was observed irrespective of whether the response was self-generated or produced by other models, revealing this as a broader issue not confined to merely recognizing one's own outputs.

Discussion

The findings illuminate potential avenues to mitigate self-preference bias, such as employing ensemble evaluations across multiple models to prevent any single model's peculiarities from unduly influencing overall judgments. Additionally, these results underscore the necessity to account for model familiarity bias when designing LLM evaluators, to avoid perpetuating inherent model biases in automated assessments.

Conclusion

This research presents a comprehensive approach to quantifying and understanding self-preference bias in LLM evaluators. By linking this bias to perplexity, it opens pathways for addressing such biases in future LLM development and deployment. The proposed metric provides a foundation for further exploration and potential rectification of biases in LLM-as-a-judge scenarios, ensuring fairer and more balanced automated evaluation systems.