Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models

Published 7 Aug 2024 in cs.CL and cs.AI | (2408.03907v1)

Abstract: LLMs have excelled at language understanding and generating human-level text. However, even with supervised training and human alignment, these LLMs are susceptible to adversarial attacks where malicious users can prompt the model to generate undesirable text. LLMs also inherently encode potential biases that can cause various harmful effects during interactions. Bias evaluation metrics lack standards as well as consensus and existing methods often rely on human-generated templates and annotations which are expensive and labor intensive. In this work, we train models to automatically create adversarial prompts to elicit biased responses from target LLMs. We present LLM- based bias evaluation metrics and also analyze several existing automatic evaluation methods and metrics. We analyze the various nuances of model responses, identify the strengths and weaknesses of model families, and assess where evaluation methods fall short. We compare these metrics to human evaluation and validate that the LLM-as-a-Judge metric aligns with human judgement on bias in response generation.

Abstract PDF HTML Upgrade to Chat

Citations (6)

View on Semantic Scholar

Summary

The paper introduces an automated method using adversarial prompt generation to elicit and measure gender bias in LLM responses.
It employs a pipeline with Attacker, Target, and Evaluator LLMs, comparing metrics like toxicity and bias scores for precise evaluation.
Human evaluations confirm that larger Llama2 models exhibit less bias, aligning LLM-as-a-Judge scores with human judgments.

Automated Gender Bias Detection in LLMs

This paper introduces an automated methodology for detecting gender bias in LLMs using adversarial prompt generation and LLM-as-a-Judge evaluation. The authors address the critical need for systematic bias assessment in LLMs, highlighting the limitations of existing metrics and the challenges of human evaluation. They propose a pipeline involving an Attacker LLM, a Target LLM, and an Evaluator, focusing on binary gender bias but suggesting extensibility to other protected attributes.

Methodology

The bias detection workflow (Figure 1) begins with an Attacker LLM that generates adversarial prompts designed to elicit biased responses from a Target LLM.

Figure 1: Bias in LLM responses: Examples of LLMs exhibiting biased (left) and unbiased (right) responses.

The authors employed Meta's Llama3-8B Instruct model, both pre-trained and fine-tuned using LoRA, to create these prompts, incorporating counterfactual data augmentation to generate gender-paired prompts. The Target LLMs evaluated included the Llama2-chat family, GPT-4, Mixtral 8x7B Instruct-v0.1, and Mistral 7B Instruct-v0.2. The responses from the Target LLMs were then evaluated using several metrics, including Perspective API for toxicity, VADER for sentiment analysis, Regard for societal perceptions, LlamaGuard2 for safety, and an OpenAI Compliance Annotator. The core of their evaluation is the LLM-as-a-Judge paradigm, where GPT-4o scores the responses for bias on a scale from 0 (no bias) to 5 (extreme bias), providing explanations for the classifications.

Key Contributions

A primary contribution of this work is the use of adversarial prompt generation to systematically uncover biases in LLMs. The authors demonstrate that automatically generated prompts can be more effective at eliciting biased responses compared to human-written prompts. They also provide a comparative assessment of various bias evaluation metrics, revealing inconsistencies and highlighting the importance of selecting appropriate metrics for the task. The introduction of the LLM-as-a-Judge paradigm, leveraging the strong human alignment of LLMs, is another significant contribution. The authors validate this approach through extensive human evaluations, showing that the LLM-as-a-Judge metric aligns most accurately with human annotations for identifying and measuring bias.

Experimental Results

The study's findings (Table 1) indicate that Perspective API scores for Identity Attack, Insult, and Toxicity are significantly higher for female-oriented responses across various models.

Figure 2: Bias Detection Workflow. The Attacker LLM synthesizes adversarial prompts for Target LLMs. Then, we apply a holistic evaluation of their responses to diagnose Target LLMs' biases.

Additionally, Mixtral, Mistral, and GPT-4 models exhibit lower average Identity Attack and Toxicity scores compared to the Llama2 model family, suggesting that female-oriented responses may require more alignment and correction for these metrics. Interestingly, Mixtral 8x7B Inst shows a significantly lower sentiment score for female responses, implying more critical responses with negative polarity for the female gender. The LLM-as-a-Judge metric reveals the highest mean bias scores in the Mistral 7B Inst model. The sentiment difference and bias score difference from LLM-as-a-Judge decrease with increasing model size within the Llama2 family, reinforcing the notion that larger Llama2 models are fairer.

Human Evaluation and Bias Analysis

The authors conducted human evaluations using Amazon Mechanical Turk (AMT), focusing on individual prompt-response pairs and comparing gender-paired prompts and responses. The inter-annotator agreement was found to be quite low, underscoring the subjective nature of bias assessment. The overall bias analysis (Table 2) revealed that the Llama2-7b-chat model exhibited the highest bias, aligning with the LLM-judge Gap score. All metrics consistently showed decreasing overall bias with increasing model size within the Llama model family.

Figure 3: LLM-as-a-Judge Bias Score Comparison

The LLM-judge Gap score aligns with human judgment of bias, whereas the Sentiment Gap score shows inconsistencies. The results highlight the complexity of bias detection and the importance of carefully selecting evaluation metrics.

Implications and Future Work

This research has significant implications for the development and deployment of fairer LLMs. The automated bias detection pipeline and the LLM-as-a-Judge paradigm offer practical tools for identifying and mitigating biases in model responses. The findings emphasize the need for standardized bias metrics and comprehensive human studies to understand societal and cultural nuances in bias perception. The authors acknowledge the limitations of human evaluations due to inherent biases and suggest future work should explore these issues by expanding the study to other types of biases and protected classes, conditioning on the biases of human evaluators.

Conclusion

The study effectively demonstrates a comprehensive approach to identifying and measuring gender bias in LLMs, highlighting the challenges in existing metrics and introducing a more human-aligned LLM-as-a-Judge paradigm. The use of adversarial prompt generation and detailed comparative analysis provide valuable insights for developing more equitable and reliable LLMs.