
Evaluating Large Language Models through Gender and Racial Stereotypes

Published 24 Nov 2023 in cs.CL, cs.AI, and cs.CY (arXiv:2311.14788v1)

Abstract: LLMs have ushered in a new age of AI, gaining traction within the NLP community as well as among the general population. AI's ability to make predictions and generate text, and its applications in sensitive decision-making scenarios, make it all the more important to study these models for biases that may exist and be exaggerated. We conduct a qualitative comparative study and establish a framework to evaluate LLMs for two kinds of bias, gender and race, in a professional setting. We find that while gender bias has decreased immensely in newer models compared to older ones, racial bias still persists.


Summary

  • The paper evaluates LLM performance in gender assignment using bias scores, finding that GPT-3.5 exhibits the least bias among evaluated models.
  • It assesses racial bias by analyzing AI-generated descriptions with LIWC, uncovering persistent stereotypes across various professions.
  • The study emphasizes that mitigating biases in LLMs is crucial for ensuring fair, unbiased decision-making in sensitive professional contexts.

Introduction

LLMs have become integral to various applications in sensitive decision-making scenarios, making the presence of biases within them a significant concern. This research evaluates two primary biases, gender and race, within a professional context. If not addressed effectively, these biases could skew outcomes and perpetuate societal stereotypes. The study uses a dataset of 99 professions to assess whether models exhibit biases when assigning a gender or race to these professions. While gender bias appears to be on the decline, racial bias still persists in LLMs.

Methodology

The study employs a two-pronged approach: one for gender and another for racial bias. Gender bias is tested by tasking models with assigning a gender to different professions, comparing the results against human-annotated ground truth. The evaluation covers both older models (like BERT, GPT-2) and newer ones (like GPT-3.5 and Claude). Racial bias is assessed by generating descriptions for individuals of various races in different professions and analyzing the responses for stereotypes. The study operationalizes societal biases as varied accuracies in judgment based on gender, race, and social status.
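The gender-assignment probe can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the professions, ground-truth labels, and the stubbed `query_model` function (which stands in for a real LLM call) are all hypothetical.

```python
# Hypothetical sketch of the gender-assignment evaluation: prompt a model to
# assign a gender to each profession and score disagreement with
# human-annotated ground truth. All data below is illustrative.
PROFESSIONS = ["nurse", "engineer", "teacher", "carpenter"]
GROUND_TRUTH = {p: "neutral" for p in PROFESSIONS}  # annotators marked all as gender-neutral

def query_model(profession):
    """Stand-in for an LLM call such as 'What is the gender of a {profession}?'"""
    stereotyped = {"nurse": "female", "engineer": "male"}  # canned stereotyped answers
    return stereotyped.get(profession, "neutral")

def bias_score(professions, ground_truth):
    """Fraction of professions where the model's label disagrees with ground truth."""
    mismatches = sum(query_model(p) != ground_truth[p] for p in professions)
    return mismatches / len(professions)

score = bias_score(PROFESSIONS, GROUND_TRUTH)
print(f"bias score: {score:.2f}")  # 2 of 4 professions stereotyped here -> 0.50
```

In the study itself, this comparison is run over all 99 professions and across both older models (BERT, GPT-2) and newer ones (GPT-3.5, Claude).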

Gender Analysis

Investigating gender bias, the study finds that newer models like GPT-3.5 show improvement over older versions, with a substantial reduction in gender bias. However, challenges remain as models like Flan-T5 exhibit significant biases, failing to embrace recent shifts towards gender neutrality in professions. Metrics such as bias score are used to compare model performances, showing that GPT-3.5 exhibits the least bias among the evaluated models. The research highlights that while advancements are evident, the path to completely unbiased AI representations of gender in professions is still unfolding.
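The cross-model comparison described above amounts to ranking models by their bias scores. A trivial sketch, with hypothetical numbers that are not taken from the paper:

```python
# Illustrative per-model bias scores (fraction of professions where the
# assigned gender disagreed with ground truth); numbers are hypothetical.
bias_scores = {"BERT": 0.41, "GPT-2": 0.37, "Flan-T5": 0.33, "GPT-3.5": 0.08}

# Rank models from least to most biased.
ranking = sorted(bias_scores, key=bias_scores.get)
print(ranking)  # GPT-3.5 ranks first under these hypothetical scores
```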

Race Analysis

In assessing racial bias, GPT-3.5 generates descriptions that adhere to stereotypes for different races across various professions. By measuring the similarity of responses and employing a Linguistic Inquiry and Word Count (LIWC) analysis, the study shows noticeable differences in the emotional, social, and work-related attributes ascribed to different races. These inconsistencies reveal implicit biases where certain races are depicted with more emotive descriptors or differing attitudes towards work and social interactions.
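An LIWC-style analysis boils down to measuring what fraction of a description's words fall into categories such as emotion, work, and social. The sketch below uses toy lexicons (the real LIWC dictionaries are proprietary) and invented example descriptions, purely to illustrate how category profiles of two generated descriptions can be compared:

```python
import re
from collections import Counter

# Toy LIWC-style lexicons; the actual LIWC categories and word lists differ.
LEXICON = {
    "emotion": {"warm", "caring", "passionate", "happy"},
    "work":    {"diligent", "skilled", "efficient", "professional"},
    "social":  {"friendly", "team", "community", "outgoing"},
}

def category_profile(text):
    """Fraction of tokens falling into each lexicon category."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for tok in tokens:
        for cat, words in LEXICON.items():
            if tok in words:
                counts[cat] += 1
    return {cat: counts[cat] / len(tokens) for cat in LEXICON}

# Hypothetical generated descriptions for the same profession.
desc_a = "A warm and caring nurse, friendly with the community."
desc_b = "A diligent and efficient nurse, skilled and professional."
print(category_profile(desc_a))  # weighted toward emotion/social words
print(category_profile(desc_b))  # weighted toward work-related words
```

Systematic differences in such profiles across races, for the same profession, are what the study reads as evidence of stereotyping.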

Conclusion

The evaluation framework developed and applied in this study demonstrates that, despite improvements, LLMs such as GPT-3.5 still exhibit biases related to gender and race. The research underlines the importance of continued efforts to mitigate these biases, suggesting that future studies could broaden the analysis to include other models and evaluate the impact of biases on human behavior more directly. The study contributes to the critical discourse on creating fairer AI systems by providing a method to identify and measure the subtle prejudices that could influence real-world decisions.

