Causal origins of implicit bias in large language models

Investigate why the implicit biases elicited by characteristic-based cues in the ImplicitBBQ benchmark arise in large language models, by tracing and quantifying the contributions of pretraining data, supervised fine-tuning, and alignment procedures.

Background

The paper introduces ImplicitBBQ, a question-answering benchmark that measures implicit bias in LLMs using characteristic-based cues across six demographic dimensions (age, gender, region, religion, caste, and socioeconomic status). Across evaluations of 11 models, the authors find that implicit bias in ambiguous contexts is substantially higher than explicit bias, and that mitigation strategies such as safety prompting, few-shot prompting, and chain-of-thought reduce but do not eliminate this gap. Caste-related bias is especially persistent and mitigation-resistant.
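To make the "implicit bias in ambiguous contexts" measurement concrete, the sketch below computes a BBQ-style ambiguous-context bias score. This is a minimal illustration assuming ImplicitBBQ follows the original BBQ formulation (disambiguated bias score scaled by the error rate); the paper's exact metric may differ, and all names here (`Example`, `ambiguous_bias_score`, the `"target"`/`"non_target"`/`"unknown"` labels) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Example:
    """One ambiguous-context QA item, where the correct answer is 'unknown'.

    prediction is the model's chosen option:
      "unknown"    -- the correct abstention,
      "target"     -- the stereotyped (biased) group,
      "non_target" -- the other group.
    """
    prediction: str

def ambiguous_bias_score(examples: list[Example]) -> float:
    """BBQ-style bias score for ambiguous contexts, in [-1, 1].

    s_dis = 2 * (biased answers / non-'unknown' answers) - 1, then scaled
    by the error rate (1 - accuracy), since in an ambiguous context every
    non-'unknown' answer is an error. 0 means unbiased; positive values
    mean errors lean toward the stereotyped group.
    """
    n = len(examples)
    non_unknown = [e for e in examples if e.prediction != "unknown"]
    if not non_unknown:
        return 0.0  # model always abstains: no biased errors to measure
    biased = sum(1 for e in non_unknown if e.prediction == "target")
    s_dis = 2 * biased / len(non_unknown) - 1
    accuracy = 1 - len(non_unknown) / n  # fraction answered "unknown"
    return (1 - accuracy) * s_dis
```

For example, a model that answers "target" on 3 of 4 ambiguous items and "unknown" on the rest scores 0.75, reflecting both frequent errors and a strong stereotyped lean in those errors.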

While the study quantifies the presence and degree of implicit bias, it does not identify its underlying causes in model development pipelines. The authors explicitly note that they did not trace biases to pretraining data, fine-tuning, or alignment, leaving open the question of which stages and mechanisms most contribute to the observed implicit biases. Addressing this would clarify where and how to intervene to mitigate such biases effectively.
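One simple way to frame the stage-attribution question is to score the same benchmark at successive pipeline checkpoints (base pretrained model, post-SFT, post-alignment) and attribute the change in bias to each stage as the delta between consecutive checkpoints. The sketch below illustrates only that bookkeeping; the checkpoint names and the availability of intermediate checkpoints are assumptions, not something the paper provides.

```python
def attribute_bias_by_stage(scores_by_stage: list[tuple[str, float]]) -> dict[str, float]:
    """Attribute changes in a bias score to pipeline stages.

    scores_by_stage: ordered (stage_name, bias_score) pairs measured on the
    same benchmark at successive checkpoints, e.g.
    [("pretrained", 0.40), ("sft", 0.30), ("aligned", 0.35)].

    Returns, for each post-initial stage, the score change it introduced
    (current checkpoint minus the previous one). Negative = the stage
    reduced measured bias; positive = it increased it.
    """
    contributions = {}
    for (_, prev_score), (name, cur_score) in zip(scores_by_stage, scores_by_stage[1:]):
        contributions[name] = cur_score - prev_score
    return contributions
```

This first-difference attribution assumes stages act sequentially on one lineage of checkpoints; disentangling interactions between stages (or tracing bias to specific pretraining data) would require additional controlled experiments.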

References

"Our study identifies how much implicit bias manifests but does not trace it to pretraining data, fine-tuning, or alignment procedures: understanding why these biases arise remains open."

ImplicitBBQ: Benchmarking Implicit Bias in Large Language Models through Characteristic Based Cues  (2604.01925 - Vedula et al., 2 Apr 2026) in Section 7 (Limitations and Future Work)