ImplicitBBQ: Benchmarking Implicit Bias in Large Language Models through Characteristic Based Cues

Published 2 Apr 2026 in cs.CL and cs.AI | (2604.01925v1)

Abstract: LLMs increasingly suppress biased outputs when demographic identity is stated explicitly, yet may still exhibit implicit biases when identity is conveyed indirectly. Existing benchmarks use name based proxies to detect implicit biases, which carry weak associations with many social demographics and cannot extend to dimensions like age or socioeconomic status. We introduce ImplicitBBQ, a QA benchmark that evaluates implicit bias through characteristic based cues, culturally associated attributes that signal implicitly, across age, gender, region, religion, caste, and socioeconomic status. Evaluating 11 models, we find that implicit bias in ambiguous contexts is over six times higher than explicit bias in open weight models. Safety prompting and chain-of-thought reasoning fail to substantially close this gap; even few-shot prompting, which reduces implicit bias by 84%, leaves caste bias at four times the level of any other dimension. These findings indicate that current alignment and prompting strategies address the surface of bias evaluation while leaving culturally grounded stereotypic associations largely unresolved. We publicly release our code and dataset for model providers and researchers to benchmark potential mitigation techniques.

Abstract PDF Upgrade to Chat

Authors (5)

Summary

The paper presents ImplicitBBQ, a benchmark that evaluates implicit bias in LLMs using culturally salient cues instead of explicit demographic labels.
It shows that implicit cues significantly reduce model accuracy (0.16–0.39) and increase bias scores, revealing stereotype-concordant responses even in disambiguated contexts.
Mitigation strategies like few-shot in-context learning can reduce bias by up to 84%, yet deep-rooted biases in areas such as caste and SES persist.

ImplicitBBQ: Characterizing and Benchmarking Implicit Bias in LLMs

Introduction

This paper introduces ImplicitBBQ, a new benchmark designed to systematically evaluate implicit bias in LLMs via characteristic-based cues rather than explicit demographic labels or name-based proxies (2604.01925). The core premise is that while state-of-the-art LLMs often suppress stereotypic responses when demographics are stated explicitly, they may still express bias when group identity is signaled more subtly—mirroring the distinction between explicit and implicit bias in social psychology.

Traditional bias benchmarking in LLMs, such as BBQ and its successors, focus mainly on explicit demographic specification, sometimes relying on anthroponyms as proxies for implicit identity. However, names are noisy indicators of demographic groups, missing key associations for dimensions such as age and socioeconomic status, and introducing confounds or failing to generalize across cultural boundaries. Moreover, the use of names in prompts leads to ethical concerns and a lack of conceptual clarity regarding the true locus of bias (2604.01925).

Recent work suggests that while explicit bias in LLMs can be greatly reduced via alignment and prompt-based techniques, implicit bias—especially when identities are encoded through indirect signals—is both underexplored and more challenging to mitigate. The present benchmark directly investigates this challenge by crafting a cultural-attribute-based protocol for probing implicit associations, moving beyond simplistic name-based paradigms.

Benchmark Construction

ImplicitBBQ is derived from the established BBQ and BharatBBQ datasets. However, rather than using explicit demographic identifiers (e.g., "Muslim", "woman", "old man"), each context is paired with characteristic-based cues—culturally salient attributes (e.g., "prays namaz", "has wrinkles")—that act as implicit demographic signals.

These candidate cues are mined from ConceptNet via targeted relations, filtered to exclude direct references, and then validated through a rigorous human annotation pipeline ensuring consistency and cross-cultural validity (Fleiss' kappa ≥ 0.60). The final benchmark comprises ~8k QA instances spanning six demographic dimensions: age, gender, region, religion, caste, and socioeconomic status (SES).

The question structure is inherited from BBQ and BharatBBQ: each item occurs in both ambiguous (insufficient information—correct answer: "Unknown") and disambiguated (enough information—correct answer is indicated in context) variants, with both negative and non-negative target questions. Control settings (neutral characteristic cues, or no cues at all) are used to ensure that observed effects are genuinely driven by the demographic associations in cues.

Experimental Protocol

The study evaluates eleven LLMs, including eight open-weight models (e.g., Mistral, Llama, Qwen, GPT-OSS) and three closed-source models (Claude 4.5 Haiku, Gemini-3.1-Flash-Lite, GPT-5-mini). Models are tested under six conditions (explicit/implicit, ambiguous/disambiguous, and two controls).

For mitigation, three strategies are examined:

Safety prompting: Add explicit instructions to avoid bias.
Few-shot in-context learning (ICL): Provide multiple exemplars illustrating unbiased reasoning.
Chain-of-Thought (CoT): Require step-by-step reasoning prior to final answer.

Results: Explicit vs Implicit Bias

In explicit settings, all open-weight models perform with high accuracy (0.76–0.98) and minimal bias (mean bias < 0.08) on ambiguous contexts. In disambiguated contexts, performance is near-perfect and bias scores cluster close to zero—confirming that current LLMs substantially suppress explicit bias when directly cued.

However, the introduction of implicit (characteristic-based) cues reveals markedly higher bias: for ambiguous contexts, accuracy plunges (0.16–0.39) and bias scores are 6–10× higher than in explicit settings (mean bias ≈ 0.32). In disambiguated contexts, while accuracy mostly recovers, models still show a small but consistently stereotype-concordant directional bias.

Closed-weight models behave differently: accuracy drops on ambiguous implicit prompts, but with less systematic stereotypic preference relative to open models.

The control conditions rule out spurious responses due to prompt structure, as LLMs are both accurate and unbiased when cue content is semantically neutral.

Figure 1: Llama-3.1-8B-Instruct provides stereotypical answers under implicit cultural cue prompting but refuses or abstains with explicit demographic mentions or name proxies.

Figure 2: Under implicit prompting, Llama-3.1-8B-Instruct defaults to stereotypical responses, even when context is ambiguous or evidence is disambiguated.

Demographic Analysis

Bias is not evenly distributed across dimensions. Caste shows the highest implicit ambiguous bias scores (mean ≈ 0.59, up to 0.74 in some models), and evidence of residual explicit bias even in disambiguated contexts. This is a critical finding, given that caste bias has historically received less scrutiny than gender or race.

SES, Age, Religion, and Region display intermediate implicit bias (mean 0.20–0.42). Gender, although heavily studied in prior work, demonstrates the lowest overall bias—though significant residual bias persists in some models under implicit evaluation. There is no reliable correlation between model size and implicit bias: larger models are not consistently less susceptible.

Figure 3: Bias scores by demographic and context for all models—caste is consistently the most bias-prone dimension.

Efficacy of Prompt-based Mitigations

Mitigation effectiveness diverges significantly across strategies and dimensions:

Safety prompting reduces mean implicit bias by ~60%, but with strong model dependence.
Few-shot ICL is most effective (bias reduction ≈ 84%, mean residual bias 0.05), but cannot fully eliminate implicit bias, especially for caste (where it remains >4× other dimensions).
CoT prompting gives mixed results, generally less effective than few-shot and sometimes counterproductive.

Of particular note: for all strategies, the explicit–implicit bias gap narrows but never closes. Moreover, mitigation resistance in caste implicates deeper encoded associations, likely not addressable through prompt-level adaptation alone.

Figure 4: Efficacy comparison of mitigation strategies, showing few-shot ICL as most effective overall but with residual caste bias.

Methodological and Practical Implications

ImplicitBBQ demonstrates that the apparent bias mitigation success achieved by LLMs under explicit (or even name-proxy-based) evaluation is largely superficial. These models still encode and express significant implicit (cultural-cue-based) bias, especially toward less-studied dimensions such as caste.

Practically, this refutes evaluation protocols that rely solely on explicit exposure as a test of LLM alignment. Alignment pipelines and industry safety checks must account for the persistent presence of such hidden biases. It is clear that robust mitigation requires intervention at model architecture or data curation level rather than prompt engineering alone.

Theoretically, these results call for a reconceptualization of how demographic information is encoded, signaled, and decoded in generative models. Extension of ImplicitBBQ to disability, race, and finer sub-demographics is a necessary future step.

Conclusion

ImplicitBBQ fills a crucial gap in the bias evaluation landscape by enabling rigorous, scalable, and contextually rich probing of implicit, culturally-grounded bias. The findings demonstrate that prompting strategies—even when carefully tuned—cannot fully resolve implicit bias, especially regarding caste and SES, and that alignment via explicit label suppression gives a false sense of safety. Addressing implicit bias in LLMs will require fundamentally deeper mitigation methods at pretraining and fine-tuning stages, and continuous, nuanced evaluation using comprehensive and culturally relevant benchmarks such as ImplicitBBQ.

Markdown Report Issue