
BiasInEar Audio Bias Benchmark

Updated 8 February 2026
  • BiasInEar is a comprehensive benchmark that systematically measures linguistic, demographic, and structural bias in multilingual audio language models.
  • It employs controlled experiments with 11,200 synthetic audio questions across English, Chinese, and Korean, incorporating varied accents and genders.
  • Evaluations reveal that larger model scale and techniques like chain-of-thought (CoT) prompting significantly improve robustness and fairness.

The BiasInEar dataset is a comprehensive speech-augmented benchmark designed to systematically measure linguistic, demographic, and structural bias in multilingual audio LLMs (MLLMs). Built on the Global MMLU Lite framework, BiasInEar offers controlled, large-scale evaluation across English, Chinese, and Korean, with balanced coverage of gender and accent, and introduces rigorous quality control and multifactorial experimental protocols. It establishes a unified framework for quantifying fairness and robustness in speech-integrated LLMs, enabling direct comparison with text-based models and revealing nuanced patterns of sensitivity to language, accent, gender, and input structure (Wei et al., 1 Feb 2026).

1. Dataset Design and Composition

BiasInEar extends the Global MMLU Lite subset to the speech modality, covering multiple languages, accents, and gender-balanced voices. The dataset comprises 11,200 high-quality multiple-choice audio questions — 400 base questions per language, each rendered under every applicable audio configuration (28 language-specific configurations in total) — and totals 70.8 hours (approximately 4,249 minutes) of synthetic speech.

Language, Accent, and Gender Factors

| Variable | Levels |
|---|---|
| Language | English, Chinese, Korean |
| Accent | English: American, British, Indian; Chinese: Beijing, Northeastern; Korean: Seoul, Jeolla |
| Gender | Female, Male (via two synthetic voices per combination) |
| Option Order | Original, Reversed |

Each question is synthesized in every combination of language, accent, gender, and option order. This design allows isolation of both “demographic” (e.g., gender) and “structural” (option order) effects on MLLM behavior. Culturally Sensitive (CS) and Culturally Agnostic (CA) questions are included to interrogate both cross-cultural and neutral-item performance.
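The factorial design above can be enumerated directly. A minimal sketch (the accent inventory comes from the table; the helper name is illustrative):

```python
from itertools import product

# Accent inventory per language, as listed in the table above
ACCENTS = {
    "English": ["American", "British", "Indian"],
    "Chinese": ["Beijing", "Northeastern"],
    "Korean": ["Seoul", "Jeolla"],
}
GENDERS = ["Female", "Male"]
OPTION_ORDERS = ["Original", "Reversed"]

def configurations():
    """Yield every (language, accent, gender, option-order) combination."""
    for language, accents in ACCENTS.items():
        for accent, gender, order in product(accents, GENDERS, OPTION_ORDERS):
            yield (language, accent, gender, order)

configs = list(configurations())
print(len(configs))        # 28 configurations: 12 English + 8 Chinese + 8 Korean
print(400 * len(configs))  # 11,200 synthesized questions
```

With 400 base questions per language, the 28 language-specific configurations yield exactly the 11,200 items reported above.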

2. Construction Methodology

2.1 Spoken Style Question Rendering

Original text items are converted to unambiguous spoken format by GPT OSS 120B using eight explicit conversion rules. Examples include expanding mathematical expressions (HPO$_4^{2-}$ rendered as “hydrogen phosphate”) and clarifying list structures (permutations, placeholders).

2.2 TTS Synthesis and Audio Assembly

Speech audio is generated by the Gemini 2.5 Flash Preview TTS model. For each configuration, the question stem and four options are synthesized separately, combined in the target option order, and prefixed by audio instruction tokens ("question," "A," etc.). All files are standardized as single WAV recordings, with input length constrained to ≤ 30 seconds; longer items are chunked as needed.
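The per-segment assembly step can be sketched as follows; `assemble_question`, the list-based sample representation, and the token dictionary are illustrative stand-ins for the actual audio tooling (chunking of items longer than 30 seconds is omitted):

```python
def assemble_question(stem, options, order, token_audio):
    """Concatenate pre-synthesized segments into a single recording.

    `stem`, each `options[i]`, and each `token_audio[label]` are raw
    sample sequences (plain lists here); `order` is a permutation of
    option indices, e.g. [3, 2, 1, 0] for the reversed condition.
    """
    labels = ["A", "B", "C", "D"]
    # Instruction token "question" precedes the stem ...
    samples = token_audio["question"] + stem
    # ... then each option label token precedes its option audio,
    # in the target option order.
    for label, idx in zip(labels, order):
        samples = samples + token_audio[label] + options[idx]
    return samples
```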

2.3 Quality Control

Quality assurance applies both to the rewriting and TTS synthesis stages:

  • Text Rendering QC: Automated text normalization and differencing flags 23–29% of items, with manual inspection confirming a true annotation error rate of approximately 1–5%.
  • TTS QC: Two ASR systems (Whisper Large v3, Omnilingual ASR) automatically score samples by Word Error Rate (WER). For English, over 90% of samples have WER = 0; Chinese and Korean reach 67.6% and 68.8%, respectively. Manual rating shows >80% of sampled clips classified as "Correct" in all languages.
| | English | Chinese | Korean |
|---|---|---|---|
| WER = 0 | 90.7% | 67.6% | 68.8% |
| Rated Correct | 82.7% | 86.0% | 75.0% |
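WER itself is the standard word-level edit distance normalized by reference length. A minimal implementation for illustration:

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

A clip counts toward the “WER = 0” row exactly when the ASR transcript matches the reference word for word.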

3. Evaluation Framework and Metrics

BiasInEar employs four complementary metrics to capture accuracy, confidence, stability, and agreement under systematic perturbations:

  • Accuracy: Percentage of questions for which the model’s top prediction matches the gold label.
  • Shannon Entropy ($H_q$): Entropy of the model’s output probability distribution over the options, normalized with a base-4 logarithm for comparability, so that $H_q \in [0, 1]$.
  • Average Pairwise Entropy Shift ($\mathrm{APES}_q^v$): Quantifies how the model’s uncertainty (entropy) changes across the levels $l$ of a perturbation variable $v$.

$$\mathrm{APES}_q^v = \frac{2}{L(L-1)} \sum_{1 \le i < j \le L} \left| H_q^{l_i} - H_q^{l_j} \right|$$
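Both entropy-based quantities can be sketched in a few lines (function names are illustrative):

```python
import math

def normalized_entropy(probs, base=4):
    """Shannon entropy of an option distribution, base-4 log so H_q is in [0, 1]."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

def apes(entropies):
    """Average Pairwise Entropy Shift over the L levels of one variable."""
    L = len(entropies)
    pairs = [abs(entropies[i] - entropies[j])
             for i in range(L) for j in range(i + 1, L)]
    return sum(pairs) / len(pairs)  # len(pairs) == L * (L - 1) / 2
```

Dividing by the number of pairs, $L(L-1)/2$, is equivalent to the $\frac{2}{L(L-1)}$ prefactor in the definition.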

  • Fleiss’ $\kappa$: Measures categorical consistency of answer choices across perturbations; values near 1 indicate near-perfect agreement, values near 0 agreement no better than chance, and values below 0 systematic disagreement:

$$\kappa = \frac{\bar{P} - P_e}{1 - P_e}$$

where $\bar{P}$ is the mean observed agreement across questions and $P_e$ is the expected agreement by chance.

Each metric is selected to isolate a distinct aspect of robustness: overall correctness, probabilistic confidence, stability under condition changes, and label consistency.
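Fleiss’ $\kappa$ can be computed from per-question answer counts, treating each perturbation condition as a rater. A minimal sketch (the count-matrix layout is an assumption):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa. ratings[i][c] counts how many of the n perturbation
    conditions ("raters") chose option c on question i."""
    N = len(ratings)                 # number of questions
    n = sum(ratings[0])              # conditions per question
    k = len(ratings[0])              # number of answer categories
    # Mean observed agreement P-bar across questions
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in ratings) / N
    # Expected chance agreement P_e from marginal category proportions
    p = [sum(row[c] for row in ratings) / (N * n) for c in range(k)]
    P_e = sum(pc * pc for pc in p)
    return (P_bar - P_e) / (1 - P_e)
```

Perfect cross-condition agreement gives $\kappa = 1$; an even split of answers across options drives $\kappa$ below zero, the regime reported for option-order perturbations.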

4. Experimental Protocol

4.1 Perturbation Dimensions

BiasInEar systematically explores:

  • Linguistic perturbations: full language switches and within-language accents.
  • Demographic perturbations: gender via synthetic male/female voices.
  • Structural perturbations: option order (original vs. reversed), with additional reorderings studied in supplementary analyses.

Each question is thus presented to models in up to 28 configurations, facilitating fine-grained robustness and fairness analyses.

4.2 Model Coverage and Reasoning

Nine MLLMs are evaluated: commercial models (Gemini 2.5 Flash/Lite, Gemini 2.0 Flash/Lite) and open-source models (Gemma 3n E4B/E2B, Voxtral Small/Mini, Phi 4 Multimodal), spanning end-to-end and pipeline architectures. Both standard and chain-of-thought (CoT) reasoning prompts are tested; in some cases, explicit audio-to-transcript-to-answer pipelines are contrasted with direct audio-to-answer inference.

5. Principal Results

5.1 Sensitivity to Perturbations

  • MLLMs are most robust to gender and accent perturbations (high $\kappa$, low APES).
  • Language shifts reduce stability (intermediate $\kappa$ and APES).
  • Option-order perturbation produces severe instability (negative $\kappa$, high APES), with the original order outperforming the reversed order by 0.5–10 absolute accuracy points across all models.
  • Culturally Agnostic items exhibit consistently lower entropy than Culturally Sensitive ones; option order yields the steepest CS–CA entropy gap.

5.2 Impact of Architecture and Reasoning

  • Larger Gemini variants yield improved robustness ($\kappa$, APES) across most factors.
  • CoT prompts increase $\kappa$ by approximately 19–27% and decrease APES by 5–9%.
  • Pipeline design (audio→ASR→LLM) further boosts agreement and confidence stability compared to end-to-end audio processing.

5.3 Speech vs. Text Processing

APES is uniformly higher for audio input than for text input under both language and option order perturbations. This indicates that speech amplifies—without fundamentally altering—sensitivities present in text-based models.

6. Implications and Directions for Future Research

BiasInEar establishes a platform for bridging text and speech model evaluation, supporting multi-factor analyses of model bias. Recommendations include explicitly managing paralinguistic features via transcription, employing complex reasoning prompts (e.g., CoT), increasing model scale and pretraining diversity, and instituting regular audits for audio-specific structural biases. Future work aims to include additional languages/accents, real-world (non-TTS) audio, expanded structural manipulations, and the integration of explicit fairness objectives or adversarial invariance training.

All resources, audio, and evaluation protocols are available at https://github.com/ntunlplab/BiasInEar (Wei et al., 1 Feb 2026).
