
BiasInEar Audio Bias Benchmark

Updated 8 February 2026
  • BiasInEar is a comprehensive benchmark that systematically measures linguistic, demographic, and structural bias in multilingual audio language models.
  • It employs controlled experiments with 11,200 synthetic audio questions across English, Chinese, and Korean, incorporating varied accents and genders.
  • Evaluations reveal that larger model scale and techniques like chain-of-thought (CoT) prompting significantly improve robustness and fairness.

The BiasInEar dataset is a comprehensive speech-augmented benchmark designed to systematically measure linguistic, demographic, and structural bias in multilingual audio LLMs (MLLMs). Built on the Global MMLU Lite framework, BiasInEar offers controlled, large-scale evaluation across English, Chinese, and Korean, with balanced coverage of gender and accent, and introduces rigorous quality control and multifactorial experimental protocols. It establishes a unified framework for quantifying fairness and robustness in speech-integrated LLMs, enabling direct comparison with text-based models and revealing nuanced patterns of sensitivity to language, accent, gender, and input structure (Wei et al., 1 Feb 2026).

1. Dataset Design and Composition

BiasInEar extends the Global MMLU Lite subset to the speech modality, covering multiple languages, accents, and gender-balanced voices. The dataset comprises 11,200 high-quality multiple-choice audio questions — 400 base questions per language, each rendered under every applicable audio configuration (28 language-specific configurations in total) — and totals 70.8 hours (approximately 4,249 minutes) of synthetic speech.

Language, Accent, and Gender Factors

| Variable | Levels |
|---|---|
| Language | English, Chinese, Korean |
| Accent | English: American, British, Indian; Chinese: Beijing, Northeastern; Korean: Seoul, Jeolla |
| Gender | Female, Male (via two synthetic voices per combination) |
| Option Order | Original, Reversed |

Each question is synthesized in every combination of language, accent, gender, and option order. This design allows isolation of both “demographic” (e.g., gender) and “structural” (option order) effects on MLLM behavior. Culturally Sensitive (CS) and Culturally Agnostic (CA) questions are included to interrogate both cross-cultural and neutral-item performance.
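The factorial design above can be enumerated directly. A minimal sketch (the accent inventory comes from the table; the helper name is illustrative):

```python
from itertools import product

# Accent inventory per language, as listed in the table above
ACCENTS = {
    "English": ["American", "British", "Indian"],
    "Chinese": ["Beijing", "Northeastern"],
    "Korean": ["Seoul", "Jeolla"],
}
GENDERS = ["Female", "Male"]
OPTION_ORDERS = ["Original", "Reversed"]

def configurations():
    """Yield every (language, accent, gender, option-order) combination."""
    for language, accents in ACCENTS.items():
        for accent, gender, order in product(accents, GENDERS, OPTION_ORDERS):
            yield (language, accent, gender, order)

configs = list(configurations())
print(len(configs))        # 28 configurations: 12 English + 8 Chinese + 8 Korean
print(400 * len(configs))  # 11,200 synthesized questions
```

With 400 base questions per language, the 28 language-specific configurations yield exactly the 11,200 items reported above.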

2. Construction Methodology

2.1 Spoken Style Question Rendering

Original text items are converted to unambiguous spoken format by GPT OSS 120B using eight explicit conversion rules. Examples include expanding mathematical expressions (HPO$_4^{2-}$ rendered as “hydrogen phosphate”) and clarifying list structures (permutations, placeholders).

2.2 TTS Synthesis and Audio Assembly

Speech audio is generated by the Gemini 2.5 Flash Preview TTS model. For each configuration, the question stem and four options are synthesized separately, combined in the target option order, and prefixed by audio instruction tokens ("question," "A," etc.). All files are standardized as single WAV recordings, with input length constrained to ≤ 30 seconds; longer items are chunked as needed.
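The per-segment assembly step can be sketched as follows; `assemble_question`, the list-based sample representation, and the token dictionary are illustrative stand-ins for the actual audio tooling (chunking of items longer than 30 seconds is omitted):

```python
def assemble_question(stem, options, order, token_audio):
    """Concatenate pre-synthesized segments into a single recording.

    `stem`, each `options[i]`, and each `token_audio[label]` are raw
    sample sequences (plain lists here); `order` is a permutation of
    option indices, e.g. [3, 2, 1, 0] for the reversed condition.
    """
    labels = ["A", "B", "C", "D"]
    # Instruction token "question" precedes the stem ...
    samples = token_audio["question"] + stem
    # ... then each option label token precedes its option audio,
    # in the target option order.
    for label, idx in zip(labels, order):
        samples = samples + token_audio[label] + options[idx]
    return samples
```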

2.3 Quality Control

Quality assurance applies both to the rewriting and TTS synthesis stages:

  • Text Rendering QC: Automated text normalization and differencing flags 23–29% of items, with manual inspection confirming a true annotation error rate of approximately 1–5%.
  • TTS QC: Two ASR systems (Whisper Large v3, Omnilingual ASR) automatically score samples by Word Error Rate (WER). For English, over 90% of samples have WER = 0; Chinese and Korean reach 67.6% and 68.8%, respectively. Manual rating shows >80% of sampled clips classified as "Correct" in all languages.
| | English | Chinese | Korean |
|---|---|---|---|
| WER = 0 | 90.7% | 67.6% | 68.8% |
| Rated Correct | 82.7% | 86.0% | 75.0% |
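WER itself is the standard word-level edit distance normalized by reference length. A minimal implementation for illustration:

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

A clip counts toward the “WER = 0” row exactly when the ASR transcript matches the reference word for word.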

3. Evaluation Framework and Metrics

BiasInEar employs four complementary metrics to capture accuracy, confidence, stability, and agreement under systematic perturbations:

  • Accuracy: Percentage of questions for which the model’s top prediction matches the gold label.
  • Shannon Entropy ($H_q$): Entropy of the model’s output probability distribution over the options, normalized with a base-4 logarithm for comparability, so that $H_q \in [0, 1]$.
  • Average Pairwise Entropy Shift ($\mathrm{APES}_q^v$): Quantifies how the model’s uncertainty (entropy) changes across the levels $l$ of a perturbation variable $v$.

$$\mathrm{APES}_q^v = \frac{2}{L(L-1)} \sum_{1 \le i < j \le L} \left| H_q^{l_i} - H_q^{l_j} \right|$$
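Both entropy-based quantities can be sketched in a few lines (function names are illustrative):

```python
import math

def normalized_entropy(probs, base=4):
    """Shannon entropy of an option distribution, base-4 log so H_q is in [0, 1]."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

def apes(entropies):
    """Average Pairwise Entropy Shift over the L levels of one variable."""
    L = len(entropies)
    pairs = [abs(entropies[i] - entropies[j])
             for i in range(L) for j in range(i + 1, L)]
    return sum(pairs) / len(pairs)  # len(pairs) == L * (L - 1) / 2
```

Dividing by the number of pairs, $L(L-1)/2$, is equivalent to the $\frac{2}{L(L-1)}$ prefactor in the definition.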

  • Fleiss’ $\kappa$: Measures categorical consistency of answer choices across perturbations; values near 1 indicate near-perfect agreement, values near 0 agreement no better than chance, and values below 0 systematic disagreement:

$$\kappa = \frac{\bar{P} - P_e}{1 - P_e}$$

where $\bar{P}$ is the mean observed agreement across questions and $P_e$ is the expected agreement by chance.

Each metric is selected to isolate a distinct aspect of robustness: overall correctness, probabilistic confidence, stability under condition changes, and label consistency.
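Fleiss’ $\kappa$ can be computed from per-question answer counts, treating each perturbation condition as a rater. A minimal sketch (the count-matrix layout is an assumption):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa. ratings[i][c] counts how many of the n perturbation
    conditions ("raters") chose option c on question i."""
    N = len(ratings)                 # number of questions
    n = sum(ratings[0])              # conditions per question
    k = len(ratings[0])              # number of answer categories
    # Mean observed agreement P-bar across questions
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in ratings) / N
    # Expected chance agreement P_e from marginal category proportions
    p = [sum(row[c] for row in ratings) / (N * n) for c in range(k)]
    P_e = sum(pc * pc for pc in p)
    return (P_bar - P_e) / (1 - P_e)
```

Perfect cross-condition agreement gives $\kappa = 1$; an even split of answers across options drives $\kappa$ below zero, the regime reported for option-order perturbations.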

4. Experimental Protocol

4.1 Perturbation Dimensions

BiasInEar systematically explores:

  • Linguistic perturbations: full language switches and within-language accents.
  • Demographic perturbations: gender via synthetic male/female voices.
  • Structural perturbations: option order (original vs. reversed), with additional reorderings studied in supplementary analyses.

Each question is thus presented to models in up to 28 configurations, facilitating fine-grained robustness and fairness analyses.

4.2 Model Coverage and Reasoning

Nine MLLMs are evaluated: commercial models (Gemini 2.5 Flash/Lite, Gemini 2.0 Flash/Lite) and open-source models (Gemma 3n E4B/E2B, Voxtral Small/Mini, Phi 4 Multimodal), spanning end-to-end and pipeline architectures. Both standard and chain-of-thought (CoT) reasoning prompts are tested; in some cases, explicit audio-to-transcript-to-answer pipelines are contrasted with direct audio-to-answer inference.

5. Principal Results

5.1 Sensitivity to Perturbations

  • MLLMs are most robust to gender and accent perturbations (high $\kappa$, low APES).
  • Language shifts reduce stability (intermediate $\kappa$ and APES).
  • Option-order perturbation produces severe instability (negative $\kappa$, high APES), with the original order outperforming the reversed order by 0.5–10 absolute accuracy points across all models.
  • Culturally Agnostic items exhibit consistently lower entropy than Culturally Sensitive ones; option order yields the steepest CS–CA entropy gap.

5.2 Impact of Architecture and Reasoning

  • Larger Gemini variants yield improved robustness ($\kappa$, APES) across most factors.
  • CoT prompts increase $\kappa$ by approximately 19–27% and decrease APES by 5–9%.
  • Pipeline design (audio→ASR→LLM) further boosts agreement and confidence stability compared to end-to-end audio processing.

5.3 Speech vs. Text Processing

APES is uniformly higher for audio input than for text input under both language and option order perturbations. This indicates that speech amplifies—without fundamentally altering—sensitivities present in text-based models.

6. Implications and Directions for Future Research

BiasInEar establishes a platform for bridging text and speech model evaluation, supporting multi-factor analyses of model bias. Recommendations include explicitly managing paralinguistic features via transcription, employing complex reasoning prompts (e.g., CoT), increasing model scale and pretraining diversity, and instituting regular audits for audio-specific structural biases. Future work aims to include additional languages/accents, real-world (non-TTS) audio, expanded structural manipulations, and the integration of explicit fairness objectives or adversarial invariance training.

All resources, audio, and evaluation protocols are available at https://github.com/ntunlplab/BiasInEar (Wei et al., 1 Feb 2026).
