- The paper finds that Newcastle dialect features such as phonological variations and local lexical items significantly drive ASR errors, more so than social factors.
- A two-stage analysis combining manual coding and automated metrics was used to evaluate four ASR systems on a DECTE subsample; Rev AI, benchmarked at roughly 32% WER, was selected for detailed analysis.
- The findings underscore the need for ASR systems to incorporate regional dialects into training data to mitigate bias and enhance recognition fairness.
Error Analysis of ASR Biases in Newcastle English
This paper delivers a detailed error analysis of automatic speech recognition (ASR) performance on Newcastle English, addressing the underexplored area of regional dialect bias in ASR systems. Using the Diachronic Electronic Corpus of Tyneside English (DECTE) and a two-stage analysis combining manual and automated methods, the authors dissect how dialectal variation, rather than social parameters (such as age or gender), predominantly drives ASR errors.
Methodological Framework
The evaluation begins with a comparison of four ASR systems on a 10% DECTE subsample: Google Cloud Speech-to-Text, Deepgram Voice AI, CrisperWhisper, and Rev AI. Through WER benchmarking, Rev AI (configured for UK English) outperforms the others, achieving a WER of approximately 32% over the full dataset, and is therefore selected for further analysis.
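The WER benchmarking step can be sketched in a few lines. This is a minimal, generic implementation of word error rate via Levenshtein distance over word tokens, not the paper's actual evaluation code; the example sentences are invented.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

A single substitution such as "nowt" being transcribed as "not" in a four-word utterance yields a WER of 0.25, which is how per-system scores like Rev AI's 32% are aggregated over a test set.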
Error classification is systematically structured. The analysis distinguishes phonological, lexical, morphosyntactic, standardisation, and spelling errors, then sub-categorises them by the specific dialectal phenomena involved. This granularity enables identification of precise points of failure related to Newcastle English features. For the morphosyntactic analysis, the study employs custom Python scripts to automate the extraction of misrecognitions of salient regional pronouns ("yous," "wor"), followed by mixed-effects logistic regression to assess the influence of demographic and acoustic variables.
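The pronoun-extraction step might look something like the following. This is a hedged sketch, not the authors' script: it assumes transcripts have already been word-aligned into (reference, hypothesis) pairs, with `None` marking a deletion, and scans for the target pronouns named in the paper.

```python
# Salient Newcastle pronouns targeted in the paper's morphosyntactic analysis.
TARGET_PRONOUNS = {"yous", "wor"}

def pronoun_misrecognitions(aligned_pairs):
    """Collect misrecognised target pronouns from word-aligned transcripts.

    aligned_pairs: iterable of (reference_word, hypothesis_word) tuples,
    where hypothesis_word may be None if the word was deleted.
    Returns a list of (reference_word, hypothesis_word) error pairs.
    """
    errors = []
    for ref_w, hyp_w in aligned_pairs:
        if ref_w.lower() in TARGET_PRONOUNS and ref_w.lower() != (hyp_w or "").lower():
            errors.append((ref_w, hyp_w))
    return errors
```

Output from a pass like this (pronoun, substituted form, plus speaker metadata) would then feed the mixed-effects regression as the dependent variable.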
Empirical Findings
Quantitative and Qualitative Error Breakdown:
The data indicate that phonological and lexical features, particularly monophthongisation, glottalisation, vowel quality, and local lexical items (e.g., "nowt", "owt"), are the primary sources of ASR errors. Glottalisation of /t/ alone accounts for 23% of phonological errors, and monophthongisation of FACE vowels for 17.5%. Morphosyntactic errors are heavily concentrated around nonstandard pronouns (notably "yous" and "wor") and verb paradigms characteristic of Newcastle English, with standard ASR models often defaulting to southern British standard forms or producing nonsensical substitutions.
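Figures such as the 23% and 17.5% shares come from tallying manually coded errors into per-category percentages. A minimal sketch of that tabulation step, with invented labels and counts rather than the paper's data:

```python
from collections import Counter

def error_percentages(coded_errors):
    """Turn a list of error-category labels into percentage shares."""
    counts = Counter(coded_errors)
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}
```

Each coded error instance contributes one label (e.g., "glottalisation", "face_monophthong"), and the shares are reported per error class.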
Analysis of Social Factors:
Results show that social factors (gender, age) have smaller but non-negligible effects compared to dialectal ones. Male speakers, who tend to use more nonstandard local forms due to covert prestige, incur a higher rate of lexical and certain phonological errors. Age-based analysis uncovers a non-monotonic pattern: younger and older speakers—groups more likely to use nonstandard forms due to sociolinguistic age grading—are more prone to ASR errors, particularly on lexical items and regional pronouns. The regression analysis on "wor" recognition highlights that only the age group 21–40 achieves a statistically significant improvement in recognition accuracy, while higher noise levels consistently reduce accuracy.
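The age-group comparison behind these findings can be illustrated with a simplified per-group accuracy tabulation. The paper's actual analysis is a mixed-effects logistic regression with demographic and acoustic predictors; this sketch, using invented records, only shows the descriptive grouping step that precedes it.

```python
def accuracy_by_group(records):
    """Per-group recognition accuracy.

    records: iterable of (group_label, correctly_recognised: bool) pairs,
    e.g. ("21-40", True) for one correctly recognised token of "wor".
    """
    totals, hits = {}, {}
    for group, correct in records:
        totals[group] = totals.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + int(correct)
    return {g: hits[g] / totals[g] for g in totals}
```

In the full model, group membership enters as a fixed effect alongside noise level, with per-speaker random effects; only the 21–40 coefficient reached significance.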
Standardisation and Spelling:
ASR systems consistently attempt to "standardise" regional features, converting "me life" to "my life" and "telly" to "television," which not only misrepresents speaker intent but contributes to loss of dialectal information. Spelling errors, while present, are primarily transatlantic substitutions and relatively infrequent.
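Standardisation errors of this kind can be flagged automatically once transcripts are word-aligned. The sketch below assumes a hand-built dialect-to-standard lookup table; the "me"→"my" and "telly"→"television" pairs are from the paper, while the table's structure and the "nowt"→"nothing" entry are illustrative assumptions.

```python
# Hypothetical lookup of dialect forms and their "standardised" replacements.
# "me life" -> "my life" and "telly" -> "television" are attested in the paper;
# "nowt" -> "nothing" is an assumed additional entry.
STANDARDISATIONS = {
    "me": "my",          # dialectal possessive "me"
    "telly": "television",
    "nowt": "nothing",
}

def flag_standardisations(aligned_pairs):
    """Return (dialect_word, standardised_word) pairs where the ASR output
    replaced a regional form with its standard equivalent."""
    flagged = []
    for ref_w, hyp_w in aligned_pairs:
        if STANDARDISATIONS.get(ref_w.lower()) == (hyp_w or "").lower():
            flagged.append((ref_w, hyp_w))
    return flagged
```

Counting these separately from ordinary substitutions is what lets the study distinguish "standardisation" as its own error class.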
Implications
For ASR Engineering:
The results have concrete implications for ASR system development and deployment. Existing ASR models, even those considered state-of-the-art, are shown to be trained on data insufficiently representative of dialectal diversity. Failure to address dialectal variation perpetuates both technological exclusion and linguistic marginalisation. Incorporation of Newcastle English (and by extension, other regional dialects) into training datasets is essential for equitable ASR performance.
Automated, linguistically informed dialectal feature tagging will be necessary for scalable, fine-grained error diagnosis and for the targeted augmentation of training data. Additionally, the analysis reinforces recent arguments to move beyond word error rate (WER) as the sole performance metric; qualitative and context-aware metrics are crucial when evaluating ASR fairness and representativeness.
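A first step toward such automated tagging is a pattern-based lexicon pass over transcripts. This is a minimal sketch under stated assumptions: the feature lexicon and tag names are ours, though the listed items ("nowt", "owt", "yous", "wor") are salient Newcastle forms discussed in the paper.

```python
import re

# Hypothetical feature lexicon mapping tags to token patterns.
FEATURE_PATTERNS = {
    "lexical": re.compile(r"\b(nowt|owt|canny|gan)\b", re.IGNORECASE),
    "pronoun": re.compile(r"\b(yous|wor)\b", re.IGNORECASE),
}

def tag_dialect_features(transcript):
    """Return {feature_tag: [matched tokens]} for one transcript line."""
    tags = {}
    for tag, pattern in FEATURE_PATTERNS.items():
        hits = pattern.findall(transcript)
        if hits:
            tags[tag] = hits
    return tags
```

A production tagger would need phonological features from the audio as well, which a lexicon pass cannot capture; the point here is only that lexical and pronominal features are cheap to tag at scale.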
Sociolinguistic Insights:
From a sociolinguistic perspective, the findings confirm that ASR biases are systematically linked to well-studied regional linguistic features. Age grading, lexical innovation, and covert prestige, phenomena extensively described in sociolinguistics, now bear directly on speech technology outcomes. This invites further cross-disciplinary research at the ASR/sociolinguistics interface, particularly in modelling, annotation strategies, and performance evaluation.
Future Research Directions
- Dialectal Data Diversification: Curate and open-source larger, systematically annotated datasets of regional speech for ASR training and evaluation.
- Feature-based Model Adaptation: Incorporate explicit dialectal features into acoustic and language model inputs; consider feature-aware transfer learning and pretraining approaches.
- Automated Dialectal Error Tagging: Develop robust, scalable tools for error type classification leveraging recent progress in speech and text alignment, as well as dialect identification research.
- Beyond WER Evaluation: Advance new fairness and utility metrics aligned with sociolinguistic diversity and community needs.
- End-user Personalisation: Investigate adaptive ASR approaches that can conditionally switch or fine-tune models according to speaker dialect, possibly in real time.
Conclusions
This study substantiates that current ASR systems exhibit systematic recognition failures on Newcastle English, with errors tightly coupled to regionally marked phonological, lexical, and morphosyntactic features. The analysis underscores the importance of dialectal representation in ASR training and the necessity of integrating sociolinguistic methodologies into speech technology evaluation pipelines. Future ASR fairness and accuracy improvements are likely to emerge from greater cross-pollination between engineering and sociolinguistic traditions, combined with concerted efforts in data collection, error analysis, and model adaptation.