- The paper introduces DialectGen, a novel benchmark that quantifies performance drops of 32%–48% in multimodal models when processing dialectal variations.
- It details an encoder-based mitigation strategy that combines dialect learning, polysemy control, and KL regularization losses to align dialectal and SAE representations without sacrificing polysemy handling.
- Empirical results show a +34.4% improvement in dialect performance on Stable Diffusion 1.5 with minimal loss in Standard American English (SAE) performance, demonstrating a practical path toward equitable multimodal generation.
DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation
Motivation and Problem Statement
Multimodal generative models, particularly text-to-image and text-to-video systems, are increasingly deployed in real-world applications where users interact using diverse linguistic varieties. English, as a contact language, exhibits substantial dialectal variation, yet the majority of pretraining corpora and model development pipelines focus on Standard American English (SAE) and, to a lesser extent, British English (BrE). This leads to a systematic underrepresentation of lower-resource dialects, resulting in significant performance disparities and allocational harms for dialect speakers. The paper introduces DialectGen, a large-scale benchmark and mitigation framework to systematically evaluate and improve the dialect robustness of multimodal generative models.
Figure 1: Multimodal generative model outputs on semantically identical prompts differing only in a single synonymous lexical feature in Standard American English (top) and a lower-resource English dialect (bottom).
DialectGen Benchmark: Dataset Construction and Validation
DialectGen comprises over 4,200 rigorously validated prompts spanning six English dialects: SAE, BrE, Chicano English (ChE), Indian English (InE), Singaporean English (SgE), and African American English (AAE). The dataset construction pipeline involves:
- Lexeme Selection: Extraction of 1,126 dialectal lexemes and their SAE equivalents from authoritative regional dictionaries.
- Prompt Generation: Use of GPT-4o to generate concise (≤6 words) and detailed (≥9 words) visual scene prompts for each lexeme pair, ensuring the dialect lexeme is central to the scene.
- Human Validation: Dual annotation by dialect speakers via a custom Amazon Mechanical Turk interface, including a dialect assessment quiz to ensure annotator-dialect alignment.
Figure 2: The Amazon Mechanical Turk data annotation interface for dialect speaker human filtering of generated prompts.
Figure 3: The English Dialect Speaker Assessment Quiz for matching annotators to dialects.
Prompts are retained only if both annotators confirm exact synonymy and lack of ambiguity, especially for polysemous lexemes. This process results in a high-quality, controlled dataset suitable for robust benchmarking.
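The retention rule is simple to state precisely. Below is a minimal sketch of a DialectGen-style record and its filter; the field names and schema are illustrative assumptions, not the released dataset format.

```python
from dataclasses import dataclass

@dataclass
class PromptPair:
    """One benchmark entry: a dialect prompt and its SAE-equivalent prompt.
    Schema and field names are illustrative, not the released format."""
    dialect: str                         # e.g. "AAE", "InE", "SgE"
    lexeme: str                          # dialectal lexeme central to the scene
    sae_lexeme: str                      # its Standard American English synonym
    dialect_prompt: str                  # prompt containing the dialectal lexeme
    sae_prompt: str                      # same prompt with the SAE synonym swapped in
    annotator_votes: tuple[bool, bool]   # each annotator's "exact synonym, unambiguous" judgment

def keep(pair: PromptPair) -> bool:
    """Retain a prompt pair only if both dialect-speaker annotators
    confirm exact synonymy and lack of ambiguity."""
    return all(pair.annotator_votes)
```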
Empirical Evaluation: Model Robustness Analysis
Seventeen state-of-the-art text-to-image and text-to-video generative models are evaluated on DialectGen using automatic metrics (VQAScore, CLIPScore) and human alignment judgments. The evaluation protocol computes the relative performance drop when a single dialectal lexeme replaces its SAE counterpart in a prompt (a minimal sketch of the metric follows).
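This sketch assumes a scorer such as VQAScore or CLIPScore has already produced per-prompt alignment scores; the paper's exact aggregation may differ.

```python
def relative_drop(score_sae: float, score_dialect: float) -> float:
    """Relative performance drop when one dialectal lexeme replaces its SAE
    counterpart; positive values mean the dialect prompt scored worse."""
    return (score_sae - score_dialect) / score_sae

def mean_relative_drop(paired_scores) -> float:
    """paired_scores: iterable of (sae_score, dialect_score) tuples for a
    dialect's benchmark split, e.g. from VQAScore on generated images."""
    drops = [relative_drop(s, d) for s, d in paired_scores]
    return sum(drops) / len(drops)
```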
Key findings:
- Severe Performance Degradation: SOTA models exhibit 32.26% to 48.17% performance drops on dialectal prompts, with the most pronounced failures for ChE and InE. BrE, being higher-resource, incurs smaller drops.
- Prompt Length Sensitivity: Concise prompts reveal larger robustness gaps than detailed prompts, indicating that models rely heavily on context to disambiguate dialectal terms.
- Polysemy Failure Modes: Models systematically default to SAE interpretations of polysemous lexemes, even when contextually implausible.
Baseline Mitigation Strategies and Limitations
The study benchmarks several mitigation strategies:
- UNet Fine-tuning: Standard practice in diffusion models, but yields only marginal dialect gains (<7%) and often degrades SAE and polysemy performance.
- Prompt Revision/Translation: LLM-based prompt rewriting or translation into SAE improves dialect performance by small margins, but does not address the underlying representation gap (a simplified sketch follows this list).
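To make the prompt-translation baseline concrete, here is a deliberately simplified stand-in that swaps known dialectal lexemes for SAE synonyms via table lookup; the paper's baseline uses an LLM rewriter, and the dictionary entries below are illustrative:

```python
import re

# Hypothetical lexeme table; the actual baseline prompts an LLM to rewrite.
DIALECT_TO_SAE = {
    "brinjal": "eggplant",       # Indian English
    "torchlight": "flashlight",  # Indian English
}

def translate_to_sae(prompt: str) -> str:
    """Replace whole-word dialectal lexemes with their SAE synonyms."""
    for dialect_word, sae_word in DIALECT_TO_SAE.items():
        prompt = re.sub(rf"\b{re.escape(dialect_word)}\b", sae_word,
                        prompt, flags=re.IGNORECASE)
    return prompt

print(translate_to_sae("sliced brinjal lit by a torchlight"))
# -> "sliced eggplant lit by a flashlight"
```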
Encoder-Based Mitigation: Methodology
The core contribution is a general encoder-based mitigation strategy targeting the text encoder of diffusion-based generative models. The approach integrates three loss components (a code sketch follows Figure 4):
- Dialect Learning Loss (L_DL): Minimizes the cosine distance between the target encoder's embedding of a dialect prompt and the frozen encoder's embedding of its SAE synonym, aligning their semantic representations.
- Polysemy Control Loss (L_PC): Preserves the model's ability to distinguish polysemous lexemes in SAE contexts by minimizing the distance between target and frozen encoder embeddings for SAE polysemous prompts.
- KL Regularization Loss (L_KL): Maintains general SAE performance by aligning the output distribution of the target encoder with that of the frozen reference encoder, using surrogate logits derived from similarity scores in a shared image-text embedding space.
Figure 4: Losses used in the mitigation strategy, with text prompts for Dialect Learning and Polysemy Control from DialectGen and image-caption pairs for KL Regularization from MSCOCO.
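A minimal PyTorch sketch of the three terms, assuming CLIP-style text and image encoders exposed as callables (target_enc, frozen_enc, image_enc) that return embedding batches; these interfaces, the temperature tau, and the final weighting are assumptions rather than the authors' released code:

```python
import torch
import torch.nn.functional as F

def mitigation_losses(target_enc, frozen_enc, image_enc,
                      dialect_prompts, sae_prompts,   # synonymous pairs (DialectGen)
                      polysemy_prompts,               # SAE prompts with polysemous lexemes
                      coco_images, coco_captions,     # image-caption pairs (MSCOCO)
                      tau=0.07):
    # Dialect Learning (L_DL): pull the target encoder's dialect embedding
    # toward the frozen encoder's embedding of the SAE synonym prompt.
    e_dialect = target_enc(dialect_prompts)                       # (B, d)
    with torch.no_grad():
        e_sae_ref = frozen_enc(sae_prompts)
    l_dl = (1 - F.cosine_similarity(e_dialect, e_sae_ref, dim=-1)).mean()

    # Polysemy Control (L_PC): keep SAE polysemous prompts where the
    # frozen encoder already places them.
    e_poly = target_enc(polysemy_prompts)
    with torch.no_grad():
        e_poly_ref = frozen_enc(polysemy_prompts)
    l_pc = (1 - F.cosine_similarity(e_poly, e_poly_ref, dim=-1)).mean()

    # KL Regularization (L_KL): surrogate logits are image-text similarities
    # in the shared embedding space; match the target encoder's distribution
    # over captions to the frozen encoder's distribution.
    with torch.no_grad():
        img = F.normalize(image_enc(coco_images), dim=-1)         # (B, d)
        txt_ref = F.normalize(frozen_enc(coco_captions), dim=-1)
        p_ref = F.softmax(img @ txt_ref.T / tau, dim=-1)          # (B, B)
    txt_tgt = F.normalize(target_enc(coco_captions), dim=-1)
    log_p_tgt = F.log_softmax(img @ txt_tgt.T / tau, dim=-1)
    l_kl = F.kl_div(log_p_tgt, p_ref, reduction="batchmean")

    return l_dl, l_pc, l_kl   # combined with weights chosen on validation data
```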
Quantitative and Qualitative Results
The encoder-based mitigation strategy achieves a +34.4% improvement in dialect performance on Stable Diffusion 1.5 while largely preserving SAE and polysemy performance, consistent with the headline results summarized at the top of this article.
Implementation Considerations
- Computational Efficiency: Fine-tuning the text encoder with the proposed losses is computationally efficient, requiring less than one hour on a single NVIDIA RTX A6000 GPU for SD1.5 and approximately one hour on four GPUs for SDXL (see the training-loop sketch after this list).
- Generalizability: The method is compatible with CLIP-style encoders and can be extended to other multimodal architectures with shared embedding spaces.
- Resource Requirements: Text-to-video models incur higher computational costs, but the mitigation strategy is designed to be model-agnostic and scalable.
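As a usage example, here is a hypothetical fine-tuning loop over the three losses, reusing the mitigation_losses sketch above; the optimizer, learning rate, loss weights, and batch layout are all assumptions:

```python
import torch

# Only the target text encoder is trained; the frozen reference encoder,
# image encoder, UNet, and VAE stay fixed.
w_dl, w_pc, w_kl = 1.0, 1.0, 1.0   # hypothetical weights; tune on validation data
optimizer = torch.optim.AdamW(target_enc.parameters(), lr=1e-5)

for batch in dataloader:  # assumed to yield DialectGen pairs plus MSCOCO pairs
    l_dl, l_pc, l_kl = mitigation_losses(
        target_enc, frozen_enc, image_enc,
        batch["dialect_prompts"], batch["sae_prompts"],
        batch["polysemy_prompts"], batch["coco_images"], batch["coco_captions"],
    )
    loss = w_dl * l_dl + w_pc * l_pc + w_kl * l_kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```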
Theoretical and Practical Implications
The findings demonstrate that lexical dialectal variation, rather than grammatical variation, is the primary driver of robustness failures in current multimodal generative models. The encoder-based mitigation approach provides a principled mechanism for aligning dialectal and SAE representations without sacrificing generalization or polysemy handling. This has direct implications for equitable model deployment in linguistically diverse settings and highlights the need for dialect-aware pretraining and evaluation protocols.
Future Directions
- Cultural and Representational Biases: Systematic study of representational shifts (e.g., skin tone, environment) induced by dialectal prompts.
- Grammatical and Joint Variation: Extension of evaluation to grammatical and multi-lexeme dialectal variations.
- Downstream Impact: Assessment of dialectal robustness gaps on downstream tasks and societal outcomes.
- Text-to-Video Mitigation: Application and scaling of the mitigation strategy to high-resource text-to-video models.
Conclusion
DialectGen establishes a rigorous benchmark and mitigation framework for dialect robustness in multimodal generative models. The encoder-based mitigation strategy achieves strong dialectal robustness without compromising SAE or polysemy performance, setting a new standard for equitable and inclusive multimodal generation. The work underscores the necessity of dialect-aware model development and provides a reproducible foundation for future research in this domain.