
WenetSpeech-Wu-Bench: Wu Dialect Speech Benchmark

Updated 23 January 2026
  • WenetSpeech-Wu-Bench is a standardized benchmark that provides unified evaluation protocols and baseline models for multiple Wu dialect tasks including ASR, AST, TTS, speaker and emotion recognition.
  • It offers detailed test sets and balanced training subsets for six key tasks, addressing challenges such as tone sandhi, code-mixing, and dialectal variations.
  • Baseline models, including advanced multi-task learners, outperform general-purpose and commercial systems, underscoring the value of dialect-specific data in low-resource settings.

WenetSpeech-Wu-Bench is the first standardized, publicly accessible benchmark for systematic evaluation of Chinese Wu dialect speech processing. It provides unified protocols, multidimensional test sets, and baseline models for Automatic Speech Recognition (ASR), Wu-to-Mandarin Automatic Speech Translation (AST), speaker attribute prediction (gender, age), speech emotion recognition, text-to-speech (TTS) synthesis, and controllable TTS (instruct TTS). Developed in conjunction with the WenetSpeech-Wu corpus—the first large-scale, multi-annotated Wu dialect dataset—WenetSpeech-Wu-Bench enables rigorous evaluation and empirical comparison across dialect-specific tasks, addressing persistent challenges in low-resource, dialectal speech technology (Wang et al., 16 Jan 2026).

1. Benchmark Design and Task Coverage

WenetSpeech-Wu-Bench comprises six principal tasks, each chosen to reflect a core dimension of Wu dialect speech processing:

  1. Automatic Speech Recognition (ASR): Transcription of Wu-dialect speech (including Shanghainese, Suzhounese, and code-mixing with Mandarin) presents distinctive challenges due to complex tone sandhi, voiced–voiceless consonant preservation, and sub-dialectal variation.
  2. Wu-to-Mandarin Automatic Speech Translation (AST): Translation tasks necessitate handling lexical gaps, segmental and tonal mappings between dialect and standard Mandarin, and preserving semantic and fluency characteristics.
  3. Speaker Attribute Prediction: Infers gender and age group from audio. This is complicated by dialectal prosody and diverging formant statistics relative to Mandarin.
  4. Speech Emotion Recognition: Utterances are classified into five emotions—Neutral, Happy, Sad, Angry, Surprised. Emotional manifestation differs acoustically and semantically from Mandarin, compounded by scarcity of single-speaker emotional data.
  5. Text-to-Speech (TTS) Synthesis: Models must generate natural, intelligible Wu speech given text. Dialectal prosody, tone sandhi, and speaker variation require robust data and modeling strategies.
  6. Instruction-Following TTS (Instruct TTS): Targets user-controllable synthesis, with explicit manipulation of prosody (pitch, rate) and emotion. Fine-grained control remains challenging in dialectal contexts.

2. Corpus Statistics, Annotation Schema, and Task Subsets

WenetSpeech-Wu-Bench is founded on WenetSpeech-Wu, an 8,000-hour corpus encompassing eight major sub-dialects plus “Unknown” samples, each utterance annotated with:

  • Wu transcription (ROVER-fused, with confidence)
  • Lexicon-based and LLM-refined Mandarin translation
  • Domain label (e.g., News, Vlog, Podcast)
  • Sub-dialect label (Shanghai, Suzhou, etc.)
  • Speaker attributes (gender/age via VoxProfile)
  • Emotion label (SenseVoice + Emo2Vec/LLMs)
  • Audio quality scores (DNSMOS, SNR, MOS)
  • Prosodic descriptors (speech rate, loudness, energy, pitch statistics)
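Taken together, one annotated utterance can be pictured as a record like the following. This is a hypothetical sketch: the corpus's actual field names, types, and serialization format may differ.

```python
from dataclasses import dataclass

# Hypothetical sketch of one utterance-level record carrying the annotation
# layers listed above; the corpus's actual field names and types may differ.
@dataclass
class WuUtterance:
    audio_path: str
    wu_text: str          # ROVER-fused Wu transcription
    text_conf: float      # transcription confidence in [0, 1]
    mandarin_text: str    # lexicon-based, LLM-refined translation
    domain: str           # e.g. "News", "Vlog", "Podcast"
    sub_dialect: str      # e.g. "Shanghai", "Suzhou", or "Unknown"
    gender: str           # speaker attribute (VoxProfile)
    age_group: str
    emotion: str          # SenseVoice + Emo2Vec/LLM label
    dnsmos: float         # audio quality score
    snr_db: float
    speech_rate: float    # prosodic descriptors
    pitch_mean_hz: float
    pitch_std_hz: float

# An illustrative (invented) record:
example = WuUtterance(
    audio_path="wu/utt0001.wav", wu_text="侬好", text_conf=0.91,
    mandarin_text="你好", domain="Podcast", sub_dialect="Shanghai",
    gender="female", age_group="adult", emotion="Neutral",
    dnsmos=3.2, snr_db=18.5, speech_rate=4.1,
    pitch_mean_hz=210.0, pitch_std_hz=62.0,
)
```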

Task-specific subsets are defined for training and evaluation, balancing domain, speaker, and annotation quality constraints. Table 1 summarizes the principal subsets:

| Task | Training Subset | Hours | Quality Criterion |
|---|---|---|---|
| ASR | ASR-Mid | 7,388 h | text-conf ≥ 0.60 |
| ASR | ASR-High | 795 h | text-conf ≥ 0.85 |
| AST (Wu→Mandarin) | AST | 795 h | high-quality paired translations |
| Speaker Attr. | Speaker Attr. | 2,986 h | single-speaker, high quality |
| Emotion Rec. | Emotion Rec. | 500 h | SNR > 10 dB, pitch std > 50 Hz |
| TTS | TTS-Mid | 7,388 h | text-conf ≥ 0.60 |
| TTS | TTS-High | 1,500 h | single-speaker, high MOS and SNR |
| Instruct TTS | Inst-Prosody | 679 h | pitch > 30 Hz, text-conf > 0.7 |
| Instruct TTS | Inst-Emotion | 161 h | emotion labels present |

Test sets for each benchmark task are strictly balanced and scenario-diverse: ASR comprises 4,851 utterances (9.75 h); AST, 3,000 utterances (4.4 h); gender, 3,000 utterances; age, 1,500; emotion, 1,000; TTS, 144 “easy” and 98 “hard” sentences. The instruct TTS test subsets are smaller but cover controlled prosody and emotion conditions.

3. Evaluation Protocols, Metrics, and Formulas

All benchmarks employ standard protocols: audio segmentation via WebRTC VAD, filtering by DNSMOS ≥ 2.0 and SNR ≥ 10 dB, transcript/translation fusion by ROVER and LLMs, and speaker diarization (Pyannote) for single-speaker selection.
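The quality-filtering stage can be sketched as follows, using the thresholds stated above (DNSMOS ≥ 2.0, SNR ≥ 10 dB). The segment dictionary layout is illustrative; the upstream steps (WebRTC VAD segmentation, ROVER/LLM fusion, Pyannote diarization) are assumed to have already attached their outputs to each segment.

```python
# Minimal sketch of the quality-filtering stage described above.
# Thresholds from the text: DNSMOS >= 2.0, SNR >= 10 dB.
# The dict layout is illustrative, not the benchmark's actual schema.

def passes_quality_filter(segment: dict) -> bool:
    """Keep a VAD segment only if it meets the benchmark's quality floor."""
    return segment["dnsmos"] >= 2.0 and segment["snr_db"] >= 10.0

segments = [
    {"id": "utt001", "dnsmos": 2.8, "snr_db": 14.2},
    {"id": "utt002", "dnsmos": 1.6, "snr_db": 22.0},  # dropped: DNSMOS too low
    {"id": "utt003", "dnsmos": 3.1, "snr_db": 7.5},   # dropped: SNR too low
]
kept = [s["id"] for s in segments if passes_quality_filter(s)]
print(kept)  # ['utt001']
```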

Key evaluation metrics are as follows:

  • ASR: Character Error Rate (CER)

$$\text{CER} = \frac{S + D + I}{N}$$

where $S$, $D$, and $I$ denote substitutions, deletions, and insertions, and $N$ is the reference character count.
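The CER computation amounts to a character-level Levenshtein edit distance, whose minimum jointly counts the substitutions, deletions, and insertions in the formula above. A minimal stdlib sketch (the function name and sample strings are illustrative):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: (S + D + I) / N via Levenshtein distance.

    The edit distance jointly minimizes substitutions, deletions, and
    insertions; N is the number of reference characters.
    """
    ref, hyp = list(reference), list(hypothesis)
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            )
        prev = curr
    return prev[-1] / len(ref)

print(cer("上海言话", "上海闲话"))  # one substitution over 4 chars -> 0.25
```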

  • AST: BLEU Score

$$\text{BLEU} = \text{BP} \cdot \exp\left(\sum_n w_n \log p_n\right)$$

where $p_n$ is the modified n-gram precision, $w_n$ are uniform weights, and $\text{BP}$ is the brevity penalty.
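A self-contained sketch of this formula follows, under simplifying assumptions: single reference, uniform weights up to 4-grams, and no smoothing (so any zero n-gram precision zeroes the score). Production evaluation would use a toolkit such as sacreBLEU.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Sentence-level BLEU = BP * exp(sum_n w_n log p_n), uniform w_n = 1/max_n.

    Simplified sketch: one reference, no smoothing.
    """
    ref, hyp = reference.split(), hypothesis.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        total = max(sum(hyp_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # real toolkits smooth instead of zeroing
    # Brevity penalty: 1 if the hypothesis is at least reference length
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(math.log(p) / max_n for p in precisions))

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```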

  • Classification (gender, age, emotion): Accuracy

$$\text{Accuracy} = \frac{\text{Correct predictions}}{\text{Total samples}}$$

  • TTS: Intelligibility via ASR CER; speaker similarity (SIM, cosine similarity between speaker embeddings); and MOS variants for intelligibility (IMOS), speaker similarity (SMOS), and accent naturalness (AMOS), all on a 1–5 scale.
  • Instruct TTS: Prosody Control Accuracy, Emotional Control Accuracy, and subjective PMOS/EMOS (prosody/emotion MOS).
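The speaker-similarity (SIM) metric above is a cosine similarity over speaker embeddings. A minimal sketch, where the toy vectors stand in for the outputs of a real speaker encoder (e.g. an x-vector/ECAPA-style model):

```python
import math

def cosine_sim(a, b):
    """Speaker similarity (SIM): cosine of two speaker-embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dim embeddings; real speaker embeddings are hundreds of dimensions.
ref_emb = [0.2, 0.8, 0.1]     # embedding of the reference speaker's audio
syn_emb = [0.25, 0.75, 0.05]  # embedding of the synthesized audio
print(round(cosine_sim(ref_emb, syn_emb), 3))  # close to 1.0 => same speaker
```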

4. Released Baselines and Model Architectures

WenetSpeech-Wu-Bench includes open-source baselines for every task, yielding reproducible reference points:

  • ASR:
    • Conformer-U2pp-Wu (123M params): WS-Wu-Bench ASR test CER 15.14%
    • Whisper-Medium-Wu (769M): CER 14.33%
    • Step-Audio2-Wu-ASR (7B, LoRA-finetuned): CER 12.85% (best)
    • Commercial baselines (Qwen3-ASR, Tencent-Cloud-ASR): ~29% CER; open-source Paraformer ~64%
  • Unified Speech Understanding:
    • Step-Audio2-Wu-Und (7B, multi-task): ASR CER 13.23%, AST BLEU 53.13, Gender 95.6%, Age 72.9%, Emotion 71.2%
    • Qwen3-Omni (for comparison): 44.3% CER, 33.3 BLEU, 97.7% gender, 54.1% age, 66.7% emotion
  • TTS:
    • CosyVoice2 base (500M), with variants: CPT on TTS-Mid, SFT on TTS-High, SS SFT (single-speaker). “Easy” test: CER 5.42%, IMOS 4.37, AMOS 4.21 (SS SFT); “Hard” test: CER 15.45%, AMOS 3.88
  • Instruct TTS:
    • CosyVoice2-Wu-instruct: prosody pitch accuracy 74%, rate 82%, PMOS 3.68; emotion control 85% accuracy, EMOS 3.83

All data, code, and checkpoints are available under Apache 2.0 via https://github.com/ASLP-lab/WenetSpeech-Wu-Repo (Wang et al., 16 Jan 2026).

5. Comparison to Prior Mandarin Benchmarks and Foundational Datasets

WenetSpeech-Wu-Bench builds upon the methodology and multi-domain coverage of WenetSpeech (Zhang et al., 2021), extending the paradigm:

  • WenetSpeech offers 22,435 hours of Mandarin speech: 10,005 h strongly labeled, 2,478 h weakly labeled, and 9,952 h unlabeled (OCR- and ASR-based candidate extraction, CTC-based label error detection, multi-domain test sets).
  • Mandarin baseline error rates range from 8.88–9.70% MER on Dev to as high as 15.9% on the test sets (WeNet/ESPnet), with staged performance improvements as training data scales.
  • Wu dialect introduces additional annotation layers (emotion, sub-dialect, Mandarin translation) and systemic challenges (tone sandhi, code-mixing, field imbalance).
  • The Wu benchmark's baselines markedly outperform both commercial and open-source non-dialectal models, demonstrating the necessity of targeted dialectal corpora and benchmarks.

6. Empirical Insights and Limitations

Key findings from WenetSpeech-Wu-Bench include:

  • Dialect-specific training data is essential for competitive performance; Wu models consistently outperform general-purpose and commercial baselines.
  • Compact architectures (Conformer-U2pp-Wu, 123M params) trained on dialect-specific data surpass far larger general-purpose ASR systems (e.g., Qwen3-ASR, Paraformer) in Wu recognition.
  • Unified multi-task learning (Step-Audio2-Wu-Und) improves cross-task generalization, especially in AST and paralinguistic attribute prediction (gender, age, emotion).
  • Staged TTS strategies (CPT, SFT, single-speaker fine-tuning) greatly enhance intelligibility and accent naturalness on synthesis.
  • Instruct TTS approaches—guided by prosodic and emotional constraints—validate the feasibility of controlled speech synthesis under limited Wu data conditions.
  • The corpus coverage remains unbalanced across sub-dialects (“Unknown” 37%), indicating opportunities for targeted annotation and data enrichment.

A plausible implication is that future research should prioritize balanced domain and sub-dialect collection to enhance coverage and generalizability.

7. Licensing, Data Access, and Ecosystem Impacts

All data, benchmark scripts, and model checkpoints for WenetSpeech-Wu-Bench are hosted at https://github.com/ASLP-lab/WenetSpeech-Wu-Repo under the Apache 2.0 license, permitting broad research and academic usage. The benchmark, in tandem with the underlying WenetSpeech-Wu corpus, establishes the first end-to-end evaluation ecosystem for Wu dialect speech research, facilitating reproducible, systematic advances in speech intelligence across low-resource dialect scenarios (Wang et al., 16 Jan 2026).
