
WenetSpeech-Wu: Wu Dialect Speech Ecosystem

Updated 23 January 2026
  • WenetSpeech-Wu is a comprehensive ecosystem with ~8,000 hours of richly annotated Wu dialect speech data covering multiple sub-dialects and paralinguistic features.
  • It offers standardized benchmarks across tasks such as ASR, Wu-to-Mandarin translation, speaker attributes, emotion recognition, TTS, and Instruct-TTS with competitive performance metrics.
  • The resource integrates robust, open-source models using advanced techniques like dynamic batching, loss fusion, and fine-tuning strategies to tackle dialect-specific challenges.

WenetSpeech-Wu is a comprehensive, open-source speech processing ecosystem for the Chinese Wu dialect, providing the first large-scale, richly annotated corpus (∼8,000 hours), unified multi-task benchmarks, and a suite of strong, publicly released models for automatic speech recognition (ASR), Wu-to-Mandarin translation, speaker attribute prediction, speech emotion recognition, text-to-speech synthesis (TTS), and instruction-following TTS (Instruct-TTS). This resource addresses the longstanding challenge posed by the lack of large-scale data, standard evaluation protocols, and accessible dialect-centric models, thereby facilitating robust research and development in Wu dialect speech intelligence (Wang et al., 16 Jan 2026).

1. Dataset Construction and Multi-dimensional Annotation

WenetSpeech-Wu comprises approximately 8,000 hours of speech, distributed over 3.86 million utterances, with an average length of 7.45 seconds and a maximum of 30 seconds. Coverage spans eight Wu sub-dialects—Shanghainese, Suzhounese, Shaoxingnese, Ningbonese, Hangzhounese, Jiaxingnese, Taizhounese, and Wenzhounese—while 37% of samples are marked “Unknown” for unassigned sub-dialect cases.

The dataset draws from diverse domains: News, Culture, Vlog, Entertainment, Education, Podcast, Commentary, Interview, Radio Drama, Music Program, and Audiobook. Speaker diversity encompasses thousands of individuals, with gender (4,135 h male, 1,331 h female) and age groups (teenagers: 372 h, youth: 1,673 h, middle-aged: 2,003 h, elderly: 1,418 h) inferred automatically.

Paralinguistic annotation covers five emotion classes, with segment distributions as follows: Neutral (5,102 h), Happy (73 h), Sad (81 h), Angry (109 h), Surprised (101 h). Annotation pipelines employ multiple models and tools:

  • Transcription: two Wu-dialect ASR models (Tele-CTC-FT, Step-Audio2-FT) plus Dolphin and TeleASR, with ROVER fusion producing the consensus output and confidence estimates.
  • Wu→Mandarin translation: lexicon mapping refined by Qwen3-8B.
  • Speaker attributes: assigned via VoxProfile and Pyannote.
  • Emotion: a cross-modal ensemble (SenseVoice, Emo2Vec, Qwen3-8B, DeepSeek-R1, Gemini-2.5-Pro) with intersectional labeling.
  • Prosody statistics (rate, loudness, energy, pitch): quantified through Dataspeech.

Audio quality filtering uses DNSMOS, SNR (predominantly 10–40 dB), and MOS (primarily 2.0–3.5).
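The ROVER-style consensus step can be illustrated with a minimal sketch: per-slot majority voting over pre-aligned hypotheses, with the vote share doubling as a confidence estimate. Real ROVER also performs the dynamic-programming alignment that produces the equal-length sequences; that step, and the example transcripts, are illustrative assumptions here.

```python
from collections import Counter

def rover_vote(aligned_hyps):
    """Majority vote per aligned slot across system outputs.

    `aligned_hyps` is a list of token sequences already aligned to equal
    length ("-" marks an insertion gap), a simplification of the
    dynamic-programming alignment real ROVER performs. Returns the fused
    transcript and a per-slot confidence (vote share of the winner).
    """
    fused, confs = [], []
    for slot in zip(*aligned_hyps):
        token, count = Counter(slot).most_common(1)[0]
        confs.append(count / len(slot))
        if token != "-":  # drop gap symbols from the consensus output
            fused.append(token)
    return fused, confs

hyps = [["侬", "好", "-"],
        ["侬", "好", "呀"],
        ["农", "好", "呀"]]
fused, confs = rover_vote(hyps)
print(fused)  # ['侬', '好', '呀']
```

Low per-slot confidence is exactly the signal the tiered splits later threshold on (e.g. conf > 0.60 vs. conf > 0.85).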

Quality control comprises tiered splits optimized for specific tasks (Table 1 from (Wang et al., 16 Jan 2026)), e.g., ASR-Mid (7,388 h, conf > 0.60), ASR-High (795 h, conf > 0.85), TTS-High (1,500 h), emotion set (500 h single-speaker, SNR>10 dB, expressive samples only). Pre-processing incorporates metadata filtering (removal of non-Wu content), WebRTC VAD, DNSMOS & SNR checks, multi-stage paralinguistic filtering, ROVER-based transcript fusion, and task-targeted selection.
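The tiered selection reduces to per-utterance threshold checks. The sketch below mirrors the thresholds quoted above (conf > 0.60 for ASR-Mid, conf > 0.85 for ASR-High, SNR > 10 dB and expressive, single-speaker audio for the emotion set); the dict field names are illustrative, not the corpus's actual schema.

```python
def assign_tiers(utt):
    """Assign an utterance to quality tiers using the thresholds quoted
    in the text. Field names (conf, snr_db, emotion, single_speaker) are
    hypothetical placeholders for the corpus metadata."""
    tiers = []
    if utt["conf"] > 0.60:
        tiers.append("ASR-Mid")
    if utt["conf"] > 0.85:
        tiers.append("ASR-High")
    # Emotion set: single-speaker, SNR > 10 dB, expressive (non-Neutral) only
    if utt["snr_db"] > 10 and utt["emotion"] != "Neutral" and utt["single_speaker"]:
        tiers.append("Emotion")
    return tiers

print(assign_tiers({"conf": 0.90, "snr_db": 22.0,
                    "emotion": "Happy", "single_speaker": True}))
```

Note that ASR-High is a subset of ASR-Mid by construction, matching the nested hour counts in the text (795 h within 7,388 h).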

2. Benchmark Suite: WenetSpeech-Wu-Bench

WenetSpeech-Wu-Bench is the first standardized, open Wu dialect benchmarking suite, spanning six major tasks, each with rigorously curated test sets:

| Task | Test Set Size | Primary Metrics |
| --- | --- | --- |
| Automatic Speech Recognition | 4,851 utterances, 9.75 h | Character Error Rate (CER) |
| Wu→Mandarin Speech Translation | 3,000 utterances, 4.4 h | BLEU (n-gram precision, brevity penalty) |
| Speaker Attribute Prediction | Gender: 3,000 / Age: 1,500 | Per-category & overall accuracy |
| Speech Emotion Recognition | 1,000 samples, 1.41 h | Per-emotion & overall accuracy |
| Text-to-Speech (TTS) | 242 sentences, 12 speakers | CER, speaker similarity, MOS (IMOS, SMOS, AMOS) |
| Instruct-TTS | Prosody: 100; Emotion: 500 | Control accuracy, PMOS, EMOS |

ASR evaluation employs the Character Error Rate, $\mathrm{CER} = \frac{S + D + I}{N}$, where $S$, $D$, and $I$ count substitutions, deletions, and insertions, and $N$ is the reference character count. AST leverages BLEU: $\mathrm{BLEU} = \exp\left(\sum_{n=1}^{4} w_n \log p_n\right) \times \mathrm{BP}$, with $p_n$ the modified $n$-gram precisions, $w_n$ their weights, and $\mathrm{BP}$ the brevity penalty.
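The CER can be computed directly from the Levenshtein alignment between reference and hypothesis characters, for instance:

```python
def cer(ref, hyp):
    """Character error rate: (S + D + I) / N, where the edit operations
    are counted by the Levenshtein distance between the two strings."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[n][m] / n

print(cer("上海闲话", "上海言话"))  # one substitution over 4 chars -> 0.25
```

Character-level scoring is the natural choice here because Chinese text has no whitespace word boundaries, and Wu orthography is not standardized at the word level.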

Speaker attribute and emotion benchmarks report per-category and aggregate classification accuracies. TTS incorporates both objective (ASR-CER, speaker similarity) and subjective metrics (IMOS, SMOS, AMOS). Instruct-TTS measures controllability through instruction-matched proportions, PMOS, EMOS, and emotion classification accuracy.
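The BLEU score used for the AST benchmark can be sketched at sentence level with uniform weights and the standard brevity penalty. Add-one smoothing is an assumption here (the paper's exact smoothing scheme is not stated); it keeps single-sentence scores finite when a higher-order n-gram has no overlap.

```python
import math
from collections import Counter

def bleu(ref, hyp, max_n=4):
    """Sentence-level BLEU = exp(sum_n w_n log p_n) * BP with uniform
    weights w_n = 1/max_n. `ref` and `hyp` are token lists (characters
    work for Chinese). Add-one smoothing is an illustrative assumption."""
    log_p = 0.0
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        # clipped n-gram overlap: each hyp n-gram counts at most as often
        # as it appears in the reference
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = sum(hyp_ngrams.values())
        p_n = (overlap + 1) / (total + 1)  # add-one smoothing
        log_p += math.log(p_n) / max_n     # uniform weights
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return math.exp(log_p) * bp
```

Corpus-level BLEU (as typically reported) aggregates the clipped counts over all sentences before taking the ratios, rather than averaging sentence scores.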

3. Model Architectures and Training Protocols

The WenetSpeech-Wu ecosystem includes a series of open-source models, each optimized for its corresponding benchmark task:

  • ASR Models:
    • Conformer-U2pp-Wu (123 M parameters): Utilizes a 12-block Conformer encoder and 6-layer Transformer decoder with a 512-dim embedding. Training optimizes joint CTC and cross-entropy losses with dynamic batching (≤60k frames), using a 1e-3 learning rate and WarmupLR.
    • Whisper-Medium-Wu (769 M): Fine-tuned on Wu data from Whisper-Medium, with dynamic batches (≤24k frames), 8e-5 learning rate, WarmupLR, and grad-acc=4.
    • Step-Audio2-Wu-ASR (7 B): LoRA-based fine-tuning of Step-Audio2-mini; batch=8, lr=1e-5, WarmupLR, grad-acc=8. Training utilizes NVIDIA A100 GPUs and the WeNet/MS-Swift frameworks.
  • Unified Speech Understanding:

Step-Audio2-Wu-Und performs ASR, AST, gender, age, and emotion tasks using a shared Transformer encoder paired with task-specific decoders. Training proceeds from pretraining on ASR-Mid and high-quality paralinguistic sets to fine-tuning on ASR-High and the remaining tasks, with the combined loss $\mathcal{L} = \sum_{t} \lambda_t \mathcal{L}_t$ (cross-entropy, CTC, or both as appropriate).
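The weighted multi-task objective is plain scalar arithmetic over per-task losses. In the sketch below the task names follow the text, but the weight values are illustrative, not the paper's.

```python
def multitask_loss(task_losses, weights):
    """Combined objective L = sum_t lambda_t * L_t over active tasks.

    `task_losses` maps task name -> scalar loss for the current batch
    (cross-entropy, CTC, or their sum, as appropriate per task);
    `weights` maps task name -> lambda_t. Weight values are illustrative
    assumptions, not published hyperparameters.
    """
    return sum(weights[t] * loss for t, loss in task_losses.items())

losses = {"asr": 1.8, "ast": 2.1, "gender": 0.3, "age": 0.9, "emotion": 1.2}
weights = {"asr": 1.0, "ast": 1.0, "gender": 0.5, "age": 0.5, "emotion": 0.5}
total = multitask_loss(losses, weights)  # = 1.8 + 2.1 + 0.5*(0.3+0.9+1.2)
```

In practice the same scalar sum is formed over framework tensors so a single backward pass updates the shared encoder from every task head at once.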

  • TTS and Instruct-TTS (CosyVoice2 variants):
    • CosyVoice2: A 500 M parameter Transformer-based architecture (text encoder, mel-decoder, duration predictor).
    • CPT: Continued pre-training on TTS-Mid for 10 epochs.
    • SFT: Fine-tuned on TTS-High for 3 epochs.
    • SS-SFT: Single-speaker fine-tuning for 10 hours.
    • Instruct-TTS: Additional fine-tuning on Inst-Pro/Inst-Emo sets with explicit instruction tokens for prosodic or emotional control.
    • Loss function: $L_{\mathrm{TTS}} = \|\mathrm{Mel}_{\mathrm{pred}} - \mathrm{Mel}_{\mathrm{true}}\|_1 + \alpha\, L_{\mathrm{dur}}$. Hyperparameters are task-specific and adopt dynamic batching and constant or warmup-based learning rate schedules.
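The TTS objective combines an L1 mel-reconstruction term with a weighted duration term. A minimal numeric sketch, treating the mel-spectrograms as flat value lists and assuming a value for α (the paper does not state one):

```python
def tts_loss(mel_pred, mel_true, dur_loss, alpha=0.1):
    """L_TTS = ||Mel_pred - Mel_true||_1 + alpha * L_dur.

    `mel_pred` / `mel_true` are flattened mel-spectrogram values;
    `dur_loss` is the duration predictor's scalar loss. The default
    alpha is an illustrative assumption.
    """
    l1 = sum(abs(p - t) for p, t in zip(mel_pred, mel_true))
    return l1 + alpha * dur_loss

print(tts_loss([1.0, 2.0], [1.5, 1.0], dur_loss=2.0, alpha=0.5))  # 2.5
```

The L1 norm (rather than L2) is a common choice for mel regression because it is less sensitive to outlier frames and tends to produce less over-smoothed spectra.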

4. Empirical Performance and Analysis

The released models demonstrate competitive results across all WenetSpeech-Wu-Bench tasks:

  • ASR:

Step-Audio2-Wu-ASR achieves the top open-source performance (CER 12.85%), outperforming Whisper-Medium-Wu (14.33%) and Conformer-U2pp-Wu (15.14%). Baselines such as Qwen3-ASR, Tencent-Cloud, and Qwen3-Omni are less competitive (CERs of 29.31%, 29.48%, 44.27%, respectively). In controlled settings, Step-Audio2-Wu-ASR achieves 7.86% CER (reading) and 8.68% (dialogue).

  • Speech Understanding (AST and Paralinguistics):

For Wu→Mandarin AST, Step-Audio2-Wu-Und achieves BLEU 53.13, surpassing Step-Audio2-mini (37.81) and Qwen3-Omni (33.31). Classification accuracies: gender 0.956 (Wu-Und), age 0.729, emotion 0.712. These outperform comparable baselines across all categories.

  • Text-to-Speech:

CosyVoice2-Wu-SS attains best-in-class results in TTS—easy set: CER 5.42%, IMOS 4.37, AMOS 4.21; hard set: CER 15.45%, SMOS 4.04, AMOS 3.88. The commercial Qwen3-TTS trails slightly (easy CER 5.95%, hard CER 16.45%).

  • Instruct-TTS:

Instruction-tuned CosyVoice2 shows marked improvements in prosodic control (rate/pitch: 0.82/0.74 vs. 0.26/0.24 pre-fine-tuning), PMOS 3.68 (from 2.13), and elevated emotion control accuracies (0.94/0.87/0.88/0.73 for happy/angry/sad/surprised, up from 0.87/0.83/0.84/0.67).

Error analysis identifies tonal sandhi and significant sub-dialect variation as persisting challenges for ASR, especially under low SNR and code-switched scenarios. Paralinguistic performance is sensitive to single-speaker data quality, with underrepresented emotion classes (e.g., Happy) yielding lower accuracy. Extreme prosodic instructions in Instruct-TTS remain problematic, suggesting increased model capacity as a future direction.

5. Ecosystem Structure, Standardization, and Recommendations

WenetSpeech-Wu’s ecosystem enables a virtuous cycle: the open data resource supports the development and evaluation of models on public benchmarks (WenetSpeech-Wu-Bench), while these models in turn drive annotation quality and facilitate community contributions. All resources utilize a unified data format (JSON) and standardized prompt templates to ensure reproducibility and facilitate pipeline integration across tasks.
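A unified per-utterance JSON record might look like the following. All field names and values here are hypothetical placeholders, since the paper's exact schema is not reproduced in this summary; the point is that one record carries transcription, translation, speaker, emotion, prosody, and quality metadata together.

```python
import json

# Hypothetical per-utterance record; actual WenetSpeech-Wu field names may differ.
record = {
    "utt_id": "wu_000001",
    "audio_path": "audio/wu_000001.wav",
    "duration": 7.45,
    "subdialect": "Shanghainese",
    "text_wu": "...",                 # Wu transcript (placeholder)
    "text_mandarin": "...",           # Mandarin translation (placeholder)
    "speaker": {"gender": "male", "age_group": "youth"},
    "emotion": "Neutral",
    "prosody": {"rate": 4.2, "pitch": 180.0},
    "quality": {"conf": 0.91, "snr_db": 22.5, "dnsmos": 3.1},
}

# One JSON object per line is a common manifest convention for speech corpora.
line = json.dumps(record, ensure_ascii=False)
parsed = json.loads(line)
print(parsed["quality"]["conf"] > 0.85)  # True: would qualify for a high-quality tier
```

Keeping every annotation dimension in a single flat record is what lets one corpus serve ASR, AST, speaker-attribute, emotion, and TTS pipelines without format conversion.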

The dataset is released under a CC BY-NC-SA license, with citation required. Recommended usage: “Mid” quality data for large-scale pre-training, “High” quality subsets for supervised fine-tuning, and specialized splits for AST, emotion, and Instruct-TTS. The annotation and model training pipeline is amenable to adaptation for other Chinese dialects (e.g., Gan, Min) through retraining of dialect identification and transcription ensemble modules.

6. Extensions and Prospects

Planned enhancements include manual refinement of critical data subsets, addressing imbalances in sub-dialect representation, focused modeling of Wu tonal sandhi, and the development of multi-dialect, multi-task LLMs. Extension of the corpus construction, annotation, and evaluation pipeline to other dialectal contexts is facilitated by modularity in the current architecture. A plausible implication is that WenetSpeech-Wu can serve as a reference blueprint for constructing inclusive dialect speech intelligence resources in other linguistically diverse contexts.

WenetSpeech-Wu thus establishes the foundational infrastructure for Wu dialect speech technology, delivering the first large-scale, multi-dimensionally annotated corpus, standardized cross-task benchmarks, and competitive open-source models for the advancement of inclusive Wu dialect speech processing (Wang et al., 16 Jan 2026).
