
SEAME Dataset for Mandarin-English ASR

Updated 6 January 2026
  • SEAME Dataset is a bilingual Mandarin-English conversational corpus featuring natural intra- and inter-sentential code-switching in real-world settings.
  • It comprises approximately 192 hours of recordings with 23,000 utterances that capture diverse linguistic, prosodic, and acoustic phenomena.
  • The dataset serves as a benchmark for assessing ASR systems under challenges like accent variability, conversational noise, and low-resource conditions.

The SEAME (South East Asia Mandarin-English) dataset is a large-scale corpus of spontaneous conversational speech characterized by rich intra- and inter-sentential code-switching between Mandarin and English. Developed for advancing automatic speech recognition (ASR) of code-switched speech, SEAME provides comprehensive coverage of linguistic, prosodic, and acoustic phenomena encountered in real-world bilingual environments in Singapore and Malaysia. SEAME has become a key benchmark for evaluating ASR models' robustness to language switching, diversity of speaker accents, and the challenges of informal dialogic communication (Dalmia et al., 2020, Yeo et al., 2 Jan 2026).

1. Corpus Composition and Collection Procedures

SEAME comprises high-fidelity conversational recordings, each involving one or two bilingual speakers engaged in natural discussions on topics ranging from daily routines to contemporary issues. Data were collected in both Singapore and Malaysia using close-talk microphones, sampled at 16 kHz with 16-bit resolution. Recording locations included cafés and university common areas, so the audio contains the background noise and disfluencies typical of spontaneous spoken interaction. The speaker pool consists of 156 participants (80 female, 76 male), with approximately equal representation of Singaporean and Malaysian backgrounds and an age range of 18–40 years. All participants identified as fluently bilingual in Mandarin (Standard Mandarin, SM) and English (Yeo et al., 2 Jan 2026).

The sessions elicit both intra-sentential (within the same utterance) and inter-sentential (across utterances) code-switches. Mandarin tokens are written in Chinese characters (≈2,500 unique), while English is transcribed in lower-case Roman script. The corpus contains fully Mandarin and fully English utterances as well as mixed-language utterances, which typically contain 1–3 code-switches.

2. Size, Partitioning, and Speaker Distribution

SEAME's total audio duration is approximately 192 hours, segmented into roughly 23,000 utterances. The standard partitioning, as used in the ESPnet recipe and the cited papers, is as follows:

| Subset | Duration (h) | Speakers | Utterances | Avg. switches/utt. |
|---|---|---|---|---|
| Train | 100 | 100* | 12,000 | 1.8 |
| DevMan | 5.5 | 28 | 650 | 1.9 |
| DevSGE | 6.0 | 28 | 720 | 1.7 |
| Full corpus | 192 | 156 | ~23,000 | 1.8 |

*The Train subset is drawn from the full speaker set (Yeo et al., 2 Jan 2026). The development sets DevMan and DevSGE cover phonetic, prosodic, and accentual variation, which allows generalization across speaker populations to be measured. The "test_man" (Mandarin-biased) and "test_sge" (English-biased) sets described in (Dalmia et al., 2020) serve as alternative evaluation partitions; their exact durations and utterance counts are not specified.
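As a quick sanity check on the partition figures, the mean utterance length implied by each row can be computed from its duration and utterance count. The sketch below only re-derives numbers from the table; the helper name `mean_utt_seconds` and the `subsets` dictionary are illustrative, not part of any SEAME tooling.

```python
# Duration (hours) and utterance count per subset, copied from the table.
subsets = {
    "Train": (100.0, 12000),
    "DevMan": (5.5, 650),
    "DevSGE": (6.0, 720),
    "Full corpus": (192.0, 23000),  # utterance count is approximate in the source
}


def mean_utt_seconds(hours, n_utterances):
    """Mean utterance length in seconds implied by a duration and a count."""
    return hours * 3600.0 / n_utterances


for name, (hours, n_utts) in subsets.items():
    print(f"{name}: {mean_utt_seconds(hours, n_utts):.1f} s per utterance")
```

Under these figures every subset implies a mean of roughly 30 seconds per utterance, so the rows are at least mutually consistent.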

3. Linguistic and Code-Switching Characteristics

SEAME exhibits frequent, spontaneous code-switching: the mean number of switches per utterance is 1.8. Of all utterances, 45% contain exactly one switch, 30% contain two, and 15% contain three or more (Yeo et al., 2 Jan 2026); the remaining 10% are not broken down and presumably correspond to utterances without a switch. Approximately 60–70% of tokens are Mandarin and 30–40% are English by word count, though exact ratios are not reported per partition (Yeo et al., 2 Jan 2026).

Switching includes both minor intra-clause shifts (e.g., noun phrase substitutions) and major alternations (full clause or utterance boundary changes), reflecting the broad spectrum of naturally occurring bilingual speech. The presence of filled pauses, hesitations, repetitions, and background sound provides a high degree of acoustic and lexical realism.

4. Annotation Protocols and Preprocessing

Annotation is performed at the utterance level, with manual segmentation based on natural pauses and turn-taking cues; there is no word-level time alignment. Mandarin tokens are written in Chinese characters and English tokens in Roman script, so language identity is implicit in the script used, and the original corpus carries no explicit language-ID tags. No phoneme-level annotation is included; end-to-end models operate at the grapheme/subword level (Dalmia et al., 2020, Yeo et al., 2 Jan 2026).
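Because language identity is carried by the script, per-token language tags and a switch count can be derived from a transcript alone. The following is a minimal sketch under stated assumptions: Mandarin is taken to be any character in the CJK Unified Ideographs block (U+4E00 to U+9FFF), English is any run of ASCII letters, and bracketed tags such as [uh] are dropped before counting. The function names `tag_tokens` and `count_switches` are hypothetical and do not come from the corpus tooling.

```python
import re

# Assumption: bracketed tags such as [uh] or [um] are not counted as words.
_BRACKETED = re.compile(r"\[[^\]]*\]")
# Assumption: one Mandarin token per CJK ideograph; English tokens are runs
# of ASCII letters, optionally joined by apostrophes (e.g. "don't").
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z]+(?:'[A-Za-z]+)*")


def tag_tokens(utterance):
    """Return (token, language) pairs, with language in {"zh", "en"}."""
    cleaned = _BRACKETED.sub(" ", utterance)
    tagged = []
    for tok in _TOKEN.findall(cleaned):
        lang = "zh" if "\u4e00" <= tok[0] <= "\u9fff" else "en"
        tagged.append((tok, lang))
    return tagged


def count_switches(utterance):
    """Count positions where a token's language differs from the previous one."""
    langs = [lang for _, lang in tag_tokens(utterance)]
    return sum(1 for prev, cur in zip(langs, langs[1:]) if prev != cur)
```

For example, `count_switches("我 今天 去 school 买 东西")` counts one switch into English and one back into Mandarin. Whether the published switch statistics were computed this way is not stated in the source.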

Transcription guidelines enforce word-level accuracy, and the only punctuation is a period (.) marking the end of an utterance. Disfluencies and hesitations are transcribed as [uh] or [um]. Preprocessing pipelines extract 83-dimensional features (log-Mel filterbanks with pitch) from the 16 kHz waveforms and apply global mean–variance normalization. Data augmentation includes speed perturbation (rates 0.9×, 1.0×, 1.1×) and SpecAugment ("SS policy"). Transcripts are segmented into utterances with the standard ESPnet recipe (Dalmia et al., 2020).
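Global mean–variance normalization means that a single mean and standard deviation per feature dimension are estimated over all training frames and then applied to every utterance. The sketch below illustrates the idea in plain Python on toy 2-dimensional features (the real features would be 83-dimensional); it is not the ESPnet implementation, and the names `global_cmvn_stats` and `apply_cmvn` are illustrative.

```python
import math


def global_cmvn_stats(utterances):
    """Estimate per-dimension mean and std over all frames of all utterances.

    `utterances` is a list of feature matrices, each a list of frames,
    each frame a list of floats of equal length.
    """
    dim = len(utterances[0][0])
    count = 0
    total = [0.0] * dim
    total_sq = [0.0] * dim
    for feats in utterances:
        for frame in feats:
            count += 1
            for d, x in enumerate(frame):
                total[d] += x
                total_sq[d] += x * x
    mean = [s / count for s in total]
    # Population variance, floored so constant dimensions do not divide by zero.
    std = [math.sqrt(max(sq / count - m * m, 1e-10))
           for sq, m in zip(total_sq, mean)]
    return mean, std


def apply_cmvn(feats, mean, std):
    """Normalize one utterance with the global statistics."""
    return [[(x - m) / s for x, m, s in zip(frame, mean, std)]
            for frame in feats]


# Toy example: two utterances with three frames in total.
train = [[[1.0, 10.0], [3.0, 10.0]], [[5.0, 10.0]]]
mean, std = global_cmvn_stats(train)
normalized = [apply_cmvn(u, mean, std) for u in train]
```

Computing the statistics once over the training set, rather than per utterance, keeps the normalization identical at training and test time.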

5. Evaluation Metrics and Benchmarking

SEAME uses Mixed Error Rate (MER) as its principal evaluation metric, defined identically to word error rate (WER) except that Mandarin is scored at the character level and English at the word (or subword) level. The formula applied is

$$\mathrm{MER} = \frac{S + I + D}{N} \times 100\%$$

where $S$ is the number of substitutions, $I$ the number of insertions, $D$ the number of deletions, and $N$ the total number of reference tokens (Mandarin characters plus English words or subwords) (Dalmia et al., 2020, Yeo et al., 2 Jan 2026). Scoring is typically performed using NIST sclite, without post-normalization.
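The definition can be made concrete with a small sketch that splits each transcript into mixed scoring units (one per Mandarin character, one per English word) and accumulates the edit distance over a set of utterances. This is an illustration only, not a replacement for sclite: the tokenization regex and the function names `to_units`, `edit_distance`, and `mixed_error_rate` are assumptions, and subword scoring of English is not covered.

```python
import re

# Assumption: Mandarin units are single CJK ideographs; English units are
# whole words made of ASCII letters and apostrophes.
_UNIT = re.compile(r"[\u4e00-\u9fff]|[A-Za-z']+")


def to_units(text):
    """Split a transcript into Mandarin characters and English words."""
    return _UNIT.findall(text)


def edit_distance(ref, hyp):
    """Levenshtein distance, i.e. the minimum S + I + D between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def mixed_error_rate(references, hypotheses):
    """Corpus-level MER in percent over paired reference/hypothesis strings."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_units = to_units(ref)
        errors += edit_distance(ref_units, to_units(hyp))
        total += len(ref_units)
    if total == 0:
        raise ValueError("references contain no scorable tokens")
    return 100.0 * errors / total
```

With the reference "我 想 去 school" (four units) and the hypothesis "我 想 school", one deletion gives an MER of 25%.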

Performance statistics as reported with strong baseline and augmented ASR systems on SEAME are:

| System | Training data | DevMan MER (%) | DevSGE MER (%) |
|---|---|---|---|
| Real speech only | 100 h real | 12.1 | 17.8 |
| Real + TTS-R (1:1) | 100 h real + 100 h TTS | 10.1 | 16.0 |
| Previous RNN-T baseline | ≈87 h real | 22.2* (set not specified) | |
| Proposed T-T (2020) | ≈87 h real | 18.5* (test_man) | 26.3* (test_sge) |

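*Results from (Dalmia et al., 2020), reported on that paper's own evaluation sets (test_man and test_sge where marked) rather than on DevMan and DevSGE. These results highlight the gains from synthetic data augmentation (first two rows) and from the transformer-transducer architecture (last two rows). Because the two pairs of rows differ in training data and evaluation sets, MER values should be compared within a pair, not across pairs.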
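The effect of adding TTS data can be expressed as absolute and relative MER reductions computed from the first two table rows. The short sketch below does only that arithmetic; the helper name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline, improved):
    """Relative error reduction in percent, from baseline to improved MER."""
    return 100.0 * (baseline - improved) / baseline


# MER (%) for "Real speech only" vs. "Real + TTS-R (1:1)", from the table.
results = {"DevMan": (12.1, 10.1), "DevSGE": (17.8, 16.0)}

for name, (real_only, real_plus_tts) in results.items():
    absolute = real_only - real_plus_tts
    relative = relative_reduction(real_only, real_plus_tts)
    print(f"{name}: {absolute:.1f} absolute, {relative:.1f}% relative")
# DevMan: 2.0 absolute, 16.5% relative
# DevSGE: 1.8 absolute, 10.1% relative
```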

6. Research Use Cases and Technical Challenges

SEAME presents unique challenges for ASR systems:

  • Code-switching frequency and unpredictability: Models must learn language alternation patterns not present in monolingual data.
  • Speaker and accent variability: Broad nationality, accent, and prosodic variability increases the demand on ASR robustness. DevMan and DevSGE subsets allow detailed evaluation of cross-demographic generalization.
  • Conversational noise: Utterances contain natural conversational artifacts (disfluencies, fillers, background noise, laughter).
  • Low-resource setting: Although the corpus is large for code-switching data, it remains small for data-hungry end-to-end ASR frameworks (Dalmia et al., 2020, Yeo et al., 2 Jan 2026).

Addressing these challenges, recent ASR systems employ LID-aware training augmentations, auxiliary loss functions, SpecAugment, and multi-label/multi-audio encoder conditioning, as well as synthetic data augmentation strategies such as fine-tuned multilingual TTS models (Dalmia et al., 2020, Yeo et al., 2 Jan 2026).

7. Impact, Significance, and Benchmark Status

SEAME has become the de facto benchmark for Mandarin-English code-switched ASR in the academic community, supporting rigorous cross-system comparisons due to its public availability, annotation clarity, and demographic diversity. Its use in conjunction with synthetic data augmentation techniques demonstrates ongoing development in coping with low-resource and high-variability settings. The separation into demographically distinct evaluation subsets (DevMan and DevSGE) remains critical for exposing ASR biases and failures under cross-accent and cross-nationality conditions (Yeo et al., 2 Jan 2026).

A plausible implication is that improvements observed on SEAME may carry over to other Southeast Asian English-Mandarin bilingual communities, although distinct accent, lexical, and prosodic properties may require continued data expansion and explicit demographic conditioning.
