Typhoon ASR Benchmark

Updated 26 January 2026
  • Typhoon ASR Benchmark is a comprehensive evaluation suite for Thai ASR, featuring standardized normalization and extensive dialect coverage.
  • It integrates large-scale, human-annotated datasets across Central Thai and Isan dialects to enable fair, reproducible comparisons.
  • The benchmark supports both streaming and offline models with open-source resources, accelerating low-latency ASR research.

The Typhoon ASR Benchmark is a rigorously defined evaluation suite for Thai automatic speech recognition (ASR) systems, structured to support robust, low-latency, and dialect-inclusive model development. It features comprehensive, gold-standard, human-annotated datasets with standardized normalization protocols that resolve linguistic ambiguities inherent to Thai. The benchmark underpins research on both Central Thai and Isan dialects, supporting fair, reproducible comparison of streaming and offline ASR architectures (Sirichotedumrong et al., 19 Jan 2026).

1. Dataset Architecture and Scope

The benchmark integrates multiple large-scale corpora to achieve broad domain and dialect coverage. The General-Thai training corpus comprises 10,999 hours (9.97 million utterances) drawn from sources such as GigaSpeech2, internally curated media, Common Voice 17.0, and internal TTS with enforced numeric normalization. Dialect adaptation uses a 303-hour Isan corpus combining internal and public data (e.g., SCB 10X), covering Central Thai (core) and Northeastern Isan. Corpora are drawn from thousands of unique voices, though explicit speaker statistics are undisclosed. All test sets use human-verified transcriptions that follow established Thai transcription and normalization protocols.

2. Data Splits and Evaluation Tracks

Training uses the entire General-Thai corpus for single-epoch fine-tuning of streaming models (e.g., FastConformer-Transducer), followed by two-stage curriculum-based adaptation for Isan. Validation subsets exist for learning-rate scheduling and early stopping but are not publicly released. Test partitions include:

  • Standard Track ("GigaSpeech2-Typhoon"): 1.01 hours, 1,000 utterances of read-speech.
  • Robustness Track ("TVSpeech"): 3.75 hours, 570 utterances of in-the-wild, video-sourced speech.
  • Isan Dialect Test: Held-out subset from the SCB 10X Isan corpus (size undisclosed).

All test data adhere to controlled linguistic standards for fair benchmarking.
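The three test partitions can be captured in a small manifest; the `Track` schema and the `SCB10X-Isan-test` identifier below are hypothetical, for illustration only, and do not reflect the benchmark's actual data layout.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical manifest mirroring the test partitions above; field names
# and the Isan test identifier are illustrative, not the official schema.

@dataclass(frozen=True)
class Track:
    name: str
    hours: Optional[float]      # None where the size is undisclosed
    utterances: Optional[int]
    description: str

TRACKS = [
    Track("GigaSpeech2-Typhoon", 1.01, 1000,
          "Standard Track: clean read speech"),
    Track("TVSpeech", 3.75, 570,
          "Robustness Track: in-the-wild, video-sourced speech"),
    Track("SCB10X-Isan-test", None, None,
          "Isan Dialect Test: held-out subset, size undisclosed"),
]

# Aggregate only the disclosed durations.
total_hours = sum(t.hours for t in TRACKS if t.hours is not None)
print(f"{len(TRACKS)} tracks, {total_hours:.2f} disclosed hours")
```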

3. Annotation and Normalization Protocols

Transcriptions conform to canonical Thai conventions (Nathalang et al. 2025), designed to produce unambiguous, consistent recognition targets. Core rules include:

  • Number Standardization: All numerals expanded into spoken Thai forms; mixed digit/word formats normalized consistently.
  • Mai Yamok (Repetition Markers): Context-aware verbalization, e.g., “เก่งๆ” becomes “เก่ง เก่ง.”
  • Ambiguous Symbols: Context-driven disambiguation (e.g., hyphens in numeric strings).
  • Foreign Word Transliteration: Loanwords mapped to standardized Thai script forms.
  • Punctuation and Casing: Non-verbal punctuation is removed; Latin script is either transliterated or omitted according to fixed rules.
  • Unified Pipeline: Normalization applied throughout pseudo-label consensus mechanisms and human reviews, reducing label inconsistencies during training and evaluation.

This normalization regime is integral to reproducibility and model interoperability.
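As a sketch, two of the rules above (Mai Yamok expansion and punctuation removal) might be implemented as follows. This is a simplified illustration, not the benchmark's actual pipeline, which additionally covers number expansion, transliteration, and pseudo-label consensus.

```python
import re
import string

def expand_mai_yamok(text: str) -> str:
    """Expand the repetition marker "ๆ" by duplicating the preceding word.
    Real context-aware handling is more involved; this covers the simple case."""
    # e.g. "เก่งๆ" -> "เก่ง เก่ง"
    return re.sub(r"(\S+?)\s*ๆ", r"\1 \1", text)

def strip_punctuation(text: str) -> str:
    """Remove non-verbal ASCII punctuation; Thai script is left untouched."""
    return text.translate(str.maketrans("", "", string.punctuation))

def normalize(text: str) -> str:
    """Apply the sketched rules and collapse whitespace."""
    text = expand_mai_yamok(text)
    text = strip_punctuation(text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("เก่งๆ!"))  # -> "เก่ง เก่ง"
```

Applying the same `normalize` function to both references and hypotheses is what keeps scoring consistent across the training and evaluation stages described above.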

4. Evaluation Methodology and Metrics

Three principal tasks are defined:

  • Standard Track: Clean, read-speech transcription.
  • Robustness Track: In-the-wild, multi-domain audio robustness.
  • Isan Dialect Benchmark: Dialect-specific evaluation.

The benchmark adopts Character Error Rate (CER) as the primary metric due to the lack of explicit word boundaries in Thai, calculated as:

$$\mathrm{CER} = \frac{S + D + I}{N_\mathrm{chars}}$$

where $S$, $D$, and $I$ are substitutions, deletions, and insertions, respectively, and $N_\mathrm{chars}$ is the total number of reference characters. Word Error Rate (WER) is reported for completeness but is not standard for Thai. Evaluation is run on rigorously normalized, human-verified references, ensuring consistent scoring across models and submissions.
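The metric above reduces to a character-level Levenshtein distance divided by the reference length. A minimal reference implementation (not the benchmark's official scorer) might look like:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: (S + D + I) / N_chars, computed as the
    Levenshtein edit distance over characters divided by reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution / match
        prev = curr
    return prev[-1] / len(ref)

print(cer("สวัสดี", "สวัสดา"))  # one substitution over six characters
```

Because the distance is taken over individual characters, no word segmentation is needed, which is exactly why CER is preferred over WER for Thai.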

5. Baseline Results and Comparative Performance

The benchmark collates baseline CERs across proprietary and open-source systems, including streaming and offline backends:

All values are CER (%); lower is better.

| Model Type | Model | TVSpeech | GigaSpeech2-Typhoon | FLEURS Orig. (Norm.) |
|---|---|---|---|---|
| Proprietary | Gemini 3 Pro | 10.95 | 12.50 | 11.35 (6.91) |
| Open-Source Offline | Biodatlab Whisper Large | 18.96 | 13.22 | 16.50 (15.26) |
| Open-Source Offline | Biodatlab Distil-Whisper Large | 13.82 | 8.24 | 6.77 (8.63) |
| Open-Source Offline | Pathumma-Whisper Large-v3 | 10.36 | 5.84 | 6.29 (7.88) |
| Ours (Streaming) | Typhoon ASR Realtime | 9.99 | 6.81 | 13.87 (9.68) |
| Ours (Streaming) | Typhoon Isan ASR Realtime | 9.34 | 6.93 | 14.55 (10.15) |
| Ours (Offline) | Typhoon Whisper Turbo | 6.85 | 4.79 | 10.52 (7.08) |
| Ours (Offline) | Typhoon Whisper Large-v3 | 6.32 | 4.69 | 9.98 (5.69) |

For Isan dialect:

| Model Type | Model | CER |
|---|---|---|
| Proprietary | Gemini 2.5 Pro | 10.20 |
| Offline | Whisper-Medium-Dialect | 17.72 |
| Offline | SLSCU Korat Model | 70.08 |
| Offline | Typhoon-Whisper-Medium-Isan | 8.85 |
| Streaming (Stage 1) | Typhoon Isan ASR Realtime (Acoustic) | 16.22 |
| Streaming (Stage 2) | Typhoon Isan ASR Realtime (Final) | 10.65 |

Typhoon ASR Realtime, a streaming FastConformer-Transducer with 115M parameters, delivers competitive or superior accuracy relative to major offline models while reducing computational cost by ~45× compared to Whisper Large-v3. Strict normalization can inflate CER on non-normalized test sets (e.g., FLEURS) but matches phonetic accuracy when references are normalized. Two-stage Isan adaptation significantly narrows the dialectal performance gap (Sirichotedumrong et al., 19 Jan 2026).

6. Reproducibility and Open Resources

All datasets, trained models, and code for training and evaluation are openly released.

This transparency enables the community to iterate on low-latency ASR architectures, normalization strategies, and dialect adaptation approaches for tonal and morphologically complex languages, supporting rigorous, reproducible experimentation (Sirichotedumrong et al., 19 Jan 2026).

7. Impact and Research Significance

The Typhoon ASR Benchmark addresses crucial gaps in Thai ASR research—most notably the lack of standardized, normalized, and dialect-inclusive test suites for streaming and low-resource applications. Adoption of this benchmark allows for:

  • Direct comparison of streaming versus offline model performance with controlled normalization.
  • Rapid iteration on text normalization pipelines that are essential for languages without explicit word boundaries and with complex script conventions.
  • Fair, repeatable scoring across both Central Thai and Isan tasks, informing development of robust ASR for real-world deployment.
  • Acceleration of low-latency ASR system research for underrepresented Asian languages, bolstering inclusivity and technological accessibility (Sirichotedumrong et al., 19 Jan 2026).