Typhoon ASR Real-time: Efficient Thai ASR
- The paper introduces a 115M FastConformer-Transducer model that achieves near-offline accuracy in a streaming setup with a 45× reduction in computational cost.
- Typhoon ASR Real-time is distinguished by advanced text normalization and two-stage curriculum-based dialect adaptation, which together enhance transcription consistency across Thai dialects.
- Benchmark results reveal a real-time factor of 0.05 and token emission latency of ≈50 ms, demonstrating its practical efficiency for real-world, resource-constrained applications.
Typhoon ASR Real-time is a streaming Thai automatic speech recognition (ASR) system based on a 115M-parameter FastConformer-Transducer model. It is designed to deliver low-latency transcription in operational settings, matching or closely approaching the recognition accuracy of much larger offline models while reducing computational cost by 45× compared to baseline Whisper Large-v3 architectures. Through a combination of sequence-to-sequence deep learning, advanced text normalization, and curriculum-based dialect adaptation, Typhoon ASR Real-time establishes a new standard for efficient, robust, and reproducible open-source Thai ASR (Sirichotedumrong et al., 19 Jan 2026).
1. Model Architecture and Training Regime
Typhoon ASR Real-time is architected for streaming inference using a FastConformer-Transducer backbone. The input is 80-dimensional log-Mel filterbank features (25 ms window, 10 ms shift), which are subsampled by an 8× depthwise-convolutional front-end, implemented as four Conv2D layers (kernel 3×3, stride 2, 256 channels). The encoder stacks 12 Conformer blocks comprising Macaron-style feed-forward modules (hidden size 2048), 8-head multi-head self-attention (model dimension 512), depthwise convolution (kernel size 31), and layer normalization. Relative position encoding is used with infinite left context and right context constrained to one chunk (≈16 frames), enabling efficient chunkwise streaming.
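The timing figures above determine the model's effective streaming resolution. A minimal arithmetic sketch (function names are illustrative, not from the paper's code) shows how the 10 ms feature shift, 8× subsampling, and ≈16-frame chunk combine:

```python
# Sketch: time resolution implied by the front-end figures in the text
# (25 ms window, 10 ms shift, 8x subsampling, ~16-frame right-context chunk).
# Function names and structure are illustrative assumptions.

def encoder_frame_ms(feature_shift_ms: float = 10.0, subsampling: int = 8) -> float:
    """Duration of audio covered by one encoder output frame."""
    return feature_shift_ms * subsampling

def chunk_lookahead_ms(chunk_frames: int = 16, **kw) -> float:
    """Right-context lookahead when attention sees one chunk ahead."""
    return chunk_frames * encoder_frame_ms(**kw)

print(encoder_frame_ms())    # 80.0 ms per encoder frame
print(chunk_lookahead_ms())  # 1280.0 ms of right context per chunk
```

Each encoder frame thus covers 80 ms of audio, and a one-chunk right context bounds algorithmic lookahead at roughly 1.3 s.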
The transducer head integrates a 2-layer LSTM (512 units) as a token predictor and a feed-forward joiner mapping to the vocabulary. The entire model contains 115 million parameters. Training uses the RNN-Transducer (RNN-T) loss:
$$\mathcal{L}_{\text{RNN-T}} = -\log P(y \mid x) = -\log \sum_{a \in \mathcal{B}^{-1}(y)} P(a \mid x),$$

where the sum enumerates all frame-level alignments $a$ that collapse to the target tokenization $y$ under the blank-removal map $\mathcal{B}$.
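The RNN-T loss can be computed exactly with a forward recursion over the alignment lattice. The following is a minimal NumPy sketch, assuming joiner log-probabilities of shape (T, U+1, V); the array layout and function name are ours, not the paper's:

```python
import numpy as np

def rnnt_nll(log_probs: np.ndarray, y: list, blank: int = 0) -> float:
    """RNN-T negative log-likelihood via the forward algorithm.

    log_probs: (T, U+1, V) joiner log-probabilities for T frames,
    U target tokens, and vocabulary V (including blank).
    """
    T, U1, _ = log_probs.shape
    U = U1 - 1
    assert len(y) == U
    alpha = np.full((T, U + 1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            cands = []
            if t > 0:  # emitted blank at (t-1, u), advancing time
                cands.append(alpha[t - 1, u] + log_probs[t - 1, u, blank])
            if u > 0:  # emitted label y[u-1] at (t, u-1), advancing target
                cands.append(alpha[t, u - 1] + log_probs[t, u - 1, y[u - 1]])
            alpha[t, u] = np.logaddexp.reduce(cands)
    # terminal blank closes the alignment at (T-1, U)
    return float(-(alpha[T - 1, U] + log_probs[T - 1, U, blank]))
```

For T = U = 1 there is exactly one alignment (label, then final blank), so the loss reduces to the negative sum of those two log-probabilities, which makes the recursion easy to verify by hand.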
Optimization uses AdamW (weight decay 0.01) with a cosine-annealed learning-rate schedule (5k warmup steps), batch size 128 after gradient accumulation, and dropout 0.1 on the feed-forward and attention modules. The model is initialized from an English FastConformer-Transducer (Large) checkpoint, then fine-tuned for one epoch over 11,000 h of general Thai data.
2. Latency, Real-Time Factor, and Hardware Efficiency
A defining feature of Typhoon ASR Real-time is its operational streaming efficiency, characterized by a low Real-Time Factor (RTF) and minimal inference latency. RTF is defined as:

$$\mathrm{RTF} = \frac{T_{\text{process}}}{T_{\text{audio}}},$$

the ratio of wall-clock processing time to audio duration, so RTF < 1 indicates faster-than-real-time operation.
Benchmarking against Pathumma-Whisper Large-v3 (1.55B parameters, 900 GFLOPs), Typhoon ASR Real-time achieves a 45× reduction in floating-point operations (20 GFLOPs per 30 s audio) and 13× fewer parameters. End-to-end RTF is 0.05 (versus 2.1), with average token emission latency ≈50 ms and peak GPU memory usage ≈2.3 GB (NVIDIA A100). This enables real-time transcription on single-GPU servers with consistent low-latency token output.
| Model | Params | FLOPs | RTF | Latency |
|---|---|---|---|---|
| Pathumma-Whisper Large-v3 | 1.55B | 900G | 2.1 | ≈1.5 s |
| Typhoon ASR Real-time | 115M | 20G | 0.05 | ≈50 ms |
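The reduction factors quoted in the text follow directly from the table. A quick arithmetic check (figures copied from the rows above; variable names are ours):

```python
# Sanity-check the efficiency ratios quoted in the text against the table.
baseline = {"params": 1.55e9, "flops": 900e9, "rtf": 2.1}   # Pathumma-Whisper Large-v3
typhoon  = {"params": 115e6,  "flops": 20e9,  "rtf": 0.05}  # Typhoon ASR Real-time

flop_reduction  = baseline["flops"] / typhoon["flops"]    # 45.0x
param_reduction = baseline["params"] / typhoon["params"]  # ~13.5x
rtf_speedup     = baseline["rtf"] / typhoon["rtf"]        # 42.0x

print(f"{flop_reduction:.0f}x FLOPs, {param_reduction:.1f}x params, {rtf_speedup:.0f}x RTF")
```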
3. Text Normalization and Consistency
A comprehensive rule-based text normalization pipeline ensures that evaluation is linguistically consistent and suitable for rigorous downstream use. The system incorporates modules for context-dependent number verbalization (digit-to-word and composition rules), disambiguation of the Thai repetition marker (mai yamok, ๆ), treatment of hyphens/dashes (“ถึง”, “ลบ”, “คิด”), and a finite-state lexicon for common English loanwords. The normalization pipeline, as formalized by Nathalang et al. (2025), maps both training and test transcripts to canonical forms. For instance, the numeric string “10150” is normalized to “หนึ่ง ศูนย์ หนึ่ง ห้า ศูนย์”.
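The digit-spelling case from the “10150” example can be sketched as a simple lookup. This covers only one rule of the pipeline; the real system is context-dependent, and this mapping is a generic reimplementation, not the released code:

```python
# Illustrative digit-by-digit Thai number verbalization, mirroring the
# "10150" example in the text. The actual pipeline is rule-based and
# context-dependent; this sketch handles only the digit-spelling case.
THAI_DIGITS = {
    "0": "ศูนย์", "1": "หนึ่ง", "2": "สอง", "3": "สาม", "4": "สี่",
    "5": "ห้า", "6": "หก", "7": "เจ็ด", "8": "แปด", "9": "เก้า",
}

def verbalize_digits(s: str) -> str:
    """Spell a digit string out digit by digit (e.g. postal codes, IDs)."""
    return " ".join(THAI_DIGITS[ch] for ch in s)

print(verbalize_digits("10150"))  # หนึ่ง ศูนย์ หนึ่ง ห้า ศูนย์
```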
This normalization reduces systemic ambiguity in transcription targets, which is reflected in tighter correspondence between model references and evaluation protocols, providing fair comparison with prior models under canonical scoring.
4. Dialect Adaptation via Two-Stage Curriculum Learning
To address intra-language variation, Typhoon ASR Real-time introduces a two-stage curriculum adaptation from Central Thai to the Isan (north-eastern) dialect. The curriculum leverages 303 hours of mixed-domain data, with batches sampled according to fixed dialect mixture weights.
The first stage fine-tunes all parameters over 10 epochs for acoustic domain matching. The second stage freezes the encoder, refining only the predictor and joint network over 15 epochs to specialize linguistic representations. This staging mitigates catastrophic forgetting and preserves Central Thai accuracy during Isan dialect optimization.
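The staging logic can be sketched as follows. The parameter-group names and the 50/50 mixture weights are illustrative assumptions; the paper's exact sampling weights are not reproduced here:

```python
import random

# Sketch of the two-stage curriculum described above. Group names and the
# 50/50 mixture weights are illustrative assumptions, not the paper's values.
PARAM_GROUPS = ("encoder", "predictor", "joiner")

def trainable_groups(stage: int) -> list:
    """Stage 1 adapts all parameters; stage 2 freezes the encoder."""
    if stage == 1:
        return list(PARAM_GROUPS)
    return [g for g in PARAM_GROUPS if g != "encoder"]

def sample_batch_domain(rng, weights=(("central", 0.5), ("isan", 0.5))) -> str:
    """Draw the dialect domain of the next training batch from mixture weights."""
    domains = [d for d, _ in weights]
    w = [v for _, v in weights]
    return rng.choices(domains, weights=w, k=1)[0]

rng = random.Random(0)
print(trainable_groups(2))       # ['predictor', 'joiner']
print(sample_batch_domain(rng))  # 'central' or 'isan'
```

Freezing the encoder in stage 2 keeps the acoustic representation shared across dialects while the prediction and joint networks specialize, which is what limits forgetting of Central Thai.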
5. Typhoon ASR Benchmark and Evaluation Protocols
Typhoon ASR Real-time is evaluated on the Typhoon ASR Benchmark, comprising two test tracks with human-labeled, normalized references:
- Standard Track (Gigaspeech2-Typhoon): 1.01 hours, 1,000 clean utterances.
- Robustness Track (TVSpeech): 3.75 hours, 570 video speech utterances with lexical and acoustic variability.
All transcripts comply with the strict normalization pipeline, and the associated evaluation toolkit computes character error rate (CER) under these conventions. Reproducibility is further enhanced through the release of open-source code, model cards, and evaluation scripts.
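For reference, CER is character-level Levenshtein distance normalized by reference length. A generic sketch of the metric (this is not the released evaluation scripts):

```python
# Character error rate via Levenshtein edit distance, normalized by the
# reference length; a generic reimplementation of the standard metric.
def cer(ref: str, hyp: str) -> float:
    """Substitutions + insertions + deletions over reference characters."""
    r, h = list(ref), list(hyp)
    prev = list(range(len(h) + 1))
    for i, rc in enumerate(r, 1):
        cur = [i]
        for j, hc in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (rc != hc)))    # substitution
        prev = cur
    return prev[-1] / max(len(r), 1)

print(cer("สวัสดี", "สวัสด"))  # one deleted character out of six
```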
6. Comparative Performance and Key Results
Typhoon ASR Real-time matches or outperforms established offline Thai ASR baselines on standardized metrics at significantly lower computational burden. On the Typhoon ASR Benchmark:
| Model Type | Model | TVSpeech CER | Gigaspeech2 CER | FLEURS Orig. (Norm.) CER |
|---|---|---|---|---|
| Proprietary Foundation | Gemini 3 Pro | 10.95 | 12.50 | 11.35 (6.91) |
| Open-Source Offline | Pathumma-Whisper Large-v3 | 10.36 | 5.84 | 6.29 (7.88) |
| Streaming (Ours) | Typhoon ASR Real-time | 9.99 | 6.81 | 13.87 (9.68) |
| Streaming (Ours) (Isan) | Typhoon Isan ASR Real-time | 9.34 | 6.93 | 14.55 (10.15) |
| Offline (Ours) | Typhoon Whisper Large-v3 | 6.32 | 4.69 | 9.98 (5.69) |
On the Isan SCB 10X test set, the two-stage curriculum yields a streaming Isan model with 10.65% CER, within 0.45% of the Gemini 2.5 Pro foundation model. The strict normalization closes the reference gap with Whisper baselines, enabling valid apples-to-apples comparison. A plausible implication is that normalization and domain-adaptive curriculum are as critical as network scaling for achieving competitive streaming ASR in low-resource, morphologically rich languages.
7. Significance, Limitations, and Prospects
Typhoon ASR Real-time establishes a new operational paradigm for Thai ASR research by delivering streaming transcription with low latency, high accuracy across dialects, and open evaluation standards. This suggests clear advantages in environments that require rapid turnaround or deployment on modest hardware, such as telecommunications, live broadcast, or low-bandwidth devices. Limitations include potential underperformance on unseen dialectal or code-switched speech outside the covered curriculum, or under resource constraints beyond those evaluated.
Open-source release of benchmarks and models addresses reproducibility challenges endemic in the Thai ASR literature, providing a new basis for standardized comparison and future innovation (Sirichotedumrong et al., 19 Jan 2026).