Typhoon ASR Real-time: Efficient Thai ASR
- The paper introduces a 115M FastConformer-Transducer model that achieves near-offline accuracy in a streaming setup with a 45× reduction in computational cost.
- Typhoon ASR Real-time is distinguished by advanced text normalization and two-stage curriculum-based dialect adaptation, which together enhance transcription consistency across Thai dialects.
- Benchmark results reveal a real-time factor of 0.05 and token emission latency of ≈50 ms, demonstrating its practical efficiency for real-world, resource-constrained applications.
Typhoon ASR Real-time is a streaming Thai automatic speech recognition (ASR) system based on a 115M-parameter FastConformer-Transducer model. It is designed to deliver low-latency transcription in operational settings, matching or closely approaching the recognition accuracy of much larger offline models while reducing computational cost by 45× compared to baseline Whisper Large-v3 architectures. Through a combination of sequence-to-sequence deep learning, advanced text normalization, and curriculum-based dialect adaptation, Typhoon ASR Real-time establishes a new standard for efficient, robust, and reproducible open-source Thai ASR (Sirichotedumrong et al., 19 Jan 2026).
1. Model Architecture and Training Regime
Typhoon ASR Real-time is architected for streaming inference using a FastConformer-Transducer backbone. The input is 80-dimensional log-Mel filterbank features (25 ms window, 10 ms shift), which are subsampled by an 8× depthwise-convolutional front-end, implemented as four Conv2D layers (kernel 3×3, stride 2, 256 channels). The encoder stacks 12 Conformer blocks comprising Macaron-style feed-forward modules (hidden size 2048), 8-head multi-head self-attention (model dimension 512), depthwise convolution (kernel size 31), and layer normalization. Relative position encoding is used with infinite left context and right context constrained to one chunk (≈16 frames), enabling efficient chunkwise streaming.
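The timing figures above determine the model's effective streaming resolution. A minimal arithmetic sketch (function names are illustrative, not from the paper's code) shows how the 10 ms feature shift, 8× subsampling, and ≈16-frame chunk combine:

```python
# Sketch: time resolution implied by the front-end figures in the text
# (25 ms window, 10 ms shift, 8x subsampling, ~16-frame right-context chunk).
# Function names and structure are illustrative assumptions.

def encoder_frame_ms(feature_shift_ms: float = 10.0, subsampling: int = 8) -> float:
    """Duration of audio covered by one encoder output frame."""
    return feature_shift_ms * subsampling

def chunk_lookahead_ms(chunk_frames: int = 16, **kw) -> float:
    """Right-context lookahead when attention sees one chunk ahead."""
    return chunk_frames * encoder_frame_ms(**kw)

print(encoder_frame_ms())    # 80.0 ms per encoder frame
print(chunk_lookahead_ms())  # 1280.0 ms of right context per chunk
```

Each encoder frame thus covers 80 ms of audio, and a one-chunk right context bounds algorithmic lookahead at roughly 1.3 s.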
The transducer head integrates a 2-layer LSTM (512 units) as a token predictor and a feed-forward joiner mapping to the vocabulary. The entire model contains 115 million parameters. Training uses the RNN-Transducer (RNN-T) loss:
$$\mathcal{L}_{\text{RNN-T}} = -\log P(y \mid x) = -\log \sum_{a \in \mathcal{B}^{-1}(y)} P(a \mid x),$$

where the sum enumerates all frame-level alignments $a$ that collapse to the target tokenization $y$ under the blank-removal map $\mathcal{B}$.
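The RNN-T loss can be computed exactly with a forward recursion over the alignment lattice. The following is a minimal NumPy sketch, assuming joiner log-probabilities of shape (T, U+1, V); the array layout and function name are ours, not the paper's:

```python
import numpy as np

def rnnt_nll(log_probs: np.ndarray, y: list, blank: int = 0) -> float:
    """RNN-T negative log-likelihood via the forward algorithm.

    log_probs: (T, U+1, V) joiner log-probabilities for T frames,
    U target tokens, and vocabulary V (including blank).
    """
    T, U1, _ = log_probs.shape
    U = U1 - 1
    assert len(y) == U
    alpha = np.full((T, U + 1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            cands = []
            if t > 0:  # emitted blank at (t-1, u), advancing time
                cands.append(alpha[t - 1, u] + log_probs[t - 1, u, blank])
            if u > 0:  # emitted label y[u-1] at (t, u-1), advancing target
                cands.append(alpha[t, u - 1] + log_probs[t, u - 1, y[u - 1]])
            alpha[t, u] = np.logaddexp.reduce(cands)
    # terminal blank closes the alignment at (T-1, U)
    return float(-(alpha[T - 1, U] + log_probs[T - 1, U, blank]))
```

For T = U = 1 there is exactly one alignment (label, then final blank), so the loss reduces to the negative sum of those two log-probabilities, which makes the recursion easy to verify by hand.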
Optimization uses AdamW (weight decay 0.01) with a cosine-annealed learning-rate schedule (5k warmup steps), batch size 128 after gradient accumulation, and dropout 0.1 on the feed-forward and attention modules. The model is initialized from an English FastConformer-Transducer (Large) checkpoint, then fine-tuned for one epoch over 11,000 h of general Thai data.
2. Latency, Real-Time Factor, and Hardware Efficiency
A defining feature of Typhoon ASR Real-time is its operational streaming efficiency, characterized by a low Real-Time Factor (RTF) and minimal inference latency. RTF is defined as:

$$\mathrm{RTF} = \frac{T_{\text{process}}}{T_{\text{audio}}},$$

the ratio of wall-clock processing time to audio duration, so RTF < 1 indicates faster-than-real-time operation.
Benchmarking against Pathumma-Whisper Large-v3 (1.55B parameters, 900 GFLOPs), Typhoon ASR Real-time achieves a 45× reduction in floating-point operations (20 GFLOPs per 30 s audio) and 13× fewer parameters. End-to-end RTF is 0.05 (versus 2.1), with average token emission latency ≈50 ms and peak GPU memory usage ≈2.3 GB (NVIDIA A100). This enables real-time transcription on single-GPU servers with consistent low-latency token output.
| Model | Params | FLOPs | RTF | Latency |
|---|---|---|---|---|
| Pathumma-Whisper Large-v3 | 1.55B | 900G | 2.1 | ≈1.5 s |
| Typhoon ASR Real-time | 115M | 20G | 0.05 | ≈50 ms |
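The reduction factors quoted in the text follow directly from the table. A quick arithmetic check (figures copied from the rows above; variable names are ours):

```python
# Sanity-check the efficiency ratios quoted in the text against the table.
baseline = {"params": 1.55e9, "flops": 900e9, "rtf": 2.1}   # Pathumma-Whisper Large-v3
typhoon  = {"params": 115e6,  "flops": 20e9,  "rtf": 0.05}  # Typhoon ASR Real-time

flop_reduction  = baseline["flops"] / typhoon["flops"]    # 45.0x
param_reduction = baseline["params"] / typhoon["params"]  # ~13.5x
rtf_speedup     = baseline["rtf"] / typhoon["rtf"]        # 42.0x

print(f"{flop_reduction:.0f}x FLOPs, {param_reduction:.1f}x params, {rtf_speedup:.0f}x RTF")
```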
3. Text Normalization and Consistency
A comprehensive rule-based text normalization pipeline ensures that evaluation is linguistically consistent and suitable for rigorous downstream use. The system incorporates modules for context-dependent number verbalization (digit-to-word and composition rules), disambiguation of the Thai repetition marker (mai yamok, ๆ), treatment of hyphens/dashes (“ถึง”, “ลบ”, “คิด”), and a finite-state lexicon for common English loanwords. The normalization pipeline, as formalized by Nathalang et al. (2025), maps both training and test transcripts to canonical forms. For instance, the numeric string “10150” is normalized to “หนึ่ง ศูนย์ หนึ่ง ห้า ศูนย์”.
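The digit-spelling case from the “10150” example can be sketched as a simple lookup. This covers only one rule of the pipeline; the real system is context-dependent, and this mapping is a generic reimplementation, not the released code:

```python
# Illustrative digit-by-digit Thai number verbalization, mirroring the
# "10150" example in the text. The actual pipeline is rule-based and
# context-dependent; this sketch handles only the digit-spelling case.
THAI_DIGITS = {
    "0": "ศูนย์", "1": "หนึ่ง", "2": "สอง", "3": "สาม", "4": "สี่",
    "5": "ห้า", "6": "หก", "7": "เจ็ด", "8": "แปด", "9": "เก้า",
}

def verbalize_digits(s: str) -> str:
    """Spell a digit string out digit by digit (e.g. postal codes, IDs)."""
    return " ".join(THAI_DIGITS[ch] for ch in s)

print(verbalize_digits("10150"))  # หนึ่ง ศูนย์ หนึ่ง ห้า ศูนย์
```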
This normalization reduces systemic ambiguity in transcription targets, which is reflected in tighter correspondence between model references and evaluation protocols, providing fair comparison with prior models under canonical scoring.
4. Dialect Adaptation via Two-Stage Curriculum Learning
To address intra-language variation, Typhoon ASR Real-time introduces a two-stage curriculum adaptation from Central Thai to the Isan (north-eastern) dialect. The curriculum leverages 303 hours of mixed-domain data, with batches sampled according to fixed dialect mixture weights.
The first stage fine-tunes all parameters over 10 epochs for acoustic domain matching. The second stage freezes the encoder, refining only the predictor and joint network over 15 epochs to specialize linguistic representations. This staging mitigates catastrophic forgetting and preserves Central Thai accuracy during Isan dialect optimization.
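The staging logic can be sketched as follows. The parameter-group names and the 50/50 mixture weights are illustrative assumptions; the paper's exact sampling weights are not reproduced here:

```python
import random

# Sketch of the two-stage curriculum described above. Group names and the
# 50/50 mixture weights are illustrative assumptions, not the paper's values.
PARAM_GROUPS = ("encoder", "predictor", "joiner")

def trainable_groups(stage: int) -> list:
    """Stage 1 adapts all parameters; stage 2 freezes the encoder."""
    if stage == 1:
        return list(PARAM_GROUPS)
    return [g for g in PARAM_GROUPS if g != "encoder"]

def sample_batch_domain(rng, weights=(("central", 0.5), ("isan", 0.5))) -> str:
    """Draw the dialect domain of the next training batch from mixture weights."""
    domains = [d for d, _ in weights]
    w = [v for _, v in weights]
    return rng.choices(domains, weights=w, k=1)[0]

rng = random.Random(0)
print(trainable_groups(2))       # ['predictor', 'joiner']
print(sample_batch_domain(rng))  # 'central' or 'isan'
```

Freezing the encoder in stage 2 keeps the acoustic representation shared across dialects while the prediction and joint networks specialize, which is what limits forgetting of Central Thai.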
5. Typhoon ASR Benchmark and Evaluation Protocols
Typhoon ASR Real-time is evaluated on the Typhoon ASR Benchmark, comprising two test tracks with human-labeled, normalized references:
- Standard Track (Gigaspeech2-Typhoon): 1.01 hours, 1,000 clean utterances.
- Robustness Track (TVSpeech): 3.75 hours, 570 video speech utterances with lexical and acoustic variability.
All transcripts comply with the strict normalization pipeline, and the associated evaluation toolkit computes character error rate (CER) under these conventions. Reproducibility is further enhanced through the release of open-source code, model cards, and evaluation scripts.
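For reference, CER is character-level Levenshtein distance normalized by reference length. A generic sketch of the metric (this is not the released evaluation scripts):

```python
# Character error rate via Levenshtein edit distance, normalized by the
# reference length; a generic reimplementation of the standard metric.
def cer(ref: str, hyp: str) -> float:
    """Substitutions + insertions + deletions over reference characters."""
    r, h = list(ref), list(hyp)
    prev = list(range(len(h) + 1))
    for i, rc in enumerate(r, 1):
        cur = [i]
        for j, hc in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (rc != hc)))    # substitution
        prev = cur
    return prev[-1] / max(len(r), 1)

print(cer("สวัสดี", "สวัสด"))  # one deleted character out of six
```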
6. Comparative Performance and Key Results
Typhoon ASR Real-time matches or outperforms established offline Thai ASR baselines on standardized metrics at significantly lower computational burden. On the Typhoon ASR Benchmark:
| Model Type | Model | TVSpeech CER | Gigaspeech2 CER | FLEURS Orig. (Norm.) CER |
|---|---|---|---|---|
| Proprietary Foundation | Gemini 3 Pro | 10.95 | 12.50 | 11.35 (6.91) |
| Open-Source Offline | Pathumma-Whisper Large-v3 | 10.36 | 5.84 | 6.29 (7.88) |
| Streaming (Ours) | Typhoon ASR Real-time | 9.99 | 6.81 | 13.87 (9.68) |
| Streaming (Ours) (Isan) | Typhoon Isan ASR Real-time | 9.34 | 6.93 | 14.55 (10.15) |
| Offline (Ours) | Typhoon Whisper Large-v3 | 6.32 | 4.69 | 9.98 (5.69) |
On the Isan SCB 10X test set, the two-stage curriculum yields a streaming Isan model with 10.65% CER, within 0.45% of the Gemini 2.5 Pro foundation model. The strict normalization closes the reference gap with Whisper baselines, enabling valid apples-to-apples comparison. A plausible implication is that normalization and domain-adaptive curriculum are as critical as network scaling for achieving competitive streaming ASR in low-resource, morphologically rich languages.
7. Significance, Limitations, and Prospects
Typhoon ASR Real-time establishes a new operational paradigm for Thai ASR research by delivering streaming transcription with low latency, high accuracy across dialects, and open evaluation standards. This suggests clear advantages in environments that require rapid turnaround or deployment on modest hardware, such as telecommunications, live broadcast, or low-bandwidth devices. Limitations include potential underperformance on unseen dialectal or code-switched speech outside the covered curriculum, or under resource constraints beyond those evaluated.
Open-source release of benchmarks and models addresses reproducibility challenges endemic in the Thai ASR literature, providing a new basis for standardized comparison and future innovation (Sirichotedumrong et al., 19 Jan 2026).