
Montreal Forced Aligner Overview

Updated 14 January 2026
  • Montreal Forced Aligner is an open-source tool that automatically aligns speech audio with text using Kaldi’s GMM–HMM backend.
  • It generates phone- and word-level time-stamped annotations for large speech corpora, aiding corpus phonetics, speech technology, and multilingual fieldwork.
  • Its modular workflow—from pretrained models to forced alignment scripts—ensures high temporal precision and adaptable performance across diverse languages and domains.

The Montreal Forced Aligner (MFA) is an open-source system for automatic time-alignment of speech audio and text, based on Kaldi’s Gaussian Mixture Model–Hidden Markov Model (GMM–HMM) backend. MFA is designed to produce phone- and word-level time-aligned annotations for large speech corpora, facilitating scalable linguistic analysis and automation of manual segmentation tasks. It offers cross-platform distribution (Python package, bundled Kaldi binaries), pretrained acoustic models and grapheme-to-phoneme (G2P) models for numerous languages, and exposes straightforward command-line workflows for forced alignment and model training. MFA is widely used in corpus phonetics, computational linguistics, speech technology, and multilingual fieldwork.

1. Acoustic Modeling and Forced Alignment Algorithm

MFA relies on a GMM–HMM architecture, supported by Kaldi’s aligner scripts, with four major training stages: monophone GMM–HMMs; context-dependent triphone GMM–HMMs; Linear Discriminant Analysis (LDA) and Maximum Likelihood Linear Transform (MLLT) for feature normalization; and Speaker Adaptive Training (SAT) using feature-space Maximum Likelihood Linear Regression (fMLLR) (Rousso et al., 2024, Chodroff, 2018). The acoustic model parameters $\lambda = (A, B, \pi)$ define the transition ($A$), emission ($B$), and initial-state ($\pi$) probabilities of the HMM. Each emission distribution $b_j(x)$ is modeled as a mixture of $M$ Gaussians:

$$b_j(x) = \sum_{m=1}^{M} c_{j,m}\,\mathcal{N}(x;\, \mu_{j,m}, \Sigma_{j,m})$$

Feature extraction follows the standard MFCC protocol: 12 MFCCs plus log-energy, together with delta ($\Delta$) and delta-delta ($\Delta\Delta$) coefficients, yielding a 39-dimensional vector per frame (25 ms window, 10 ms step). Cepstral Mean and Variance Normalization (CMVN) is applied per speaker, and speaker variability is compensated using fMLLR during SAT.
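As a rough illustration of the emission density above (not MFA’s internal code, which lives in Kaldi’s C++ layer), a diagonal-covariance GMM log-likelihood can be evaluated with NumPy; the mixture weights, means, and covariances here are placeholder values:

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, covars):
    """Log-density of a diagonal-covariance GMM: log sum_m c_m N(x; mu_m, Sigma_m)."""
    x = np.asarray(x, dtype=float)
    log_comps = []
    for c, mu, var in zip(weights, means, covars):
        # Log of a diagonal Gaussian density, summed over the feature dimensions.
        ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        log_comps.append(np.log(c) + ll)
    return np.logaddexp.reduce(log_comps)  # numerically stable log-sum-exp

# Toy example: a 2-component GMM over 39-dimensional MFCC+delta+delta-delta frames.
rng = np.random.default_rng(0)
D, M = 39, 2
weights = np.array([0.6, 0.4])
means = rng.normal(size=(M, D))
covars = np.ones((M, D))          # unit diagonal covariances
frame = rng.normal(size=D)        # one observation frame
print(gmm_log_likelihood(frame, weights, means, covars))
```

In Kaldi the same quantity is computed per tied HMM state for every frame, forming the emission scores consumed by the decoder.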

The forced alignment process first maps orthographic words to phones via a pronunciation lexicon, expands this mapping into a triphone state sequence, then conducts Viterbi decoding to find the optimal state sequence—yielding time-stamped boundaries at transitions between phones and words (Rousso et al., 2024, Chodroff, 2018). The minimal boundary resolution is fixed by the feature frame rate (10 ms). Decoding graphs are compiled as weighted finite-state transducers (WFSTs), combining context-dependency, lexicon, and acoustic model information.
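A minimal sketch of the Viterbi step, assuming a left-to-right linear state sequence (one state per phone) and precomputed per-frame log emission scores; real MFA decodes over a full WFST-compiled triphone graph rather than this toy chain:

```python
import numpy as np

def viterbi_align(log_emis, self_loop=np.log(0.5), forward=np.log(0.5)):
    """Force-align T frames to S left-to-right states.

    log_emis: (T, S) array of per-frame log emission scores.
    Returns, for each frame, the index of the state it is aligned to.
    """
    T, S = log_emis.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_emis[0, 0]          # must start in the first state
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s] + self_loop
            move = score[t - 1, s - 1] + forward if s > 0 else -np.inf
            if stay >= move:
                score[t, s], back[t, s] = stay, s
            else:
                score[t, s], back[t, s] = move, s - 1
            score[t, s] += log_emis[t, s]
    # Backtrace from the final state (alignment must end in state S-1).
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return [int(s) for s in path[::-1]]

# Toy example: 6 frames, 3 "phones"; each phone scores best on two frames.
emis = np.full((6, 3), np.log(0.1))
for t, s in enumerate([0, 0, 1, 1, 2, 2]):
    emis[t, s] = np.log(0.8)
print(viterbi_align(emis))  # → [0, 0, 1, 1, 2, 2]
```

Phone boundaries fall where the returned state index changes, which is why the 10 ms frame step fixes the minimal boundary resolution.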

2. Installation, Workflow, and Inputs/Outputs

MFA is distributed as a Python package with bundled Kaldi C++ binaries and can be installed on Linux, macOS, and Windows. Prerequisites include Python 3.6+, SoX for audio conversion, and a functioning C++ compiler (Chodroff, 2018). Key components include mfa_download (model downloader), mfa_generate_dictionary (G2P lexicon generator), and mfa_align (main alignment utility).

The typical forced-alignment workflow involves:

  • Downloading pretrained acoustic and dictionary models via mfa_download.
  • Preparing audio (mono, 16 kHz WAV), transcripts (.txt/.lab or Praat TextGrids), and lexicon files.
  • Running forced alignment with mfa_align, which produces Praat TextGrids with phone and word tiers.
  • (Optionally) exporting alignments in alternative formats via mfa_export.
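The steps above can be scripted; the sketch below builds the legacy `mfa_align` command line as documented in the Chodroff (2018) tutorial era. The paths (`corpus`, `lexicon.txt`, `english.zip`, `aligned`) are placeholder assumptions, not fixed names:

```python
import subprocess

def mfa_align_cmd(corpus_dir, lexicon, acoustic_model, output_dir):
    """Build the legacy command line: mfa_align <corpus> <lexicon> <model> <output>."""
    return ["mfa_align", corpus_dir, lexicon, acoustic_model, output_dir]

# Placeholder paths (assumptions): adjust to your corpus layout. Keep the
# output directory separate from the input directory to avoid data loss.
cmd = mfa_align_cmd("corpus", "lexicon.txt", "english.zip", "aligned")
# Uncomment to actually run the aligner (requires an MFA installation):
# subprocess.run(cmd, check=True)
print(" ".join(cmd))
```

Current MFA releases expose the same workflow through `mfa align` subcommands; the argument order here follows the older standalone scripts named in this overview.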

Required inputs:

  • Audio: mono WAV files, 16 kHz sampling rate.
  • Transcripts: utterance-level intervals (TextGrid) or plain text files.
  • Pronunciation lexicon: mapping words to phone sequences.
  • Acoustic model zip file.

Outputs include TextGrids with time-aligned word/phone tiers, CTM files, and auxiliary alignment statistics (Chodroff, 2018).
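The CTM outputs mentioned above use Kaldi’s plain-text format (one token per line: utterance ID, channel, start time, duration, label, with times in seconds); a minimal parser, assuming that layout:

```python
def parse_ctm(lines):
    """Parse Kaldi-style CTM lines into (utt, start, end, label) tuples."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;"):   # skip blanks and comments
            continue
        utt, _channel, start, dur, label = line.split()[:5]
        start, dur = float(start), float(dur)
        entries.append((utt, start, start + dur, label))
    return entries

sample = [
    "utt1 1 0.00 0.25 hello",
    "utt1 1 0.25 0.50 world",
]
print(parse_ctm(sample))
# → [('utt1', 0.0, 0.25, 'hello'), ('utt1', 0.25, 0.75, 'world')]
```

TextGrid outputs carry the same intervals in Praat’s tiered format, which most phonetics toolchains read directly.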

3. Evaluation Metrics and Comparative Performance

MFA’s alignment quality is measured against manual ground truth using mean and median absolute deviation of boundaries, proportion of boundaries within set tolerances (e.g., ≤10/25/50/100 ms), and F₁ score at a given threshold (where F₁ reduces to accuracy) (Rousso et al., 2024, Kelley et al., 2023).
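Given paired predicted/reference boundary times, these metrics are straightforward to compute; a sketch with invented example times, using the tolerance levels quoted in this section:

```python
import numpy as np

def boundary_metrics(pred, ref, tolerances=(0.010, 0.025, 0.050, 0.100)):
    """Boundary-accuracy metrics for paired predicted/reference times (seconds)."""
    dev = np.abs(np.asarray(pred) - np.asarray(ref))
    metrics = {
        "mean_abs_dev": float(dev.mean()),
        "median_abs_dev": float(np.median(dev)),
    }
    for tol in tolerances:
        # Proportion of boundaries within the tolerance; with one-to-one
        # boundary pairing this coincides with F1 at that threshold.
        metrics[f"within_{int(tol * 1000)}ms"] = float((dev <= tol).mean())
    return metrics

# Invented example: four aligned boundaries per method.
pred = [0.102, 0.348, 0.731, 1.204]
ref  = [0.110, 0.330, 0.725, 1.300]
print(boundary_metrics(pred, ref))
```

This assumes boundaries are already matched one-to-one; when insertions or deletions are possible, precision and recall must be computed over matched pairs before taking F₁.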

On standard datasets (TIMIT, Buckeye):

  • Word-level: On TIMIT, 41.6% boundaries fall within 10 ms of ground truth, 72.8% within 25 ms, 89.4% within 50 ms. MFA median deviation is 12.5 ms, mean 21.9 ms; F₁@20 ms = 65.7% (Rousso et al., 2024).
  • Phone-level: 38.6% within 10 ms, up to 84.6% within 100 ms (Rousso et al., 2024).
  • Compared to end-to-end ASR methods (WhisperX, MMS), MFA achieves higher precision—particularly sub-20 ms boundaries—due to its explicit phoneme-state modeling and frame-level temporal resolution (Rousso et al., 2024).

Advances via interpolation (MAPS) further improve sub-frame precision: 60.48% of boundaries fall within 10 ms versus MFA’s 47.28% (a 27.9% relative increase); mean boundary error is reduced from 19.12 ms (MFA) to 17.80 ms (MAPS “crisp”+interpolation) (Kelley et al., 2023).

4. Multilingual and Cross-Domain Applications

MFA accommodates large-scale multilingual and cross-domain alignment tasks. Pretrained acoustic models are available for many languages, and practitioners can train new models or extend lexicons to handle under-resourced or related languages. For low-resource field data, transfer learning from large English models yields substantial gains: fine-tuned models halve boundary errors compared to scratch training, with mean phone boundary errors μ ≈ 1.2–1.5 ms for adapted versus μ ≈ 3–4 ms for scratch-trained (Tosolini et al., 9 Apr 2025).

Hyperparameters for multilingual deployment include feature type selection (MFCC/filterbanks), training epochs and learning rates, regularization, and alignment tolerances. Model adaptation protocols involve L2 regularization and staged fine-tuning, with performance validated using mean boundary error and vowel space dispersion metrics. Multilingual pooling further enhances generalization to related languages (Tosolini et al., 9 Apr 2025).

5. Integration with Downstream Speech and NLP Tasks

MFA is used not only for phone and word alignment, but as a foundational component in pipelines mining higher-level linguistic features from speech–text parallel data. In cross-domain Chinese word segmentation, MFA enables character-level alignment for mining candidate word boundaries: frame gaps between character spans are interpreted as pause durations, $d(y_i, y_{i+1}) = (b_{i+1} - e_i) \times 10\ \mathrm{ms}$, where $e_i$ and $b_{i+1}$ are the end and begin frames of adjacent characters (Wang et al., 2024). A subsequent probability-based filtering step employs a BERT-CRF model to assess boundary likelihoods, $p^b(y, i) = \sum_{l \in \{S_S, S_B, E_S, E_B\}} p(l \mid y, i)$, with high-confidence boundaries ($p^b \ge 0.5$) retained. The Complete-Then-Train (CTT) strategy incorporates these boundaries as hard constraints when training a new segmenter, yielding robust cross-domain gains over corpus-only training (Wang et al., 2024).
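A sketch of the pause-duration computation, assuming character-level alignment spans given as (begin_frame, end_frame) pairs at the 10 ms frame rate (the span values here are invented):

```python
def pause_durations_ms(spans, frame_ms=10):
    """Gaps between consecutive character spans, in milliseconds.

    spans: list of (begin_frame, end_frame) per character, in order.
    Implements d(y_i, y_{i+1}) = (b_{i+1} - e_i) * frame_ms.
    """
    return [
        (spans[i + 1][0] - spans[i][1]) * frame_ms
        for i in range(len(spans) - 1)
    ]

# Invented alignment: four characters; the long 120 ms gap after the second
# character is the kind of pause mined as a candidate word boundary.
spans = [(0, 12), (12, 25), (37, 50), (51, 64)]
print(pause_durations_ms(spans))  # → [0, 120, 10]
```

In the full pipeline, only candidates that also pass the BERT-CRF probability filter are kept as boundary constraints.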

6. Technical Limitations and Prospective Enhancements

MFA’s temporal precision is limited by the fixed frame rate (typically 10 ms); minimal phone duration is bounded by 3 frames (30 ms). Interpolation techniques, as demonstrated by MAPS, overcome this by sub-grid boundary estimation, matching the experimental needs of phoneticians for phenomena like short-lag VOT (Kelley et al., 2023). DNN models employing multi-label “tagging” improve framewise accuracy but do not inherently improve alignment under classic Viterbi-style decoders; joint acoustic-modeling/decoder enhancements are a future avenue.
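One generic sub-grid trick in this spirit (not necessarily the exact MAPS estimator) is parabolic interpolation of a framewise boundary score around its peak frame; a sketch with invented scores:

```python
def parabolic_refine(scores, k, frame_ms=10):
    """Refine a score peak at frame k to sub-frame precision.

    Fits a parabola through the scores at frames k-1, k, k+1 and returns
    the interpolated peak position in milliseconds.
    """
    if k == 0 or k == len(scores) - 1:
        return k * frame_ms            # no neighbors to interpolate with
    a, b, c = scores[k - 1], scores[k], scores[k + 1]
    denom = a - 2 * b + c
    offset = 0.0 if denom == 0 else 0.5 * (a - c) / denom
    return (k + offset) * frame_ms

# Invented framewise boundary scores peaking at frame 3, skewed toward frame 4,
# so the refined boundary lands slightly past the 30 ms grid point.
scores = [0.1, 0.2, 0.5, 0.9, 0.8, 0.3]
print(parabolic_refine(scores, 3))
```

The refined estimate (33 ms here) escapes the 10 ms grid, which is the kind of sub-frame precision MAPS targets for short-lag VOT measurement.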

Quantitative comparisons with end-to-end neural aligners such as NeuFA show slight numerical advantages for neural methods (mean absolute error for NeuFA: 23.7 ms word-level vs MFA 25.8 ms; phoneme: 15.7 ms vs MFA 18.0 ms), but HMM-based MFA alignments remain state of the art for fine-grained, explicit phone/word timing (Li et al., 2022). End-to-end systems benefit from simplified deployment but require large-scale annotated training data and more robust post-processing (Li et al., 2022).

7. Best Practices and User Recommendations

Successful MFA deployment requires consistent audio preparation (mono, 16 kHz WAV), careful transcript and lexicon management, and hyperparameter optimization. Users should ensure separation of input/output directories to prevent data loss, avoid boundary placement at utterance edges, preprocess noisy field data with denoising/voice activity detection, and supply speaker metadata for adaptation. For unseen or low-resource languages, adapting pretrained models is recommended over scratch training unless sufficient data is available (Tosolini et al., 9 Apr 2025).

Command-line usage is encapsulated in three main scripts, supporting end-to-end alignment into TextGrid format, with extensibility for custom acoustic and G2P model training (Chodroff, 2018). For advanced use cases (crosslingual adaptation, cross-domain mining, or sub-frame precision), integration of interpolation, transfer learning, and LLM-based post-processing is advisable (Wang et al., 2024, Kelley et al., 2023, Tosolini et al., 9 Apr 2025).
