
Hidden Markov Models (HMM)

Updated 10 February 2026
  • Hidden Markov Models (HMM) are stochastic models that use a hidden Markov process to represent sequential data with observable outputs.
  • They employ algorithms like forward-backward and EM, along with variational Bayesian methods, for efficient parameter estimation and inference.
  • Advanced HMM extensions incorporate neural emissions, hierarchical priors, and scalable techniques to enhance analysis in high-dimensional settings.

A hidden Markov model (HMM) is a latent variable model for sequential data where an unobserved (hidden) discrete-time Markov chain governs the evolution of an observed process. The model factorizes the observed sequence likelihood according to an initial distribution, a set of hidden state transition probabilities, and a set of state-dependent emission distributions governing the observables. HMMs are widely used for time series, language modeling, biological sequence analysis, single-molecule biophysics, and numerous more specialized domains. Modern developments extend HMMs with hierarchical Bayesian priors, mixture and neural emission distributions, scalable parameterizations, and flexible inference schemes applicable to high-dimensional and high-throughput settings.

1. Mathematical Structure and Basic Inference

An HMM comprises a discrete latent state process $z_{1:N}$ evolving on $\{1,\dots,J\}$ with initial distribution $\pi_j = P(z_1 = j)$, transition matrix $A_{ij} = P(z_{n+1} = j \mid z_n = i)$, and an emission model $x_n \sim P(x_n \mid z_n = j)$, which may be discrete or continuous. The complete-data likelihood, jointly over $z_{1:N}$ and $x_{1:N}$, factorizes as

$$p(x_{1:N}, z_{1:N} \mid \pi, A, \{\text{emission params}\}) = \pi_{z_1} \left[\prod_{n=1}^{N-1} A_{z_n, z_{n+1}}\right] \left[\prod_{n=1}^{N} P(x_n \mid z_n)\right].$$

Inference tasks include computing the marginal likelihood $p(x_{1:N})$, state posteriors $P(z_n = j \mid x_{1:N})$, and decoding the most probable latent sequence (Viterbi decoding).
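The factorization above is straightforward to evaluate for a concrete state path; a minimal NumPy sketch with illustrative (made-up) parameters:

```python
import numpy as np

# Toy 2-state HMM over 3 discrete symbols; all parameter values are illustrative.
pi = np.array([0.6, 0.4])                 # initial distribution pi_j = P(z_1 = j)
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])                # A[i, j] = P(z_{n+1} = j | z_n = i)
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])           # B[j, k] = P(x_n = k | z_n = j)

def complete_data_likelihood(z, x):
    """p(x_{1:N}, z_{1:N}) = pi_{z_1} * prod_n A[z_n, z_{n+1}] * prod_n B[z_n, x_n]."""
    p = pi[z[0]]
    for n in range(len(z) - 1):
        p *= A[z[n], z[n + 1]]
    for n in range(len(z)):
        p *= B[z[n], x[n]]
    return p

print(complete_data_likelihood([0, 0, 1], [0, 1, 2]))
```

Summing this quantity over all $J^N$ latent paths gives $p(x_{1:N})$, which is exactly the sum the forward recursion computes in $O(NJ^2)$ time instead.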

Efficient computation is supported by the forward–backward algorithm, which propagates forward messages $\alpha_n(j)$ and backward messages $\beta_n(j)$ recursively using the Markov and emission structure. This supports both E-step expected sufficient statistics and posterior state marginals for parameter estimation and prediction (Luong et al., 2012).
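In practice the recursions are run in scaled form (each forward message normalized to avoid underflow, as in Rabiner's classic treatment). A sketch for a discrete-emission HMM with illustrative parameters:

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])

def forward_backward(x):
    """Scaled forward-backward; returns state posteriors gamma and log p(x_{1:N})."""
    N, J = len(x), len(pi)
    alpha = np.zeros((N, J))
    beta = np.zeros((N, J))
    c = np.zeros(N)                               # per-step normalizers
    alpha[0] = pi * B[:, x[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for n in range(1, N):
        alpha[n] = (alpha[n - 1] @ A) * B[:, x[n]]
        c[n] = alpha[n].sum(); alpha[n] /= c[n]
    beta[-1] = 1.0
    for n in range(N - 2, -1, -1):
        beta[n] = (A @ (B[:, x[n + 1]] * beta[n + 1])) / c[n + 1]
    gamma = alpha * beta                          # P(z_n = j | x_{1:N})
    return gamma, np.log(c).sum()                 # log-likelihood = sum of log c_n

gamma, ll = forward_backward([0, 1, 2])
```

The same messages yield the pairwise posteriors $\xi_n(i,j)$ needed for the EM transitions update.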

2. Learning Algorithms: EM and Bayesian Methods

Maximum likelihood estimation for HMMs is performed using the Baum–Welch algorithm, a special case of the expectation–maximization (EM) algorithm. The procedure alternates:

  • E-step: Compute the expected sufficient statistics (occupancies $\gamma_n(j)$, transitions $\xi_n(i,j)$) under the current parameter values using forward–backward recursions.
  • M-step: Maximize the expected complete-data log-likelihood with respect to parameters, yielding closed-form updates for multinomial (Dirichlet-conjugate) initial and transition distributions, and—for standard emission types (e.g., Gaussian, Poisson)—updates for emission parameters (Luong et al., 2012).

Variational Bayesian (VB) methods extend this framework: instead of maximizing the likelihood, they approximate the full posterior over the parameters and hidden states with a tractable factorized distribution $q(\cdot)$, maximizing the evidence lower bound (ELBO). For HMMs with multivariate Gaussian emissions, conjugate Dirichlet priors are placed on $\pi$ and $A$, and normal–Wishart priors on emission mean–precisions. Posterior updates proceed analogously, but with parameter “soft counts” reflected in updated Dirichlet and normal–Wishart hyperparameters. VB-HMM regularizes parameters, prevents singularity pathologies (e.g., variance collapse), and supports automatic state pruning via posterior contraction of superfluous parameters (Gruhl et al., 2016).
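The conjugate structure makes the VB M-step a matter of adding expected ("soft") counts to the prior concentrations. A minimal sketch of the Dirichlet hyperparameter updates for $\pi$ and $A$ (the expected-count inputs are assumed to come from a VB E-step; the normal–Wishart emission updates are omitted):

```python
import numpy as np

def vb_dirichlet_updates(alpha0_pi, alpha0_A, gammas, xis):
    """Conjugate VB-HMM updates: posterior Dirichlet concentration =
    prior concentration + expected counts accumulated over sequences.

    alpha0_pi : prior concentrations for the initial distribution
    alpha0_A  : prior concentrations for the transition matrix rows
    gammas    : per-sequence state posteriors gammas[r][n, j]
    xis       : per-sequence pairwise posteriors xis[r][n, i, j]
    """
    post_pi = alpha0_pi + sum(g[0] for g in gammas)
    post_A = alpha0_A + sum(xi.sum(axis=0) for xi in xis)
    # Expected log-probabilities for the next E-step would be obtained via
    # digamma(post) - digamma(post.sum(axis=-1, keepdims=True)).
    return post_pi, post_A
```

States that receive negligible soft counts keep posteriors close to their priors, which is the mechanism behind the automatic state pruning mentioned above.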

3. Model Selection and State Number Estimation

Selecting the number of hidden states $J$ in an HMM is a nontrivial model selection problem, with classical approaches relying on penalized-likelihood criteria such as BIC. However, BIC can be inconsistent or misleading when the likelihood is unbounded (as in Gaussian emissions with unconstrained variance) or under strong multimodality.

Marginal likelihood methods provide a consistent alternative: for each $K$, integrate out all parameters (and states), yielding $p_K(y_{1:n}) = \int p(y_{1:n} \mid \phi_K)\, p_0(\phi_K)\, d\phi_K$, with a suitable prior $p_0$ over parameters. The optimal $K^*$ maximizes this marginal likelihood. The consistency of marginal likelihood selection is established: for $K < K^*$, $p_K(y_{1:n})/p_{K^*}(y_{1:n}) = O_{P^*}(e^{-cn})$ decays exponentially, while for $K > K^*$, the overfit penalty is polynomial in $n$, ensuring consistency (Chen et al., 2024). Efficient reciprocal importance sampling estimators leverage posterior MCMC samples for practical computation, outperforming BIC in low-SNR and small-sample regimes.
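The reciprocal importance sampling estimator of Chen et al. requires posterior MCMC samples; purely to illustrate the quantity being estimated, the sketch below uses naive Monte Carlo over prior draws (with assumed uniform Dirichlet priors), which is only workable for very small models and short sequences:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_loglik(x, pi, A, B):
    """log p(x_{1:N} | pi, A, B) via the scaled forward recursion."""
    alpha = pi * B[:, x[0]]
    c = alpha.sum(); ll = np.log(c); alpha = alpha / c
    for n in range(1, len(x)):
        alpha = (alpha @ A) * B[:, x[n]]
        c = alpha.sum(); ll += np.log(c); alpha = alpha / c
    return ll

def marginal_loglik_mc(x, K, n_symbols, n_draws=2000):
    """Naive Monte Carlo estimate of log p_K(x): average the likelihood
    over parameters drawn from uniform Dirichlet priors (assumed here)."""
    lls = np.array([
        forward_loglik(x,
                       rng.dirichlet(np.ones(K)),
                       rng.dirichlet(np.ones(K), size=K),
                       rng.dirichlet(np.ones(n_symbols), size=K))
        for _ in range(n_draws)])
    m = lls.max()                                  # log-sum-exp for stability
    return m + np.log(np.mean(np.exp(lls - m)))
```

One would then pick the $K$ with the largest estimate; the prior-sampling estimator shown here degrades rapidly with sequence length, which is exactly why posterior-based estimators are used in practice.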

Mixture-emission HMMs pose a related model selection problem: the hierarchical Markov-aware merge-and-prune strategy combines initialization with many mixture subcomponents, hierarchical merging by likelihood-based criteria, and information-theoretic selection (e.g., $\mathrm{ICL}_S$) to robustly identify the true number of hidden states and emission substructure (Volant et al., 2012).

4. Extensions: Flexible Emissions, Priors, and Large-Scale Models

HMM emission models may go beyond simple parametric families:

  • Mixture Emissions: Each per-state emission distribution may itself be a finite mixture of parametric densities. The emission is then $P(x_t \mid S_t = d) = \sum_k \lambda_{dk}\, \phi(x_t; \gamma_{dk})$, incorporated into EM via an extra latent variable for the subcomponent and additional responsibilities in the E-step (Volant et al., 2012).
  • Neural Emission Models (GenHMM): Each state’s emissions are parameterized by mixture-of-invertible-flow neural generative models, allowing for tractable density estimation, efficient likelihood computation, and capturing highly non-Gaussian, multimodal distributions. GenHMM applies EM with per-state, per-component flows and mixture weights, yielding significantly improved classification performance in speech and biological sequence tasks compared to GMM-based HMMs (Liu et al., 2019).
  • Hierarchically Coupled Priors and VEB: When multiple observation traces share an underlying kinetic scheme but exhibit heterogeneous measurement statistics, hierarchically coupled HMMs place shared hyperpriors over local per-trace HMMs. The VEB algorithm iterates between per-trace variational updates and empirical Bayes hyperparameter maximization, yielding robust, interpretable consensus models (Meent et al., 2013).
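For the mixture-emission case, evaluating $\log P(x_t \mid S_t = d)$ stably is a log-sum-exp over the subcomponents. A sketch for 1-D Gaussian subcomponents (the array shapes and names are illustrative assumptions):

```python
import numpy as np

def mixture_emission_logpdf(x, lam, mu, sigma):
    """log P(x_t | S_t = d) for per-state Gaussian mixture emissions.

    lam[d, k]             : mixture weights of subcomponent k in state d
    mu[d, k], sigma[d, k] : subcomponent means and standard deviations
    Returns one log-density per state d.
    """
    # Per-component log weight + log Gaussian density, shape (D, K)
    comp = (np.log(lam)
            - 0.5 * np.log(2 * np.pi * sigma**2)
            - 0.5 * ((x - mu) / sigma)**2)
    # Log-sum-exp over subcomponents k for numerical stability
    m = comp.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(comp - m).sum(axis=1, keepdims=True))).ravel()
```

The same per-component terms, normalized within each state, give the subcomponent responsibilities used in the extended E-step.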

For scalability and parameter efficiency, several innovations include:

  • Blocked Emissions and Neural Parameterization: Grouping states and emissions via block structure (e.g., Brown clusters), learning low-rank neural parameterizations for transition and emission matrices with shared embeddings, and applying state dropout to sparsify the active state set at each timestep. These together permit efficient, exact inference and learning in HMMs with very large ($|\mathcal{Z}| \gtrsim 10^4$) state spaces (Chiu et al., 2020).
  • Dense Embeddings (DenseHMM): Transition and emission probabilities are constructed as kernelized (e.g., exp-dot-product) functions of learned dense vectors for states and symbols, supporting both gradient-based and modified EM optimization with no hard parameter tables (Sicking et al., 2020).
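A minimal sketch of the dense-embedding idea with an exp-dot-product (softmax) kernel; the dimensions, variable names, and use of separate "from"/"to" state embeddings are illustrative assumptions rather than the exact DenseHMM parameterization:

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax: turns a score matrix into a stochastic matrix."""
    S = S - S.max(axis=1, keepdims=True)     # stabilize before exponentiating
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
d, J, V = 4, 3, 5                            # embedding dim, #states, #symbols
U = rng.normal(size=(J, d))                  # state embeddings ("from" role)
W = rng.normal(size=(J, d))                  # state embeddings ("to" role)
Esym = rng.normal(size=(V, d))               # symbol embeddings

A = softmax_rows(U @ W.T)                    # J x J transition probabilities
B = softmax_rows(U @ Esym.T)                 # J x V emission probabilities
```

All probabilities now flow through the embeddings, so gradient-based training updates $O((J+V)\,d)$ parameters rather than $O(J^2 + JV)$ table entries.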

5. Advanced Inference: Non-Standard Observations and Spectral Methods

For settings where direct EM is prohibitive or estimation requires leveraging non-sequential statistics:

  • Non-negative Matrix Factorization (NMF): Higher-order (prefix–suffix) statistics are aggregated into a large non-negative matrix $F^{p,s}$; NMF (with I-divergence) identifies the minimal rank ($N$) and parameters (state emission/transition) via algebraic extraction. This approach supports consistent order estimation and parameter recovery and sidesteps the need for repeated forward–backward passes (0809.4086).
  • Pairwise Co-occurrence Factorizations: For large-sample but short-memory settings, a second-order output probability matrix $\Omega$ is factorized as $B \Theta B^\top$, with determinant-minimizing regularization under a “sufficiently scattered” emission matrix assumption, yielding identifiability up to permutation. Application to topic modeling (HTMM) shows improved topic-coherence and perplexity over standard bag-of-words and NMF anchor-word methods (Huang et al., 2018).
  • Dealing with Unlabeled Missingness: If observation timestamps are unknown and observation sequences are partial, Gibbs samplers alternating reconstruction of hidden alignments and state paths with model parameter updates allow consistent HMM parameter learning without acyclicity or silent-state constraints (Perets et al., 2022).
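The pairwise statistic can be checked on a toy model: build $\Omega$ (entries $P(x_t = u,\, x_{t+1} = v)$) from assumed parameters, taking $\Theta = \mathrm{diag}(\pi_{\mathrm{stat}})\, A$ as one common convention, then compare against an empirical estimate from a simulated chain:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.7, 0.3], [0.2, 0.8]])                  # illustrative transitions
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])        # rows: per-state emissions
pi_stat = np.array([0.4, 0.6])                          # stationary distribution of A

# Model-implied co-occurrences: Omega[u, v] = sum_{s,s'} pi_s A[s,s'] B[s,u] B[s',v]
Omega = B.T @ np.diag(pi_stat) @ A @ B

# Empirical estimate from a simulated sequence
N = 50_000
z = np.zeros(N, dtype=int)
z[0] = rng.choice(2, p=pi_stat)
for n in range(1, N):
    z[n] = rng.choice(2, p=A[z[n - 1]])
x = np.array([rng.choice(3, p=B[j]) for j in z])
Omega_hat = np.zeros((3, 3))
for u, v in zip(x[:-1], x[1:]):
    Omega_hat[u, v] += 1
Omega_hat /= Omega_hat.sum()
```

The factorization methods run this logic in reverse: estimate $\Omega$ from data, then recover $B$ (up to permutation) under the scattering assumption.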

6. Modern Neural and Topological HMM Generalizations

Recent work unifies HMM computation with neural modeling:

  • RNN Formulation (HMRNN): The canonical forward-pass of an HMM is recast as recurrent network state updates; when parameters are network weights, gradient descent matches Baum–Welch EM for a bare HMM, but also enables seamless integration of covariate networks and auxiliary losses (e.g., for disease progression forecasting with patient covariates) (Baucum et al., 2020).
  • End-to-End Neural HMMs with Explicit Transition Models: Explicit, learnable, local transition models permit joint training of both emission and transition posteriors, supporting enhanced alignment quality and modular hybrid pipelines in end-to-end speech recognition, with efficient GPU-based forward–backward (Mann et al., 2023).
  • HMMs in Locally Convex Topological Vector Spaces: For infinite-dimensional observations (e.g., functions or stochastic process sample paths), emission densities are replaced by Onsager–Machlup functionals over Cameron–Martin spaces. Classical forward–backward and EM algorithms generalize to this setting, supporting applications in time series of curves, functional neuroimaging, and more (Kashlak et al., 2022).
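The forward recursion indeed has the shape of a recurrent cell: the normalized forward message plays the role of the hidden state, updated multiplicatively by the transition matrix and the current emission likelihood. A sketch with illustrative parameters:

```python
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])

def rnn_cell(alpha, x_n):
    """One recurrent step: propagate the message through A, reweight by the
    emission likelihood of x_n, and renormalize."""
    a = (alpha @ A) * B[:, x_n]
    return a / a.sum()

# Run the "RNN" over the observation sequence [0, 1, 2]
alpha = pi * B[:, 0]
alpha = alpha / alpha.sum()
for x_n in [1, 2]:
    alpha = rnn_cell(alpha, x_n)        # filtered posterior P(z_n | x_{1:n})
```

In the HMRNN view, `pi`, `A`, and `B` become network weights (suitably constrained to the simplex), so the same cell can be trained by backpropagation alongside covariate networks and auxiliary losses.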

7. Applications and Current Directions

HMMs and their extensions underpin methodologies in change-point analysis (Luong et al., 2012), topic modeling (Huang et al., 2018), speech/language modeling (Chiu et al., 2020), single-molecule biophysics (Meent et al., 2013), time series from biology, finance, and genomics, and modern neural sequence modeling pipelines (Mann et al., 2023, Liu et al., 2019, Baucum et al., 2020). The continuing evolution of HMMs centers on expressivity (richer emissions, priors), scalability (neural and kernel parameterizations, spectral methods), robust inference (Bayesian and variational methods), and generalization to irregular/missing data, hierarchical datasets, and high-dimensional observations.

Noteworthy contemporary directions include robust Bayesian model selection by marginal likelihood (Chen et al., 2024), neural and flow-based emission models (Liu et al., 2019), hybrid neural–latent frameworks (Baucum et al., 2020, Mann et al., 2023), and infinite-dimensional and geometric extensions (Kashlak et al., 2022). These advances position HMMs as both a fundamental framework and a flexible building block within modern probabilistic and neural sequence modeling pipelines.
