Probabilistic Count Matrix Factorization
- Probabilistic Count Matrix Factorization (pCMF) is a Bayesian latent factor model designed for non-negative count data, incorporating Poisson likelihoods and Gamma priors.
- It effectively handles over-dispersion, zero inflation, and sparsity, making it suitable for high-dimensional applications like single-cell RNA-seq and document-term matrices.
- Efficient inference via variational EM and Monte Carlo EM enables rapid convergence, robust clustering, and dimension reduction in complex datasets.
Probabilistic Count Matrix Factorization (pCMF) is a class of generative latent factor models specifically tailored for non-negative valued count data, with particular effectiveness for sparse, over-dispersed, and zero-inflated matrices such as those encountered in single-cell RNA sequencing and document-term matrices. The defining characteristic of pCMF is its probabilistic hierarchical structure built upon Poisson likelihoods and Gamma-distributed priors for latent factors, along with explicit mechanisms for zero inflation and sparsity. This framework generalizes standard non-negative matrix factorization (NMF) to a fully Bayesian context and provides a model-based foundation for dimension reduction, clustering, and data compression in regimes characterized by high sparsity and heteroscedasticity (Durif et al., 2017, Filstroff et al., 2018, Acharya et al., 2015).
1. Hierarchical Model Specification
At the core of pCMF is the Poisson–Gamma matrix factor model. Given an observed count matrix $X \in \mathbb{N}^{n \times p}$ (e.g., cells by genes or documents by words), each count is independently modeled as $X_{ij} \sim \text{Poisson}\big(\sum_{k=1}^{K} U_{ik} V_{jk}\big)$, where $U \in \mathbb{R}_+^{n \times K}$ and $V \in \mathbb{R}_+^{p \times K}$ are latent, non-negative matrices representing cell/gene (or document/word) embeddings.
To accommodate over-dispersion, Gamma priors are placed on the factors: $U_{ik} \sim \Gamma(\alpha_{k,1}, \alpha_{k,2})$ and $V_{jk} \sim \Gamma(\beta_{k,1}, \beta_{k,2})$, so that each count is marginally a negative binomial (Gamma–Poisson) variable.
Zero inflation, often arising from dropout effects, is incorporated via latent Bernoulli indicators $D_{ij} \sim \text{Bernoulli}(\pi_j)$, with $X_{ij} \sim D_{ij} \cdot \text{Poisson}\big(\sum_k U_{ik} V_{jk}\big)$, yielding an explicit marginal probability for zeros: $P(X_{ij} = 0) = (1 - \pi_j) + \pi_j \exp\big(-\sum_k U_{ik} V_{jk}\big)$.
Gene selection and sparsity in factor usage are implemented via spike-and-slab priors: $V_{jk} \sim (1 - s_{jk})\,\delta_0 + s_{jk}\,\Gamma(\beta_{k,1}, \beta_{k,2})$, with $s_{jk} \sim \text{Bernoulli}(p_{jk})$ indicating whether gene $j$ contributes to factor $k$.
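As a concrete illustration, the generative hierarchy above can be sampled directly (the spike-and-slab layer is omitted for brevity; all parameter names and values in this sketch are illustrative assumptions, not the papers' notation or defaults):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pcmf(n=200, p=100, K=5, alpha=(2.0, 1.0), beta=(2.0, 1.0), pi=0.7):
    """Draw one synthetic count matrix from the zero-inflated
    Poisson-Gamma hierarchy (illustrative sketch)."""
    # Gamma(shape, rate) priors; numpy parameterizes by scale = 1/rate
    U = rng.gamma(alpha[0], 1.0 / alpha[1], size=(n, K))   # cell factors
    V = rng.gamma(beta[0], 1.0 / beta[1], size=(p, K))     # gene loadings
    rate = U @ V.T                                         # Poisson intensities
    D = rng.binomial(1, pi, size=(n, p))                   # D=1: entry not dropped out
    X = D * rng.poisson(rate)                              # zero-inflated counts
    return X, U, V

X, U, V = sample_pcmf()
```

With a dropout probability of $1 - \pi = 0.3$, roughly 30% of entries are forced to zero regardless of the Poisson rate, mimicking the excess zeros seen in single-cell data.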
Extensions to nonparametric Bayesian settings include gamma process priors, as in the dynamic Poisson factor analysis (DPFA), and Dirichlet-distributed loadings for topic models, with Bernoulli–Poisson links enabling modeling of binary data (Acharya et al., 2015).
2. Inference Algorithms: Variational EM and Monte Carlo EM
Posterior inference for pCMF is intractable analytically due to the non-conjugate structure induced by zero-inflation and spike-and-slab priors. Efficient inference is achieved primarily through:
Variational EM (Durif et al., 2017):
- The posterior is approximated via a mean-field variational distribution, factorizing over cells, genes, and factors.
- Each group of latent variables ($U$, $V$, the dropout indicators $D$, and the selection indicators $s$) is assigned a standard exponential-family distribution: Gamma for latent factors, Multinomial for auxiliary counts, and Bernoulli for selection and dropout variables.
- The Evidence Lower Bound (ELBO) is maximized by coordinate ascent, with all updates available in closed form:
- Shape and rate parameters of the variational Gamma distributions are updated by matching to sufficient statistics of expected counts.
- Multinomial weights for auxiliary counts are updated in proportion to the exponentiated expected log-factors.
- Dropout posteriors updated for observed zeros via the logit of the dropout probability.
- Hyperparameters (the Gamma shape/rate parameters $\alpha$ and $\beta$) are updated by moment-matching.
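For instance, the closed-form dropout update for an observed zero reduces to a sigmoid of the prior log-odds shifted down by the Poisson rate, since $q(D_{ij}=1 \mid X_{ij}=0) = \pi_j e^{-\lambda_{ij}} / \big((1-\pi_j) + \pi_j e^{-\lambda_{ij}}\big)$. A minimal sketch (variable names assumed):

```python
import numpy as np

def dropout_posterior(pi_j, lam_ij):
    """Posterior probability q(D_ij = 1 | X_ij = 0) that an observed
    zero is a genuine Poisson zero rather than a dropout event:
    sigmoid of logit(pi_j) minus the Poisson rate lam_ij."""
    logit = np.log(pi_j) - np.log1p(-pi_j) - lam_ij
    return 1.0 / (1.0 + np.exp(-logit))
```

Large rates $\lambda_{ij}$ push the posterior toward 0, i.e., a zero at a highly expressed gene is attributed to dropout.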
Monte Carlo EM for MMLE (Filstroff et al., 2018):
- In the Gamma–Poisson (GaP) model variant, the posterior is marginalized over the latent score/activation matrix $H$, yielding a closed-form marginal likelihood involving hidden negative-multinomial auxiliary counts.
- The MCEM-C variant uses short Gibbs-augmented Monte Carlo sampling of component counts to approximate the complete-data likelihood, and the M-step updates dictionary weights in closed form.
- This approach induces a group-sparsity penalty on column norms of $W$ and prunes superfluous components in highly overparameterized models without explicit sparsity-inducing priors.
Both inference paradigms exhibit rapid convergence and scale linearly per iteration in the matrix size and number of factors. The MCEM variant, in particular, demonstrates robustness to factor number specification and stability in large, highly sparse matrices (Durif et al., 2017, Filstroff et al., 2018).
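The pruning effect described above can be mimicked post hoc by thresholding dictionary column norms; the snippet below is an illustrative heuristic, not the procedure from the papers:

```python
import numpy as np

def prune_components(W, rel_tol=1e-3):
    """Drop dictionary columns whose norm is negligible relative to the
    largest column, emulating the automatic component pruning induced by
    marginal-likelihood estimation (threshold is an assumed choice)."""
    norms = np.linalg.norm(W, axis=0)
    keep = norms > rel_tol * norms.max()
    return W[:, keep], keep

W = np.array([[1.0, 1e-6, 2.0],
              [0.5, 1e-6, 1.0]])
W_pruned, keep = prune_components(W)
```

Here the near-zero middle column is discarded, leaving two effective components.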
3. Geometry, Low-Dimensional Embedding, and Statistical Metrics
pCMF induces a data geometry distinct from standard PCA. While PCA projects data by minimizing squared error (the Euclidean/Bregman divergence for Gaussians), pCMF leverages the Poisson likelihood, whose associated loss $d(x, \hat{x}) = x \log(x/\hat{x}) - x + \hat{x}$ is the natural Bregman divergence for Poisson data.
Low-dimensional representations for visualization or clustering are obtained directly from the posterior means of the factor matrices: for cells, the $K$-dimensional embedding is $\hat{U} = \mathbb{E}_q[U]$; for genes, $\hat{V} = \mathbb{E}_q[V]$. These axes can be interpreted as latent metagenes or meta-cells.
A generalization of explained variance to count models is provided by the "percentage of explained deviance", $1 - D(X, \hat{U}\hat{V}^T) / D(X, \bar{X})$, where $D$ denotes the Poisson deviance and $\bar{X}$ is the null model of column means. pCMF achieves notably higher explained deviance (70–80%) compared to PCA or Poisson NMF (30–50%) on benchmark single-cell data (Durif et al., 2017).
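Under the column-mean null model (an assumed but standard baseline), the explained deviance can be computed as follows; function names are illustrative:

```python
import numpy as np

def poisson_deviance(X, Lam, eps=1e-12):
    """Poisson deviance D(X, Lambda) = 2 * sum(x*log(x/lam) - x + lam),
    with the convention 0*log(0) = 0."""
    X = np.asarray(X, dtype=float)
    term = np.where(X > 0, X * np.log(X / (Lam + eps)), 0.0)
    return 2.0 * np.sum(term - X + Lam)

def explained_deviance(X, Lam_hat):
    """1 - D(X, Lam_hat) / D(X, null), null = column means (assumed)."""
    null = np.tile(np.asarray(X, dtype=float).mean(axis=0), (X.shape[0], 1))
    return 1.0 - poisson_deviance(X, Lam_hat) / poisson_deviance(X, null)
```

A perfect reconstruction gives a score of 1, and the null model itself scores 0, mirroring the role of $R^2$ for Gaussian models.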
4. Applications, Clustering, and Empirical Performance
pCMF has demonstrated particular relevance in single-cell RNA-seq analysis, where data matrices are large, zero-inflated, and over-dispersed. Applications include:
- Unsupervised cell-type identification via $K$-dimensional pCMF embeddings followed by k-means or hierarchical clustering. Adjusted Rand Index (ARI) measurements on both simulated and real benchmarks confirm systematic improvement over PCA, NMF, and zero-inflated factor analysis (ZIFA), especially at high dropout and noise rates.
- Visualization of high-dimensional data, where pCMF’s log-mean factor embeddings provide more coherent biological separation than principal component axes or t-SNE, with the additional advantage of being generative and able to embed new observations directly.
- Dimensionality reduction for downstream gene selection and feature extraction, thanks to the model’s spike-and-slab priors (Durif et al., 2017).
5. Comparisons with Alternative Methods
| Method | Model Formulation | Limitations Addressed by pCMF |
|---|---|---|
| PCA | ℓ₂ loss, Gaussian noise | Ignores discreteness, over-dispersion |
| t-SNE | Non-parametric, KL-divergence | Not generative, sensitive to hyperparameters |
| NMF | Non-negative, Poisson or ℓ₂ loss | Cannot model zero-inflation/sparsity in factors |
| ZIFA | Zero-inflated Gaussian factors on log-counts | Approximates dropouts, does not model discrete counts |
pCMF directly models zero-inflated discrete (count) data, is robust to heavy tails, and provides generative cell and gene representations, in contrast to t-SNE or classical NMF. It is computationally competitive (per-iteration cost $O(npK)$), typically requiring 50–100 iterations for convergence. In over-specified settings, MMLE-based pCMF with marginalization of latent scores inherently induces automatic factor pruning, a property not found in alternating MAP-based updates for (W,H) (Durif et al., 2017, Filstroff et al., 2018).
6. Practical Implementation and Model Selection
Recommended practical steps include:
- Initialization by sampling latent factors from Gamma distributions whose moments are matched to the scale of the observed counts; dropout priors ($\pi_j$) set to the observed proportion of non-zeros per gene; selection priors ($p_{jk}$) initialized via variance heuristics.
- Likelihood hyperparameters are commonly set to vague (weakly informative) values, with automatic updates via the M-step.
- Model order selection (number of latent factors $K$) via the explained-deviance curve, selecting the "elbow" point.
- Convergence monitored by the relative increase in the ELBO or by parameter change, with termination at a small relative threshold or a fixed iteration count.
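The elbow rule for choosing $K$ can be automated with a simple second-difference heuristic over the explained-deviance curve; this is one possible implementation, not the papers' prescription (which relies on visual inspection):

```python
import numpy as np

def select_K_elbow(deviances):
    """Given explained deviance for K = 1, 2, ..., return the K after
    which the marginal gain drops the most (second-difference heuristic)."""
    d = np.asarray(deviances, dtype=float)
    gains = np.diff(d)              # gains[i]: benefit of adding factor i+2
    curvature = np.diff(gains)      # curvature[j]: gain change at K = j+2
    return int(np.argmin(curvature)) + 2  # most negative curvature -> elbow K
```

For a curve such as [0.2, 0.7, 0.72, 0.73], the heuristic selects $K = 2$, where the curve flattens.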
- The R package “pCMF” (with C++ core) implements these methods (github.com/gdurif/pCMF) (Durif et al., 2017).
7. Extensions and Theoretical Insights
Generalizations include static and dynamical formulations: dynamic Poisson factor analysis (DPFA) models temporal count matrices using a gamma-Markov evolution for factors, further admitting nonparametric Bayesian extension through gamma processes as the number of factors grows unbounded (Acharya et al., 2015). Marginal likelihood derivations in GaP models (Filstroff et al., 2018) establish that integrating out latent scores confers a natural, group-wise sparsity penalty on the dictionary matrix, providing self-regularization and robustness to over-sizing of the latent dimension. This suggests that pCMF can effectively operate without explicit nonparametric priors to prune redundant factors, an important property in high-dimensional regimes and model selection.
The theoretical and algorithmic advances in pCMF establish it as a foundational method for probabilistic modeling of high-dimensional count matrices, with broad applicability to genomics, natural language processing, and beyond (Durif et al., 2017, Filstroff et al., 2018, Acharya et al., 2015).