Dirichlet-Compound-Multinomial Distribution
- The Dirichlet-Compound-Multinomial distribution is a hierarchical model that integrates a Dirichlet prior with a multinomial likelihood to account for overdispersion and positive correlation in count data.
- It provides closed-form expressions for the marginal PMF, moments, and covariances, enabling tractable Bayesian inference and efficient maximum likelihood estimation.
- Applications span forensic genetics, machine learning, and ecology, with uses in online prediction and hierarchical modeling to manage unobserved heterogeneity.
The Dirichlet-Compound-Multinomial (DCM) distribution—also known as the Dirichlet-multinomial or Pólya distribution—is a fundamental discrete multivariate model describing overdispersed multinomial counts. It arises as the marginal distribution induced by integrating a multinomial likelihood against a Dirichlet prior on the category probabilities. DCMs generalize the multinomial model, enabling modeling of positive correlations and unobserved heterogeneity (e.g., in allele frequencies or word proportions). DCMs are extensively used in Bayesian inference, information theory, forensic genetics, plant ecology, and machine learning, with applications ranging from DNA mixture analysis to online sequence estimation (Tvedebrink et al., 2014, Hutter, 2013, Damgaard, 2018, Sklar, 2014).
1. Construction, Hierarchical Model, and Closed Form
The DCM distribution is defined via a two-stage hierarchical process. For $K$ categories and $n$ trials, let $p = (p_1, \ldots, p_K)$ denote the vector of latent category probabilities and $x = (x_1, \ldots, x_K)$ the observed counts, with $\sum_{k=1}^K x_k = n$. The hierarchical specification is:
- $p \sim \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_K)$, with all $\alpha_k > 0$.
- Conditional on $p$, $x \sim \mathrm{Multinomial}(n, p)$.
The marginal PMF, after integrating out $p$, is
$$P(x \mid \alpha) = \binom{n}{x_1, \ldots, x_K} \frac{\Gamma(\alpha_0)}{\Gamma(n + \alpha_0)} \prod_{k=1}^K \frac{\Gamma(x_k + \alpha_k)}{\Gamma(\alpha_k)},$$
where $\alpha_0 = \sum_{k=1}^K \alpha_k$ and the $x_k$ are non-negative integers summing to $n$ (Tvedebrink et al., 2014, Damgaard, 2018). This forms an exchangeable law across the $n$ categorical variables and generalizes the beta-binomial, which is the $K = 2$ case. The DCM is also known as the Pólya urn distribution due to its equivalence with sampling under reinforcement.
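As a concrete check on the closed form, the PMF can be evaluated numerically via log-gamma functions. The sketch below (standard-library Python, with hypothetical parameter values) verifies that the probabilities sum to one over the full support for small $n$ and $K$:

```python
from math import lgamma, exp

def dcm_logpmf(x, alpha):
    """Log-PMF of the Dirichlet-compound-multinomial (DCM) distribution."""
    n, a0 = sum(x), sum(alpha)
    out = lgamma(n + 1) - sum(lgamma(xk + 1) for xk in x)   # multinomial coefficient
    out += lgamma(a0) - lgamma(n + a0)                      # Dirichlet normalisation ratio
    out += sum(lgamma(xk + ak) - lgamma(ak) for xk, ak in zip(x, alpha))
    return out

def compositions(n, k):
    """All non-negative integer count vectors of length k summing to n."""
    if k == 1:
        yield (n,)
        return
    for i in range(n + 1):
        for rest in compositions(n - i, k - 1):
            yield (i,) + rest

alpha = (0.5, 1.0, 2.0)   # illustrative pseudocount parameters
total = sum(exp(dcm_logpmf(x, alpha)) for x in compositions(4, len(alpha)))
# total is 1 up to floating-point error: the PMF is properly normalised
```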
2. Moments, Overdispersion, and Parametric Interpretation
The moments illustrate the “extra-multinomial” correlation induced by Dirichlet mixing. Setting $\alpha_0 = \sum_k \alpha_k$ and $\pi_k = \alpha_k / \alpha_0$, the principal moments are:
$$\mathbb{E}[x_k] = n\pi_k, \qquad \mathrm{Var}(x_k) = n\pi_k(1 - \pi_k)\,\frac{n + \alpha_0}{1 + \alpha_0}, \qquad \mathrm{Cov}(x_j, x_k) = -n\pi_j\pi_k\,\frac{n + \alpha_0}{1 + \alpha_0} \quad (j \neq k).$$
As $\alpha_0 \to \infty$, the DCM contracts to a multinomial; as $\alpha_0 \to 0$, overdispersion becomes maximal and all counts cluster in a single category (Tvedebrink et al., 2014, Damgaard, 2018). The $\alpha_k$ may be interpreted as pseudocounts, with the total $\alpha_0$ controlling the strength of overdispersion. The limit $\alpha_0 \to \infty$ with $\pi$ fixed, i.e., a degenerate Dirichlet prior, yields classical multinomial inference.
Table: Moment Formulas for DCM Counts
| Statistic | Formula | Notes |
|---|---|---|
| $\mathbb{E}[x_k]$ | $n\pi_k$ | Prior mean times $n$ |
| $\mathrm{Var}(x_k)$ | $n\pi_k(1-\pi_k)\dfrac{n+\alpha_0}{1+\alpha_0}$ | Overdispersion appears |
| $\mathrm{Cov}(x_j, x_k)$ ($j \neq k$) | $-n\pi_j\pi_k\dfrac{n+\alpha_0}{1+\alpha_0}$ | Negative covariance |
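These moment formulas can be verified by exact enumeration over the support for a small case. A minimal standard-library sketch, using illustrative parameter values:

```python
from math import lgamma, exp

def dcm_pmf(x, alpha):
    """PMF of the Dirichlet-compound-multinomial distribution."""
    n, a0 = sum(x), sum(alpha)
    lp = lgamma(n + 1) - sum(lgamma(xk + 1) for xk in x)
    lp += lgamma(a0) - lgamma(n + a0)
    lp += sum(lgamma(xk + ak) - lgamma(ak) for xk, ak in zip(x, alpha))
    return exp(lp)

def compositions(n, k):
    """All non-negative integer count vectors of length k summing to n."""
    if k == 1:
        yield (n,)
        return
    for i in range(n + 1):
        for rest in compositions(n - i, k - 1):
            yield (i,) + rest

n, alpha = 5, (0.8, 1.2, 2.0)            # illustrative values
a0 = sum(alpha)
pi = [a / a0 for a in alpha]
support = list(compositions(n, len(alpha)))
probs = [dcm_pmf(x, alpha) for x in support]

# Exact moments by enumeration
mean1 = sum(p * x[0] for x, p in zip(support, probs))
mean2 = sum(p * x[1] for x, p in zip(support, probs))
var1 = sum(p * x[0] ** 2 for x, p in zip(support, probs)) - mean1 ** 2
cov12 = sum(p * x[0] * x[1] for x, p in zip(support, probs)) - mean1 * mean2
infl = (n + a0) / (1 + a0)               # overdispersion factor from the table
assert abs(mean1 - n * pi[0]) < 1e-9
assert abs(var1 - n * pi[0] * (1 - pi[0]) * infl) < 1e-9
assert abs(cov12 + n * pi[0] * pi[1] * infl) < 1e-9
```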
3. Multivariate Generalization and Hierarchical Models
The multivariate Dirichlet-multinomial (MDM) generalizes the DCM to multiple related multinomial draws sharing a common or hierarchically structured set of probabilities. In forensic genetics, this structure arises when modeling counts of alleles across multiple contributors, all drawing from a latent allele frequency vector $p \sim \mathrm{Dirichlet}(\alpha)$, with each contributor's counts conditionally multinomial given $p$ (Tvedebrink et al., 2014). Marginals and conditionals of the MDM law retain DCM structure, facilitating tractable computations within Bayesian networks and hierarchical models (Tvedebrink et al., 2014, Damgaard, 2018).
Reparametrizations, such as $(\pi, \delta)$ (mean vector and aggregation), allow direct modeling of mean effects and overdispersion: $\pi_k = \alpha_k / \alpha_0$, $\delta = 1/(1 + \alpha_0)$, $\mathbb{E}[x_k] = n\pi_k$, and
$$\mathrm{Var}(x_k) = n\pi_k(1 - \pi_k)\bigl(1 + (n - 1)\delta\bigr),$$
enabling use in hierarchical and covariate models as in plant cover applications (Damgaard, 2018).
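The two parametrizations agree because, with $\delta = 1/(1+\alpha_0)$, the overdispersion factor $(n+\alpha_0)/(1+\alpha_0)$ equals $1+(n-1)\delta$. A quick numeric check with hypothetical values:

```python
# Hypothetical values for n (trials) and a0 (total concentration)
n, a0 = 12, 3.5
delta = 1.0 / (1.0 + a0)   # aggregation parameter
# (n + a0)/(1 + a0) and 1 + (n - 1)*delta are the same overdispersion factor
assert abs((n + a0) / (1 + a0) - (1 + (n - 1) * delta)) < 1e-12
```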
4. Parameter Estimation and Computational Methods
For parameter estimation, maximum likelihood is typically employed, with the log-likelihood (up to a constant not depending on $\alpha$) across $N$ independent DCM draws (count vectors $x^{(1)}, \ldots, x^{(N)}$) expressed as:
$$\ell(\alpha) = \sum_{i=1}^N \Bigl[ \log\Gamma(\alpha_0) - \log\Gamma(n_i + \alpha_0) + \sum_{k=1}^K \bigl( \log\Gamma(x^{(i)}_k + \alpha_k) - \log\Gamma(\alpha_k) \bigr) \Bigr],$$
where $\alpha_0 = \sum_k \alpha_k$ and $n_i = \sum_k x^{(i)}_k$. Gradients and Hessians are available in closed form using digamma and trigamma functions, supporting Newton-Raphson optimization (Sklar, 2014).
A significant computational advance is a fast one-pass algorithm for the DCM MLE, which precomputes sufficient statistics in a single data scan and decouples subsequent Newton-Raphson iterations from the dataset size, substantially reducing cost for large $N$ and moderate per-row counts. Empirically, large speedups over the standard method are reported for large sample sizes (Sklar, 2014).
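Sklar (2014) uses Newton-Raphson with closed-form derivatives; as a simpler illustrative alternative, the classic fixed-point iteration for Dirichlet-multinomial MLE (due to Minka; not the algorithm of the paper) monotonically increases the likelihood and, for integer counts, needs only harmonic sums, since $\psi(a + m) - \psi(a) = \sum_{j=0}^{m-1} 1/(a+j)$:

```python
from math import lgamma

def dcm_loglik(data, alpha):
    """DCM log-likelihood over a list of count vectors (constant terms dropped)."""
    a0 = sum(alpha)
    ll = 0.0
    for x in data:
        n = sum(x)
        ll += lgamma(a0) - lgamma(n + a0)
        ll += sum(lgamma(xk + ak) - lgamma(ak) for xk, ak in zip(x, alpha))
    return ll

def fixed_point_step(data, alpha):
    """One Minka-style fixed-point update toward the DCM MLE (illustrative sketch).

    Uses psi(a + m) - psi(a) = sum_{j<m} 1/(a + j), so no digamma is needed.
    """
    a0 = sum(alpha)
    denom = sum(sum(1.0 / (a0 + j) for j in range(sum(x))) for x in data)
    return [ak * sum(sum(1.0 / (ak + j) for j in range(x[k])) for x in data) / denom
            for k, ak in enumerate(alpha)]

data = [(3, 1, 0), (2, 2, 0), (0, 1, 3), (1, 1, 2)]  # toy count vectors
alpha = [1.0, 1.0, 1.0]                              # initial guess
lls = [dcm_loglik(data, alpha)]
for _ in range(50):
    alpha = fixed_point_step(data, alpha)
    lls.append(dcm_loglik(data, alpha))
# The log-likelihood is non-decreasing across iterations
assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))
```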
5. Bayesian Predictive Inference and Online Estimation
The DCM admits closed-form Bayesian predictive rules:
$$P(x_{t+1} = k \mid x_{1:t}) = \frac{t_k + \alpha_k}{t + \alpha_0},$$
with $t_k$ the current count of category $k$ among the first $t$ observations. For high-dimensional sparse-alphabet problems, it is critical to adapt the concentration parameter appropriately. Empirically optimal sparsity-adaptive choices scale the concentration with the number $m$ of distinct categories observed so far, yielding tight, data-dependent redundancy bounds and efficient online algorithms with constant per-symbol cost (Hutter, 2013). This estimator incurs essentially no redundancy for unseen categories, is uniformly effective across alphabet sizes, and is competitive with fully Bayesian subalphabet mixtures.
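The predictive rule is a one-line update. The sketch below (fixed $\alpha$, illustrative values) also checks exchangeability: under the chain rule, the probability of an observation sequence depends only on its final counts, not on the order:

```python
def predictive(counts, alpha, k):
    """P(next symbol = k | counts so far) under the DCM/Polya urn."""
    return (counts[k] + alpha[k]) / (sum(counts) + sum(alpha))

def seq_prob(seq, alpha):
    """Probability of an ordered symbol sequence via the chain rule of predictives."""
    counts = [0] * len(alpha)
    p = 1.0
    for s in seq:
        p *= predictive(counts, alpha, s)
        counts[s] += 1
    return p

alpha = (0.5, 1.0, 2.0)   # illustrative pseudocounts
# Two orderings with the same counts have the same probability (exchangeability)
p1 = seq_prob([0, 1, 0, 2], alpha)
p2 = seq_prob([2, 0, 1, 0], alpha)
assert abs(p1 - p2) < 1e-12
# The predictive probabilities form a proper distribution at every step
assert abs(sum(predictive([3, 1, 0], alpha, k) for k in range(3)) - 1.0) < 1e-12
```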
Desirable properties are summarized as follows:
| Property | Description |
|---|---|
| Online/Sequential | Updates require only the current counts and the concentration parameter |
| Computational Cost | Constant per symbol for basic prediction; higher for the cumulative sums needed in arithmetic coding |
| Redundancy | Essentially zero for never-seen symbols; bounded in terms of observed symbol frequencies, with near-optimal constants |
| Alphabet Independence | Bounds are independent of full base alphabet size |
6. Domain-Specific Applications and Model Embeddings
DCMs are central to adjusting for unobserved heterogeneity or correlation across multinomial samples:
- In forensic genetics, the θ-correction incorporates subpopulation structure and remote ancestry, increasing the probability of homozygosity (joint rare-allele counts) and thereby moderating evidential weight in DNA analysis. For pairs of profiles, the law for counts of a specific allele is a beta-binomial kernel (Tvedebrink et al., 2014).
- In plant ecology, reparametrized DCMs model joint pin-point cover data, with the aggregation parameter $\delta$ encoding spatial clumping; in Bayesian hierarchical models, mean cover parameters are modeled as latent variables or as functions of covariates (Damgaard, 2018). Embedding the DCM in graphical models, particularly junction trees or Bayesian networks, preserves computational tractability by exploiting conditional independence and allows efficient exact inference with smaller maximal cliques after marginalizing frequency nodes (Tvedebrink et al., 2014).
7. Limiting Cases, Special Forms, and Distributional Connections
The DCM encompasses several classic distributions:
- Reduces to the multinomial in the limit $\alpha_0 \to \infty$ with $\pi$ fixed (no overdispersion).
- Reduces to the beta-binomial for two categories ($K = 2$).
- Connects to the negative-multinomial (when the number of trials $n$ is itself random) and to Pólya's urn sampling scheme (inductive reinforcement).
Sampling consistency holds: the marginal over any subcollection of categories is again Dirichlet-multinomial, making the DCM structurally robust under aggregation (Tvedebrink et al., 2014). The DCM’s inherent conjugacy to the multinomial supports straightforward Bayesian updating, facilitating both closed-form and simulation-based inference.
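Sampling consistency can be checked numerically: merging categories 2 and 3 of a three-category DCM yields a two-category DCM (a beta-binomial) whose parameter is the sum of the merged $\alpha$'s. A standard-library sketch with illustrative values:

```python
from math import lgamma, exp

def dcm_pmf(x, alpha):
    """PMF of the Dirichlet-compound-multinomial distribution."""
    n, a0 = sum(x), sum(alpha)
    lp = lgamma(n + 1) - sum(lgamma(xk + 1) for xk in x)
    lp += lgamma(a0) - lgamma(n + a0)
    lp += sum(lgamma(xk + ak) - lgamma(ak) for xk, ak in zip(x, alpha))
    return exp(lp)

n, alpha = 6, (0.7, 1.3, 2.0)   # illustrative values
for x1 in range(n + 1):
    # Marginal P(x1, x2 + x3 = n - x1) obtained by summing the full PMF
    marg = sum(dcm_pmf((x1, x2, n - x1 - x2), alpha) for x2 in range(n - x1 + 1))
    # Two-category DCM with the merged parameter alpha_2 + alpha_3
    merged = dcm_pmf((x1, n - x1), (alpha[0], alpha[1] + alpha[2]))
    assert abs(marg - merged) < 1e-12   # aggregation consistency
```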
References
- (Tvedebrink et al., 2014) The multivariate Dirichlet-multinomial distribution and its application in forensic genetics to adjust for sub-population effects using the θ-correction.
- (Hutter, 2013) Sparse Adaptive Dirichlet-Multinomial-like Processes.
- (Sklar, 2014) Fast MLE Computation for the Dirichlet Multinomial.
- (Damgaard, 2018) The joint distribution of pin-point plant cover data: a reparametrized Dirichlet-multinomial distribution.