
Dirichlet-Compound-Multinomial Distribution

Updated 12 January 2026
  • The Dirichlet-Compound-Multinomial distribution is a hierarchical model that integrates a Dirichlet prior with a multinomial likelihood to account for overdispersion and positive correlation in count data.
  • It provides closed-form expressions for the marginal PMF, moments, and covariances, enabling tractable Bayesian inference and efficient maximum likelihood estimation.
  • Applications span forensic genetics, machine learning, and ecology, with uses in online prediction and hierarchical modeling to manage unobserved heterogeneity.

The Dirichlet-Compound-Multinomial (DCM) distribution, also known as the Dirichlet-multinomial or Pólya distribution, is a fundamental discrete multivariate model describing overdispersed multinomial counts. It arises as the marginal distribution induced by integrating a multinomial likelihood against a Dirichlet prior on the category probabilities. DCMs generalize the multinomial model, enabling modeling of positive correlations and unobserved heterogeneity (e.g., in allele frequencies or word proportions). DCMs are extensively used in Bayesian inference, information theory, forensic genetics, plant ecology, and machine learning, with applications ranging from DNA mixture analysis to online sequence estimation (Tvedebrink et al., 2014; Hutter, 2013; Damgaard, 2018; Sklar, 2014).

1. Construction, Hierarchical Model, and Closed Form

The DCM distribution is defined via a two-stage hierarchical process. For $m$ categories and $n$ trials, let $p = (p_1, \ldots, p_m)$ denote the vector of latent category probabilities and $X = (X_1, \ldots, X_m)$ the observed counts, with $\sum_i X_i = n$. The hierarchical specification is:

  • $p \sim \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_m)$, with all $\alpha_i > 0$.
  • Conditional on $p$, $X \mid p \sim \mathrm{Multinomial}_n(p)$.

The marginal PMF, after integrating out $p$, is

$$P(X_1 = k_1, \ldots, X_m = k_m) = \frac{n!}{\prod_{i=1}^m k_i!} \cdot \frac{\Gamma(\alpha_0)}{\Gamma(n + \alpha_0)} \prod_{i=1}^m \frac{\Gamma(k_i + \alpha_i)}{\Gamma(\alpha_i)}$$

where $\alpha_0 := \sum_{i=1}^m \alpha_i$ and the $k_i$ are non-negative integers summing to $n$ (Tvedebrink et al., 2014; Damgaard, 2018). This defines an exchangeable law across the categories and generalizes the beta-binomial in the $m=2$ case. The DCM is also known as the Pólya urn distribution due to its equivalence with sampling under reinforcement.
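The closed-form PMF above can be evaluated stably in log space with `gammaln`; a minimal sketch using NumPy and SciPy (the function name `dcm_logpmf` is ours):

```python
import numpy as np
from scipy.special import gammaln

def dcm_logpmf(k, alpha):
    """Log-PMF of the Dirichlet-compound-multinomial at count vector k."""
    k = np.asarray(k, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    n, a0 = k.sum(), alpha.sum()
    log_coef = gammaln(n + 1) - gammaln(k + 1).sum()        # n! / prod_i k_i!
    log_ratio = gammaln(a0) - gammaln(n + a0)               # Gamma(a0) / Gamma(n + a0)
    log_cats = (gammaln(k + alpha) - gammaln(alpha)).sum()  # per-category Gamma factors
    return log_coef + log_ratio + log_cats

# sanity check: the PMF sums to 1 over all count vectors with n = 3, m = 2
alpha = np.array([0.5, 1.5])
total = sum(np.exp(dcm_logpmf([k, 3 - k], alpha)) for k in range(4))
print(total)  # -> 1.0 (up to floating point)
```

Working in log space avoids overflow of the Gamma functions even for large counts or large pseudocounts.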

2. Moments, Overdispersion, and Parametric Interpretation

The moments illustrate the “extra-multinomial” correlation induced by Dirichlet mixing. Setting $q_i = \alpha_i/\alpha_0$ and $\theta = 1/(1+\alpha_0)$, the principal moments are:

  • $\mathbb{E}[X_i] = n q_i$
  • $\mathrm{Var}[X_i] = n q_i (1-q_i)[1 + (n-1)\theta]$
  • $\mathrm{Cov}[X_i, X_j] = -n q_i q_j [1 + (n-1)\theta]$ for $i \neq j$

As $\alpha_0 \to \infty$ (equivalently $\theta \to 0$), the DCM contracts to the ordinary multinomial, recovering classical multinomial inference; as $\alpha_0 \to 0^+$, overdispersion becomes maximal and all counts cluster in a single category (Tvedebrink et al., 2014; Damgaard, 2018). The $\alpha_i$ may be interpreted as pseudocounts, with the total $\alpha_0$ controlling the strength of overdispersion.

Table: Moment Formulas for DCM Counts

| Statistic | Formula | Notes |
| --- | --- | --- |
| $\mathbb{E}[X_i]$ | $n q_i = n \alpha_i / \alpha_0$ | Prior mean times $n$ |
| $\mathrm{Var}[X_i]$ | $n q_i(1-q_i)[1 + (n-1)\theta]$ | Overdispersion factor $1+(n-1)\theta$ |
| $\mathrm{Cov}[X_i,X_j]$, $i \neq j$ | $-n q_i q_j [1 + (n-1)\theta]$ | Negative covariance |
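As a sanity check on these formulas, one can simulate the two-stage hierarchy and compare empirical moments against the table; a sketch using NumPy (the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 3.0])
n = 10
a0 = alpha.sum()
q = alpha / a0
theta = 1.0 / (1.0 + a0)

# theoretical moments from the table
mean_th = n * q
var_th = n * q * (1 - q) * (1 + (n - 1) * theta)

# simulate the hierarchy: p ~ Dirichlet(alpha), then X | p ~ Multinomial(n, p)
P = rng.dirichlet(alpha, size=20000)
X = np.array([rng.multinomial(n, p) for p in P])
print(X.mean(axis=0), mean_th)
print(X.var(axis=0), var_th)
```

The empirical variances exceed the plain multinomial value $n q_i(1-q_i)$ by roughly the factor $1+(n-1)\theta$, which is the overdispersion the table describes.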

3. Multivariate Generalization and Hierarchical Models

The multivariate Dirichlet-multinomial (MDM) generalizes the DCM to multiple related multinomial draws sharing a common or hierarchically structured set of probabilities. In forensic genetics, this structure arises when modeling counts of alleles across multiple contributors, all drawing from a latent allele frequency vector:

$$P(\{n_{i,a}\}) = \prod_{i=1}^I \binom{n_{i,\bullet}}{n_{i,1}, \ldots, n_{i,A}} \cdot \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_0 + N)} \prod_{a=1}^{A} \frac{\Gamma(\alpha_a + n_{\bullet, a})}{\Gamma(\alpha_a)}$$

with $N = \sum_{i,a} n_{i,a}$ and $n_{\bullet,a} = \sum_i n_{i,a}$ (Tvedebrink et al., 2014). Marginals and conditionals of the MDM law retain DCM structure, facilitating tractable computations within Bayesian networks and hierarchical models (Tvedebrink et al., 2014; Damgaard, 2018).

Reparametrizations, such as $(\boldsymbol{\mu}, \delta)$ (mean vector and aggregation), allow direct modeling of mean effects and overdispersion: $\alpha_i = \frac{1-\delta}{\delta} \mu_i$ with $\sum_i \mu_i = 1$ and $0 < \delta < 1$, and

$$\mathrm{Var}(Y_i) = N \mu_i(1 - \mu_i) [1 + (N-1) \delta]$$

enabling use in hierarchical and covariate models as in plant cover applications (Damgaard, 2018).
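The $(\boldsymbol{\mu}, \delta)$ parametrization maps to pseudocounts and back in closed form; a small sketch (helper names are ours):

```python
import numpy as np

def mu_delta_to_alpha(mu, delta):
    """Map (mu, delta) to Dirichlet pseudocounts: alpha_i = (1-delta)/delta * mu_i."""
    mu = np.asarray(mu, dtype=float)
    return (1.0 - delta) / delta * mu

def alpha_to_mu_delta(alpha):
    """Inverse map; note that delta coincides with theta = 1/(1 + alpha_0)."""
    alpha = np.asarray(alpha, dtype=float)
    a0 = alpha.sum()
    return alpha / a0, 1.0 / (1.0 + a0)

# round trip: the maps are exact inverses of one another
alpha = mu_delta_to_alpha([0.2, 0.3, 0.5], 0.25)
mu, delta = alpha_to_mu_delta(alpha)
print(alpha)      # -> [0.6 0.9 1.5]
print(mu, delta)  # -> [0.2 0.3 0.5] 0.25
```

Since $\alpha_0 = (1-\delta)/\delta$, the aggregation parameter $\delta$ is exactly the overdispersion parameter $\theta$ from Section 2, which is why the variance formulas have the same shape.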

4. Parameter Estimation and Computational Methods

For parameter estimation, maximum likelihood is typically employed, with the log-likelihood across $M$ independent DCM draws (count vectors $x^{(n)}$, each with total count $N$) expressed as:

$$L(\alpha) = \sum_{n=1}^M \left[\sum_{k=1}^K \log \Gamma(x_{n,k} + \alpha_k) - \log\Gamma(N + A) + \log\Gamma(A) - \sum_{k=1}^K \log\Gamma(\alpha_k)\right]$$

where $A = \sum_k \alpha_k$ and the multinomial coefficients, which do not depend on $\alpha$, are omitted. Gradients and Hessians are available in closed form using digamma and trigamma functions, supporting Newton-Raphson optimization (Sklar, 2014).
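The log-likelihood and its digamma gradient can be coded directly; the sketch below optimizes over $\log \alpha$ to keep the pseudocounts positive and, as an assumption of ours, uses a generic quasi-Newton solver from SciPy rather than the hand-rolled Newton-Raphson of the text:

```python
import numpy as np
from scipy.special import gammaln, psi
from scipy.optimize import minimize

def dcm_negloglik_and_grad(log_alpha, X):
    """Negative DCM log-likelihood (alpha-dependent terms only) and gradient.

    The gradient uses the digamma function psi; the trailing factor alpha
    is the chain rule for the log-parametrization.
    """
    alpha = np.exp(log_alpha)
    A = alpha.sum()
    M = X.shape[0]
    N = X.sum(axis=1)
    ll = (gammaln(X + alpha).sum()
          - gammaln(N + A).sum()
          + M * gammaln(A)
          - M * gammaln(alpha).sum())
    grad = (psi(X + alpha).sum(axis=0)
            - M * psi(alpha)
            - psi(N + A).sum()
            + M * psi(A))
    return -ll, -grad * alpha

# recover a known alpha from synthetic DCM data
rng = np.random.default_rng(1)
true_alpha = np.array([2.0, 5.0, 1.0])
P = rng.dirichlet(true_alpha, size=2000)
X = np.array([rng.multinomial(50, p) for p in P])
res = minimize(dcm_negloglik_and_grad, x0=np.zeros(3), args=(X,),
               jac=True, method="L-BFGS-B")
alpha_hat = np.exp(res.x)
print(alpha_hat)  # close to [2, 5, 1]
```

Supplying the exact gradient (`jac=True`) is what makes the optimization converge in a handful of iterations, mirroring the role of the closed-form digamma derivatives in the Newton-Raphson scheme.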

A significant computational advance is a fast one-pass algorithm for the DCM MLE, which precomputes sufficient statistics $(U, v)$ in a single data scan and decouples subsequent Newton-Raphson iterations from the dataset size, substantially reducing computational cost for large sample sizes with moderate per-row counts. Empirically, speedups of $10^2$–$10^3\times$ over the standard method are possible for large sample sizes (Sklar, 2014).
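The one-pass idea can be sketched as follows: since $\log\Gamma(x+\alpha) - \log\Gamma(\alpha) = \sum_{j=0}^{x-1}\log(\alpha+j)$, the $\alpha$-dependent likelihood needs only tail counts of the data. The helper names below are ours, and the details of the actual algorithm in Sklar (2014) may differ:

```python
import numpy as np
from scipy.special import gammaln

def dcm_sufficient_stats(X):
    """One-pass tail-count statistics for the DCM log-likelihood.

    U[k, j] = number of rows with x_{n,k} > j, and
    v[j]    = number of rows with row total N_n > j.
    """
    X = np.asarray(X)
    N = X.sum(axis=1)

    def tail(c, L):
        # tail counts from a single bincount: t[j] = #elements >= j
        t = np.bincount(c, minlength=L)[::-1].cumsum()[::-1]
        return t[1:]

    U = np.array([tail(col, X.max() + 1) for col in X.T])
    v = tail(N, N.max() + 1)
    return U, v

def dcm_loglik_from_stats(alpha, U, v):
    """Evaluate the likelihood from (U, v) alone, independent of dataset size."""
    A = alpha.sum()
    ll = sum(np.sum(U[k] * np.log(alpha[k] + np.arange(U.shape[1])))
             for k in range(len(alpha)))
    return ll - np.sum(v * np.log(A + np.arange(len(v))))

# agrees with the direct gammaln evaluation (multinomial coefficients dropped)
rng = np.random.default_rng(2)
X = rng.integers(0, 8, size=(1000, 4))
alpha = np.array([0.7, 1.2, 2.0, 0.4])
U, v = dcm_sufficient_stats(X)
direct = (gammaln(X + alpha).sum() - len(X) * gammaln(alpha).sum()
          - gammaln(X.sum(axis=1) + alpha.sum()).sum()
          + len(X) * gammaln(alpha.sum()))
print(np.isclose(dcm_loglik_from_stats(alpha, U, v), direct))  # -> True
```

After the single scan that builds $(U, v)$, every likelihood (or gradient) evaluation costs $O(K \cdot \max x)$ rather than $O(M K)$, which is the source of the reported speedups.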

5. Bayesian Predictive Inference and Online Estimation

The DCM admits closed-form Bayesian predictive rules:

$$P(x_{t+1} = i \mid x_{1:t}, \boldsymbol{\alpha}) = \frac{n_i^t + \alpha_i}{t + \alpha_+}$$

with $n_i^t$ the current count of category $i$ and $\alpha_+ = \sum_i \alpha_i$. For high-dimensional sparse-alphabet problems, it is critical to adapt the concentration parameter appropriately. Empirically optimal sparsity-adaptive choices set $\alpha_+ = m / (2 \ln[(n+1)/m])$, with $m$ denoting the number of distinct categories observed so far, yielding tight, data-dependent redundancy bounds and efficient ($O(1)$ per-symbol) online algorithms (Hutter, 2013). This estimator ensures zero redundancy for unseen categories, uniform effectiveness across a wide range of $m$, and competitiveness with fully Bayesian subalphabet mixtures.
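The predictive rule and the sparsity-adaptive choice of $\alpha_+$ are one-liners; a sketch with symmetric pseudocounts chosen purely for illustration (the exact per-symbol prior in Hutter's construction differs in detail):

```python
import math
from collections import Counter

def predictive_prob(counts, t, alpha, i):
    """Closed-form DCM posterior predictive P(x_{t+1} = i | x_{1:t})."""
    return (counts[i] + alpha[i]) / (t + sum(alpha.values()))

def adaptive_total_concentration(n, m):
    """Sparsity-adaptive alpha_+ = m / (2 ln((n+1)/m)); requires n + 1 > m."""
    return m / (2 * math.log((n + 1) / m))

# online pass: at every step the predictive distribution sums to 1
seq = "abracadabra"
alpha = {c: 0.5 for c in set(seq)}  # symmetric pseudocounts (illustrative)
counts = Counter({c: 0 for c in alpha})
for t, c in enumerate(seq):
    probs = {i: predictive_prob(counts, t, alpha, i) for i in alpha}
    assert abs(sum(probs.values()) - 1.0) < 1e-12
    counts[c] += 1

print(adaptive_total_concentration(n=100, m=5))
```

Updating only the running counts after each symbol is what gives the $O(1)$ per-symbol cost noted below.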

Desirable properties are summarized as follows:

| Property | Description |
| --- | --- |
| Online/sequential | Updates require only current counts and $m$ |
| Computational cost | $O(1)$ per symbol for basic prediction; $O(\log D)$ for arithmetic-coding sums |
| Redundancy | Zero for never-seen symbols; bounded for symbols of finite frequency; optimal up to constants |
| Alphabet independence | Bounds are independent of the full base alphabet size $D$ |

6. Domain-Specific Applications and Model Embeddings

DCMs are central to adjusting for unobserved heterogeneity or correlation across multinomial samples:

  • In forensic genetics, the $\theta$-correction ($\theta = 1/(1+\alpha_0)$) incorporates subpopulation structure and remote ancestry, increasing the probability of homozygosity (joint rare-allele counts) and thereby moderating evidential weight in DNA analysis. For pairs of profiles, the law for a specific allele is a beta-binomial kernel (Tvedebrink et al., 2014).
  • In plant ecology, reparameterized DCMs model joint pin-point cover data, with the aggregation parameter encoding spatial clumping; in Bayesian hierarchical models, mean cover parameters $\boldsymbol{\mu}$ are modeled as latent variables or as functions of covariates (Damgaard, 2018). Embedding the DCM in graphical models, particularly junction trees or Bayesian networks, preserves computational tractability by exploiting conditional independence and allows efficient exact inference with smaller maximal cliques after marginalizing frequency nodes (Tvedebrink et al., 2014).

7. Limiting Cases, Special Forms, and Distributional Connections

The DCM encompasses several classic distributions:

  • It reduces to the multinomial for $\alpha_0 \rightarrow \infty$ (no overdispersion).
  • It reduces to the beta-binomial for two categories.
  • It connects to the negative multinomial (when $n$ is itself random) and to Pólya's urn sampling scheme (inductive reinforcement).
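The beta-binomial reduction can be checked numerically against SciPy's `betabinom` (the helper name `dcm_pmf` is ours; it evaluates the Section 1 closed form):

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import betabinom

def dcm_pmf(k, alpha):
    """DCM PMF via log-gamma (the closed form from Section 1)."""
    k, alpha = np.asarray(k, float), np.asarray(alpha, float)
    n, a0 = k.sum(), alpha.sum()
    return np.exp(gammaln(n + 1) - gammaln(k + 1).sum()
                  + gammaln(a0) - gammaln(n + a0)
                  + (gammaln(k + alpha) - gammaln(alpha)).sum())

# with two categories, DCM(k, n - k; a, b) matches BetaBinomial(n, a, b) at k
n, a, b = 12, 0.8, 2.5
dcm = np.array([dcm_pmf([k, n - k], [a, b]) for k in range(n + 1)])
bb = betabinom.pmf(np.arange(n + 1), n, a, b)
print(np.allclose(dcm, bb))  # -> True
```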

Sampling consistency holds: the marginal over any subcollection of categories is again Dirichlet-multinomial, making the DCM structurally robust under aggregation (Tvedebrink et al., 2014). The conjugacy of the Dirichlet prior to the multinomial likelihood supports straightforward Bayesian updating, facilitating both closed-form and simulation-based inference.

References

  • (Tvedebrink et al., 2014) The multivariate Dirichlet-multinomial distribution and its application in forensic genetics to adjust for sub-population effects using the θ-correction.
  • (Hutter, 2013) Sparse Adaptive Dirichlet-Multinomial-like Processes.
  • (Sklar, 2014) Fast MLE Computation for the Dirichlet Multinomial.
  • (Damgaard, 2018) The joint distribution of pin-point plant cover data: a reparametrized Dirichlet–multinomial distribution.
