Ewens–Pitman Partitions
- Ewens–Pitman partitions are a two-parameter family of exchangeable random partitions that generalize the classical Ewens sampling formula and the Pitman–Yor process.
- They are constructed via predictive models like the Chinese Restaurant Process and stick-breaking representations, providing clear probabilistic and combinatorial insights.
- These partitions underpin practical applications in population genetics, Bayesian nonparametrics, and machine learning through their well-defined asymptotic and large deviation behaviors.
Ewens–Pitman partitions constitute a two-parameter family of exchangeable random partitions over $\mathbb{N}$, determined by a pair $(\alpha, \theta)$ with either $0 \le \alpha < 1$ and $\theta > -\alpha$, or $\alpha < 0$ and $\theta = -m\alpha$ for some $m \in \mathbb{N}$. They interpolate between the classical Ewens sampling formula ($\alpha = 0$) and the two-parameter Poisson–Dirichlet (Pitman–Yor) distribution ($0 < \alpha < 1$), and admit deep connections with Gibbs partitions, stable subordinators, generalized Stirling numbers, compound Poisson representations, and the combinatorics of symmetric groups. Rich probabilistic, asymptotic, and algebraic structures underlie these partitions, yielding both practical statistical tools and theoretical insight into fragmentation, random trees, and Bayesian nonparametrics.
1. Formal Definition and Exchangeable Partition Probability Function
A random partition of $[n] = \{1, \dots, n\}$ into $k$ blocks with sizes $n_1, \dots, n_k$ ($\sum_{j=1}^{k} n_j = n$) is assigned the Ewens–Pitman probability
$$\Pi_n(n_1, \dots, n_k) = \frac{\prod_{i=1}^{k-1} (\theta + i\alpha)}{(\theta + 1)_{n-1}} \prod_{j=1}^{k} (1 - \alpha)_{n_j - 1},$$
where $(x)_m = x (x+1) \cdots (x+m-1)$ is the Pochhammer symbol (rising factorial), and the factor $(1-\alpha)_{n_j - 1}$ encodes multiplicative block-size weights (Greve, 6 Mar 2025, Dolera et al., 2021).
In combinatorial terms, the total probability of an unordered collection of block sizes $\{n_1, \dots, n_k\}$ is obtained by multiplying the expression above by the number of set partitions with those sizes,
$$\frac{n!}{\prod_{j=1}^{k} n_j! \, \prod_{r \ge 1} m_r!},$$
where $m_r$ denotes the number of blocks of size $r$.
The parameters must satisfy $0 \le \alpha < 1$, $\theta > -\alpha$, or $\alpha < 0$, $\theta = -m\alpha$ for some $m \in \mathbb{N}$.
This generalizes the Ewens sampling formula ($\alpha = 0$), for which
$$\Pi_n(n_1, \dots, n_k) = \frac{\theta^{k-1}}{(\theta + 1)_{n-1}} \prod_{j=1}^{k} (n_j - 1)!,$$
and the Poisson–Dirichlet (Pitman–Yor) process for $0 < \alpha < 1$.
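As a concrete check, the Ewens–Pitman partition probability can be evaluated directly and verified to sum to one over all set partitions of a small sample. The sketch below uses only the standard library; the function names are illustrative, not from the cited papers.

```python
from math import prod, isclose

def pochhammer(x, m):
    """Rising factorial (x)_m = x (x+1) ... (x+m-1)."""
    return prod(x + i for i in range(m))

def eppf(sizes, alpha, theta):
    """Ewens-Pitman probability of a set partition with the given block sizes."""
    k, n = len(sizes), sum(sizes)
    num = prod(theta + i * alpha for i in range(1, k))
    return (num / pochhammer(theta + 1, n - 1)) * \
        prod(pochhammer(1 - alpha, s - 1) for s in sizes)

def set_partitions(elems):
    """Enumerate all set partitions of a list, as lists of blocks."""
    if not elems:
        yield []
        return
    first, rest = elems[0], elems[1:]
    for part in set_partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [part[i] + [first]] + part[i + 1:]
        yield part + [[first]]

# The EPPF is a probability distribution on set partitions of [n]:
total = sum(eppf([len(b) for b in p], alpha=0.5, theta=1.0)
            for p in set_partitions(list(range(5))))
print(round(total, 10))  # 1.0
```

Because the EPPF depends on a partition only through its block sizes, the same `eppf` call serves every set partition with a given size profile.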
2. Probabilistic Constructions and Predictive Structure
Ewens–Pitman partitions are equivalently described via the Chinese Restaurant Process (CRP). Given a partial partition of $[n]$ into $k$ blocks with sizes $n_1, \dots, n_k$, the $(n+1)$-st element joins block $j$ with probability
$$\frac{n_j - \alpha}{n + \theta},$$
or initiates a new block with probability
$$\frac{\theta + k\alpha}{n + \theta}.$$
This sequential construction yields exchangeable distributions on partitions and underpins their representation as de Finetti mixtures over random discrete measures, notably the Pitman–Yor process (Greve, 6 Mar 2025, Dolera et al., 2021, Favaro et al., 2014).
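The predictive rule translates directly into a forward sampler. A minimal sketch (illustrative names, not from the cited references) that returns the block sizes of a partition of $[n]$:

```python
import random

def sample_ewens_pitman(n, alpha, theta, seed=0):
    """Sample block sizes of an Ewens-Pitman partition of [n] via the CRP.

    Element m+1 joins an existing block j with probability (n_j - alpha)/(m + theta)
    and opens a new block with probability (theta + k*alpha)/(m + theta).
    """
    rng = random.Random(seed)
    sizes = []
    for m in range(n):
        u = rng.random() * (m + theta)
        if u < theta + len(sizes) * alpha:
            sizes.append(1)          # open a new block
        else:
            u -= theta + len(sizes) * alpha
            for j, s in enumerate(sizes):
                u -= s - alpha
                if u <= 0:
                    sizes[j] += 1    # join existing block j
                    break
            else:
                sizes[-1] += 1       # guard against float round-off
    return sizes

sizes = sample_ewens_pitman(1000, alpha=0.5, theta=1.0)
print(sum(sizes), len(sizes))  # total is 1000; second number is the block count
```

Note that the first element always opens a block, since the new-block probability at $m = 0$ equals $\theta / \theta = 1$.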
In the stick-breaking representation, for $0 < \alpha < 1$, $\theta > -\alpha$, the mass partition has weights
$$P_i = V_i \prod_{l=1}^{i-1} (1 - V_l), \qquad V_i \overset{\mathrm{ind}}{\sim} \mathrm{Beta}(1 - \alpha,\, \theta + i\alpha),$$
and drawing i.i.d. samples from the resulting random measure induces the Ewens–Pitman random partition (Favaro et al., 2016, Ho et al., 2018).
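For intuition, the stick-breaking weights are easy to generate with standard Beta draws. A sketch under the $0 < \alpha < 1$, $\theta > -\alpha$ regime (function name is illustrative):

```python
import random

def stick_breaking_weights(alpha, theta, num, seed=1):
    """First `num` stick-breaking weights P_i = V_i * prod_{l<i}(1 - V_l),
    with V_i ~ Beta(1 - alpha, theta + i*alpha) drawn independently."""
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for i in range(1, num + 1):
        v = rng.betavariate(1 - alpha, theta + i * alpha)
        weights.append(remaining * v)   # mass broken off the remaining stick
        remaining *= 1.0 - v
    return weights

w = stick_breaking_weights(alpha=0.5, theta=1.0, num=200)
print(0 < sum(w) < 1)  # True: a finite truncation captures less than the full mass
```

Sampling i.i.d. indices proportionally to these weights (plus the residual mass) then reproduces the CRP partition in distribution.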
3. Compound Poisson Interpretations
Ewens–Pitman partitions admit an interpretation as mixtures of compound Poisson sampling models (Dolera et al., 2021):
- For $\alpha = 0$ (Ewens), block counts correspond to conditioning the total size of a log-series compound Poisson sample (LS-CPSM).
- For general $\alpha \in (0, 1)$, block counts arise as mixtures over negative-Binomial compound Poisson samples (NB-CPSM), with the mixing variable a product of a Gamma and a scaled Mittag–Leffler (generalized stable) variable.
Specifically, the mixing variable factorizes as a product of an independent Gamma random variable and a polynomially tilted random variable $S_{\alpha,\theta}$ with density proportional to $s^{\theta/\alpha - 1 - 1/\alpha} f_\alpha(s^{-1/\alpha})$ (where $f_\alpha$ is the positive $\alpha$-stable density); under this representation the EP partition law coincides with the NB-CPSM marginal, and the number of blocks concentrates almost surely, $K_n / n^\alpha \to S_{\alpha,\theta}$ as $n \to \infty$ (Dolera et al., 2021).
This compound Poisson approach seamlessly yields asymptotic results, closed-form formulas, and generalizations to Poisson–Kingman partitions.
4. Asymptotics: Laws of Large Numbers, Fluctuations, and Limit Theorems
For fixed $(\alpha, \theta)$, the key scaling regimes are as follows (Contardi et al., 2024, Bercu et al., 2024, Tsukuda, 2020):
- For $\alpha = 0$ (Ewens): $K_n / \log n \to \theta$ almost surely, with Gaussian central limit fluctuations.
- For $0 < \alpha < 1$: $K_n / n^\alpha \to S_{\alpha,\theta}$ almost surely, where $S_{\alpha,\theta}$ is (scaled) $\alpha$-Mittag–Leffler distributed.
- CLT: $n^{\alpha/2} \bigl( K_n / n^\alpha - S_{\alpha,\theta} \bigr)$ converges in distribution to a mixed Gaussian limit (Gaussian conditionally on $S_{\alpha,\theta}$).
- LIL: a law of the iterated logarithm applies to the centered, scaled block counts (Bercu et al., 2024).
- Higher moments: asymptotic expansions are available for higher-order moments of $K_n$ (Tsukuda, 2020).
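Because the probability of opening a new block depends on the current configuration only through the sample size and the number of blocks, the block count $K_n$ is itself a Markov chain, which makes the $n^\alpha$ growth easy to check numerically. A sketch (illustrative names; the 20-replicate average is a rough Monte Carlo estimate):

```python
import random

def block_count(n, alpha, theta, rng):
    """Simulate K_n: at step m a new block opens w.p. (theta + k*alpha)/(m + theta)."""
    k = 0
    for m in range(n):
        if rng.random() * (m + theta) < theta + k * alpha:
            k += 1
    return k

rng = random.Random(42)
alpha, theta, n = 0.5, 1.0, 20_000
ratios = [block_count(n, alpha, theta, rng) / n**alpha for _ in range(20)]
mean = sum(ratios) / len(ratios)
print(mean)  # fluctuates around E[S_{alpha,theta}], roughly 2.26 for these parameters
```

Here the reference value uses $\mathbb{E}[S_{\alpha,\theta}] = \Gamma(\theta+1) / (\alpha\, \Gamma(\theta+\alpha))$, which for $\alpha = 1/2$, $\theta = 1$ is about $2.26$.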
For microclustering applications and scalable settings, scaling $\theta$ linearly with $n$ (i.e., $\theta = \theta_n = \lambda n$ for some $\lambda > 0$) yields a "microclustering" regime in which the number of blocks and the counts of blocks of any fixed size both grow linearly with $n$, while the maximal cluster size remains asymptotically negligible relative to $n$ (Beraha et al., 24 Jul 2025, Contardi et al., 2024).
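In the $\alpha = 0$ case the expected block count has the closed form $\mathbb{E}[K_n] = \sum_{i=0}^{n-1} \theta/(\theta+i)$, which makes the linear growth under $\theta_n = \lambda n$ easy to verify. A sketch (the choice $\lambda = 1/2$ is illustrative):

```python
def expected_blocks_ewens(n, theta):
    """E[K_n] for the Ewens (alpha = 0) partition: sum_{i=0}^{n-1} theta/(theta+i)."""
    return sum(theta / (theta + i) for i in range(n))

lam = 0.5
ratios = {n: expected_blocks_ewens(n, lam * n) / n for n in (100, 1_000, 10_000)}
print(ratios)  # E[K_n]/n stabilizes near lam*log(1 + 1/lam), about 0.549 here
```

The limit follows from the integral comparison $\sum_{i=0}^{n-1} \lambda n / (\lambda n + i) \approx \lambda n \log\bigl((\lambda n + n)/\lambda n\bigr)$.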
5. Large Deviations, Moderate Deviations, and Concentration
- Large deviations: the sequence $(K_n / n)_{n \ge 1}$ satisfies a large deviation principle (LDP) whose rate function is a Legendre–Fenchel transform of a logarithmic moment generating function involving the Mittag–Leffler function (Bercu et al., 9 Mar 2025, Favaro et al., 2014). An explicit sharp concentration inequality describes the probability that $K_n$ deviates from its mean.
- Moderate deviations: intermediate scaling regimes, at speeds between the CLT and LDP scales, yield corresponding rate functions providing precise transition descriptors between the two (Favaro et al., 2016).
- Block frequencies: Analogous large and moderate deviation principles hold for counts of blocks of fixed size (Favaro et al., 2014, Favaro et al., 2016).
- Conditional LDP/MDP: conditioning on partially observed partitions, the deviation rate functions remain unchanged: the impact of the initial sample is asymptotically negligible in large-$n$ and sample-augmentation settings (Favaro et al., 2014, Favaro et al., 2016).
6. Representation Theory and Algebraic Structures
Ewens–Pitman partitions are characterized as non-extreme harmonic functions on the Kingman branching graph (infinite Young lattice) and are tightly linked to the combinatorics of symmetric group characters and interpolation polynomials (Greve, 6 Mar 2025). The partition probabilities admit explicit expansions in terms of Sheffer polynomial sequences and Riordan array sums, yielding effective computational methods for summary statistics, moments, and marginals. For example, the marginal probability of the block count $K_n$, or joint factorial moments of block counts, can be written as closed-form coefficients in generalized Stirling number expansions obtainable via generating function and Riordan array technology.
This algebraic approach both encapsulates the full system of sampling-consistent marginals and facilitates symbolic computations (Greve, 6 Mar 2025).
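Complementing these symbolic expansions, the marginal law of $K_n$ can also be computed exactly by dynamic programming, since under the predictive rule a new block opens with probability $(\theta + k\alpha)/(m + \theta)$ regardless of the block sizes. The sketch below is a numerical cross-check, not the Riordan-array method of the cited paper:

```python
def block_count_distribution(n, alpha, theta):
    """Exact P(K_n = k), k = 0..n, via the Markov chain for the number of blocks."""
    dist = [1.0] + [0.0] * n  # before any observation there are zero blocks
    for m in range(n):
        new = [0.0] * (n + 1)
        for k, p in enumerate(dist):
            if p == 0.0:
                continue
            p_new = (theta + k * alpha) / (m + theta)
            new[k + 1] += p * p_new       # element m+1 opens block k+1
            new[k] += p * (1.0 - p_new)   # element m+1 joins an existing block
        dist = new
    return dist

dist = block_count_distribution(50, alpha=0.0, theta=1.0)
mean = sum(k * p for k, p in enumerate(dist))
harmonic = sum(1.0 / j for j in range(1, 51))
print(abs(mean - harmonic) < 1e-9)  # True: for alpha = 0, theta = 1, E[K_n] = H_n
```

The $\alpha = 0$, $\theta = 1$ check works because the Ewens partition then matches the cycle structure of a uniform permutation, whose expected cycle count is the harmonic number $H_n$.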
7. Applications, Biological and Statistical Significance
- Population genetics: Ewens–Pitman partitions generalize the Ewens sampling formula (ESF) for modeling allelic diversity and mutation structures in finite populations (Giordano et al., 2019).
- Bayesian nonparametrics: partitions induced by the Pitman–Yor process serve as priors for clustering in Dirichlet and stable process mixture models, central in Bayesian statistics and machine learning.
- Entity resolution: microclustering variants (scaling $\theta$ with $n$) underpin scalable clustering and de-duplication/identity resolution with provable guarantees on block size and count growth rates (Beraha et al., 24 Jul 2025).
- Species sampling and discovery probabilities: Tail asymptotics and conditional LDPs enable calculation of discovery probabilities, facilitating design and inference in ecological, genomic, and risk-assessment contexts (Favaro et al., 2014).
- Random trees and fragmentation: Fragmentation and coagulation operations on Ewens–Pitman partitions generate Markov chains and random trees (e.g., continuum random trees), with the scaled block-size limits governed by Mittag–Leffler and stable laws (Ho et al., 2018, Mano, 2013).
8. Summary Table: Core Properties of Ewens–Pitman Partitions
| Property | $\alpha = 0$ (Ewens) | $0 < \alpha < 1$ (Pitman–Yor) |
|---|---|---|
| Block count growth | $K_n / \log n \to \theta$ a.s. | $K_n / n^\alpha \to S_{\alpha,\theta}$ a.s. |
| Block size distribution | weak Dirichlet/multinomial | Power law; Sibuya law |
| Compound Poisson repr. | Log-series mixing (LS-CPSM) | NB-CPSM, mixed by ML law |
| Large deviation rate | Explicit, convex analytic | Mittag–Leffler-based rate function |
| Integrable structure | Stirling/Riordan (binomial) | Generalized Stirling, Riordan |
| Microclustering regime | $\theta_n$ growing linearly in $n$ | $\theta_n$ growing linearly in $n$ |
These properties summarize both the classical and non-standard regimes and their implications for stochastic modeling and asymptotic analysis.
References:
- Compound Poisson representations: (Dolera et al., 2021)
- Large deviation/concentration: (Bercu et al., 9 Mar 2025, Favaro et al., 2014, Favaro et al., 2016)
- Asymptotic CLT/LIL regimes: (Contardi et al., 2024, Bercu et al., 2024)
- Moments and fluctuations: (Tsukuda, 2020)
- Representation theory and Riordan arrays: (Greve, 6 Mar 2025)
- Microclustering and scalable inference: (Beraha et al., 24 Jul 2025)
- Fragmentation/coagulation: (Ho et al., 2018)
- Extreme block sizes: (Mano, 2013)
- Birth–death–immigration embedding: (Giordano et al., 2019)