Ewens–Pitman Partitions
- Ewens–Pitman partitions are a two-parameter family of exchangeable random partitions that generalize the classical Ewens sampling formula and the Pitman–Yor process.
- They are constructed via predictive models like the Chinese Restaurant Process and stick-breaking representations, providing clear probabilistic and combinatorial insights.
- These partitions underpin practical applications in population genetics, Bayesian nonparametrics, and machine learning through their well-defined asymptotic and large deviation behaviors.
Ewens–Pitman partitions constitute a two-parameter family of exchangeable random partitions over $\mathbb{N}$, determined by a pair $(\alpha, \theta)$ with either $0 \le \alpha < 1$ and $\theta > -\alpha$, or $\alpha < 0$ and $\theta = -m\alpha$ for some $m \in \mathbb{N}$. They interpolate between the classical Ewens sampling formula ($\alpha = 0$) and the two-parameter Poisson–Dirichlet (Pitman–Yor) distribution ($0 < \alpha < 1$), and admit deep connections with Gibbs partitions, stable subordinators, generalized Stirling numbers, compound Poisson representations, and the combinatorics of symmetric groups. Rich probabilistic, asymptotic, and algebraic structures underlie these partitions, yielding both practical statistical tools and theoretical insight into fragmentation, random trees, and Bayesian nonparametrics.
1. Formal Definition and Exchangeable Partition Probability Function
A random partition of $[n] = \{1, \dots, n\}$ into $k$ blocks with sizes $n_1, \dots, n_k$ ($\sum_{j=1}^{k} n_j = n$) is assigned the Ewens–Pitman probability
$$\Pi_n(n_1, \dots, n_k) = \frac{\prod_{i=1}^{k-1} (\theta + i\alpha)}{(\theta + 1)_{n-1}} \prod_{j=1}^{k} (1 - \alpha)_{n_j - 1},$$
where $(x)_m = x (x+1) \cdots (x+m-1)$ is the Pochhammer symbol (rising factorial), and the factor $(1-\alpha)_{n_j - 1}$ encodes multiplicative block-size weights (Greve, 6 Mar 2025, Dolera et al., 2021).
In combinatorial terms, the total probability of an unordered collection of block sizes $\{n_1, \dots, n_k\}$ is obtained by multiplying the expression above by the number of set partitions with those sizes,
$$\frac{n!}{\prod_{j=1}^{k} n_j! \, \prod_{r \ge 1} m_r!},$$
where $m_r$ denotes the number of blocks of size $r$.
The parameters must satisfy $0 \le \alpha < 1$, $\theta > -\alpha$, or $\alpha < 0$, $\theta = -m\alpha$ for some $m \in \mathbb{N}$.
This generalizes the Ewens sampling formula ($\alpha = 0$), for which
$$\Pi_n(n_1, \dots, n_k) = \frac{\theta^{k-1}}{(\theta + 1)_{n-1}} \prod_{j=1}^{k} (n_j - 1)!,$$
and the Poisson–Dirichlet (Pitman–Yor) process for $0 < \alpha < 1$.
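As a concrete check, the Ewens–Pitman partition probability can be evaluated directly and verified to sum to one over all set partitions of a small sample. The sketch below uses only the standard library; the function names are illustrative, not from the cited papers.

```python
from math import prod, isclose

def pochhammer(x, m):
    """Rising factorial (x)_m = x (x+1) ... (x+m-1)."""
    return prod(x + i for i in range(m))

def eppf(sizes, alpha, theta):
    """Ewens-Pitman probability of a set partition with the given block sizes."""
    k, n = len(sizes), sum(sizes)
    num = prod(theta + i * alpha for i in range(1, k))
    return (num / pochhammer(theta + 1, n - 1)) * \
        prod(pochhammer(1 - alpha, s - 1) for s in sizes)

def set_partitions(elems):
    """Enumerate all set partitions of a list, as lists of blocks."""
    if not elems:
        yield []
        return
    first, rest = elems[0], elems[1:]
    for part in set_partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [part[i] + [first]] + part[i + 1:]
        yield part + [[first]]

# The EPPF is a probability distribution on set partitions of [n]:
total = sum(eppf([len(b) for b in p], alpha=0.5, theta=1.0)
            for p in set_partitions(list(range(5))))
print(round(total, 10))  # 1.0
```

Because the EPPF depends on a partition only through its block sizes, the same `eppf` call serves every set partition with a given size profile.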
2. Probabilistic Constructions and Predictive Structure
Ewens–Pitman partitions are equivalently described via the Chinese Restaurant Process (CRP). Given a partial partition of $[n]$ into $k$ blocks with sizes $n_1, \dots, n_k$, the $(n+1)$-st element joins block $j$ with probability
$$\frac{n_j - \alpha}{n + \theta},$$
or initiates a new block with probability
$$\frac{\theta + k\alpha}{n + \theta}.$$
This sequential construction yields exchangeable distributions on partitions and underpins their representation as de Finetti mixtures over random discrete measures, notably the Pitman–Yor process (Greve, 6 Mar 2025, Dolera et al., 2021, Favaro et al., 2014).
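The predictive rule translates directly into a forward sampler. A minimal sketch (illustrative names, not from the cited references) that returns the block sizes of a partition of $[n]$:

```python
import random

def sample_ewens_pitman(n, alpha, theta, seed=0):
    """Sample block sizes of an Ewens-Pitman partition of [n] via the CRP.

    Element m+1 joins an existing block j with probability (n_j - alpha)/(m + theta)
    and opens a new block with probability (theta + k*alpha)/(m + theta).
    """
    rng = random.Random(seed)
    sizes = []
    for m in range(n):
        u = rng.random() * (m + theta)
        if u < theta + len(sizes) * alpha:
            sizes.append(1)          # open a new block
        else:
            u -= theta + len(sizes) * alpha
            for j, s in enumerate(sizes):
                u -= s - alpha
                if u <= 0:
                    sizes[j] += 1    # join existing block j
                    break
            else:
                sizes[-1] += 1       # guard against float round-off
    return sizes

sizes = sample_ewens_pitman(1000, alpha=0.5, theta=1.0)
print(sum(sizes), len(sizes))  # total is 1000; second number is the block count
```

Note that the first element always opens a block, since the new-block probability at $m = 0$ equals $\theta / \theta = 1$.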
In the stick-breaking representation, for $0 < \alpha < 1$, $\theta > -\alpha$, the mass partition has weights
$$P_i = V_i \prod_{l=1}^{i-1} (1 - V_l), \qquad V_i \overset{\mathrm{ind}}{\sim} \mathrm{Beta}(1 - \alpha,\, \theta + i\alpha),$$
and drawing i.i.d. samples from the resulting random measure induces the Ewens–Pitman random partition (Favaro et al., 2016, Ho et al., 2018).
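For intuition, the stick-breaking weights are easy to generate with standard Beta draws. A sketch under the $0 < \alpha < 1$, $\theta > -\alpha$ regime (function name is illustrative):

```python
import random

def stick_breaking_weights(alpha, theta, num, seed=1):
    """First `num` stick-breaking weights P_i = V_i * prod_{l<i}(1 - V_l),
    with V_i ~ Beta(1 - alpha, theta + i*alpha) drawn independently."""
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for i in range(1, num + 1):
        v = rng.betavariate(1 - alpha, theta + i * alpha)
        weights.append(remaining * v)   # mass broken off the remaining stick
        remaining *= 1.0 - v
    return weights

w = stick_breaking_weights(alpha=0.5, theta=1.0, num=200)
print(0 < sum(w) < 1)  # True: a finite truncation captures less than the full mass
```

Sampling i.i.d. indices proportionally to these weights (plus the residual mass) then reproduces the CRP partition in distribution.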
3. Compound Poisson Interpretations
Ewens–Pitman partitions admit an interpretation as mixtures of compound Poisson sampling models (Dolera et al., 2021):
- For $\alpha = 0$ (Ewens), block counts correspond to conditioning the total size of a log-series compound Poisson sample (LS-CPSM).
- For general $\alpha \in (0, 1)$, block counts arise as mixtures over negative-Binomial compound Poisson samples (NB-CPSM), with the mixing variable a product of a Gamma and a scaled Mittag–Leffler (generalized stable) variable.
Specifically, the mixing variable factorizes as a product of an independent Gamma random variable and a polynomially tilted random variable $S_{\alpha,\theta}$ with density proportional to $s^{\theta/\alpha - 1 - 1/\alpha} f_\alpha(s^{-1/\alpha})$ (where $f_\alpha$ is the positive $\alpha$-stable density); under this representation the EP partition law coincides with the NB-CPSM marginal, and the number of blocks concentrates almost surely, $K_n / n^\alpha \to S_{\alpha,\theta}$ as $n \to \infty$ (Dolera et al., 2021).
This compound Poisson approach seamlessly yields asymptotic results, closed-form formulas, and generalizations to Poisson–Kingman partitions.
4. Asymptotics: Laws of Large Numbers, Fluctuations, and Limit Theorems
For fixed $(\alpha, \theta)$, the key scaling regimes are as follows (Contardi et al., 2024, Bercu et al., 2024, Tsukuda, 2020):
- For $\alpha = 0$ (Ewens): $K_n / \log n \to \theta$ almost surely, with Gaussian central limit fluctuations.
- For $0 < \alpha < 1$: $K_n / n^\alpha \to S_{\alpha,\theta}$ almost surely, where $S_{\alpha,\theta}$ is (scaled) $\alpha$-Mittag–Leffler distributed.
- CLT: $n^{\alpha/2} \bigl( K_n / n^\alpha - S_{\alpha,\theta} \bigr)$ converges in distribution to a mixed Gaussian limit (Gaussian conditionally on $S_{\alpha,\theta}$).
- LIL: a law of the iterated logarithm applies to the centered, scaled block counts (Bercu et al., 2024).
- Higher moments: asymptotic expansions are available for higher-order moments of $K_n$ (Tsukuda, 2020).
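Because the probability of opening a new block depends on the current configuration only through the sample size and the number of blocks, the block count $K_n$ is itself a Markov chain, which makes the $n^\alpha$ growth easy to check numerically. A sketch (illustrative names; the 20-replicate average is a rough Monte Carlo estimate):

```python
import random

def block_count(n, alpha, theta, rng):
    """Simulate K_n: at step m a new block opens w.p. (theta + k*alpha)/(m + theta)."""
    k = 0
    for m in range(n):
        if rng.random() * (m + theta) < theta + k * alpha:
            k += 1
    return k

rng = random.Random(42)
alpha, theta, n = 0.5, 1.0, 20_000
ratios = [block_count(n, alpha, theta, rng) / n**alpha for _ in range(20)]
mean = sum(ratios) / len(ratios)
print(mean)  # fluctuates around E[S_{alpha,theta}], roughly 2.26 for these parameters
```

Here the reference value uses $\mathbb{E}[S_{\alpha,\theta}] = \Gamma(\theta+1) / (\alpha\, \Gamma(\theta+\alpha))$, which for $\alpha = 1/2$, $\theta = 1$ is about $2.26$.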
For microclustering applications and scalable settings, scaling $\theta$ linearly with $n$ (i.e., $\theta = \theta_n = \lambda n$ for some $\lambda > 0$) yields a "microclustering" regime in which the number of blocks and the counts of blocks of any fixed size both grow linearly with $n$, while the maximal cluster size remains asymptotically negligible relative to $n$ (Beraha et al., 24 Jul 2025, Contardi et al., 2024).
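In the $\alpha = 0$ case the expected block count has the closed form $\mathbb{E}[K_n] = \sum_{i=0}^{n-1} \theta/(\theta+i)$, which makes the linear growth under $\theta_n = \lambda n$ easy to verify. A sketch (the choice $\lambda = 1/2$ is illustrative):

```python
def expected_blocks_ewens(n, theta):
    """E[K_n] for the Ewens (alpha = 0) partition: sum_{i=0}^{n-1} theta/(theta+i)."""
    return sum(theta / (theta + i) for i in range(n))

lam = 0.5
ratios = {n: expected_blocks_ewens(n, lam * n) / n for n in (100, 1_000, 10_000)}
print(ratios)  # E[K_n]/n stabilizes near lam*log(1 + 1/lam), about 0.549 here
```

The limit follows from the integral comparison $\sum_{i=0}^{n-1} \lambda n / (\lambda n + i) \approx \lambda n \log\bigl((\lambda n + n)/\lambda n\bigr)$.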
5. Large Deviations, Moderate Deviations, and Concentration
- Large deviations: the sequence $(K_n / n)_{n \ge 1}$ satisfies a large deviation principle (LDP) whose rate function is a Legendre–Fenchel transform of a logarithmic moment generating function involving the Mittag–Leffler function (Bercu et al., 9 Mar 2025, Favaro et al., 2014). An explicit sharp concentration inequality describes the probability that $K_n$ deviates from its mean.
- Moderate deviations: intermediate scaling regimes, at speeds between the CLT and LDP scales, yield corresponding rate functions providing precise transition descriptors between the two (Favaro et al., 2016).
- Block frequencies: Analogous large and moderate deviation principles hold for counts of blocks of fixed size (Favaro et al., 2014, Favaro et al., 2016).
- Conditional LDP/MDP: conditioning on partially observed partitions, the deviation rate functions remain unchanged: the impact of the initial sample is asymptotically negligible in large-$n$ and sample-augmentation settings (Favaro et al., 2014, Favaro et al., 2016).
6. Representation Theory and Algebraic Structures
Ewens–Pitman partitions are characterized as non-extreme harmonic functions on the Kingman branching graph (infinite Young lattice) and are tightly linked to the combinatorics of symmetric group characters and interpolation polynomials (Greve, 6 Mar 2025). The partition probabilities admit explicit expansions in terms of Sheffer polynomial sequences and Riordan array sums, yielding effective computational methods for summary statistics, moments, and marginals. For example, the marginal probability of the block count $K_n$, or joint factorial moments of block counts, can be written as closed-form coefficients in generalized Stirling number expansions obtainable via generating function and Riordan array technology.
This algebraic approach both encapsulates the full system of sampling-consistent marginals and facilitates symbolic computations (Greve, 6 Mar 2025).
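Complementing these symbolic expansions, the marginal law of $K_n$ can also be computed exactly by dynamic programming, since under the predictive rule a new block opens with probability $(\theta + k\alpha)/(m + \theta)$ regardless of the block sizes. The sketch below is a numerical cross-check, not the Riordan-array method of the cited paper:

```python
def block_count_distribution(n, alpha, theta):
    """Exact P(K_n = k), k = 0..n, via the Markov chain for the number of blocks."""
    dist = [1.0] + [0.0] * n  # before any observation there are zero blocks
    for m in range(n):
        new = [0.0] * (n + 1)
        for k, p in enumerate(dist):
            if p == 0.0:
                continue
            p_new = (theta + k * alpha) / (m + theta)
            new[k + 1] += p * p_new       # element m+1 opens block k+1
            new[k] += p * (1.0 - p_new)   # element m+1 joins an existing block
        dist = new
    return dist

dist = block_count_distribution(50, alpha=0.0, theta=1.0)
mean = sum(k * p for k, p in enumerate(dist))
harmonic = sum(1.0 / j for j in range(1, 51))
print(abs(mean - harmonic) < 1e-9)  # True: for alpha = 0, theta = 1, E[K_n] = H_n
```

The $\alpha = 0$, $\theta = 1$ check works because the Ewens partition then matches the cycle structure of a uniform permutation, whose expected cycle count is the harmonic number $H_n$.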
7. Applications, Biological and Statistical Significance
- Population genetics: Ewens–Pitman partitions generalize the Ewens sampling formula (ESF) for modeling allelic diversity and mutation structures in finite populations (Giordano et al., 2019).
- Bayesian nonparametrics: partitions induced by the Pitman–Yor process serve as priors for clustering in Dirichlet and stable process mixture models, central in Bayesian statistics and machine learning.
- Entity resolution: microclustering variants (scaling $\theta$ with $n$) underpin scalable clustering and de-duplication/identity resolution with provable guarantees on block size and count growth rates (Beraha et al., 24 Jul 2025).
- Species sampling and discovery probabilities: Tail asymptotics and conditional LDPs enable calculation of discovery probabilities, facilitating design and inference in ecological, genomic, and risk-assessment contexts (Favaro et al., 2014).
- Random trees and fragmentation: Fragmentation and coagulation operations on Ewens–Pitman partitions generate Markov chains and random trees (e.g., continuum random trees), with the scaled block-size limits governed by Mittag–Leffler and stable laws (Ho et al., 2018, Mano, 2013).
8. Summary Table: Core Properties of Ewens–Pitman Partitions
| Property | $\alpha = 0$ (Ewens) | $0 < \alpha < 1$ (Pitman–Yor) |
|---|---|---|
| Block count growth | $K_n / \log n \to \theta$ a.s. | $K_n / n^\alpha \to S_{\alpha,\theta}$ a.s. |
| Block size distribution | weak Dirichlet/multinomial | Power law; Sibuya law |
| Compound Poisson repr. | Log-series mixing (LS-CPSM) | NB-CPSM, mixed by ML law |
| Large deviation rate | Explicit, convex analytic | Mittag–Leffler-based rate function |
| Integrable structure | Stirling/Riordan (binomial) | Generalized Stirling, Riordan |
| Microclustering regime | $\theta_n$ growing linearly in $n$ | $\theta_n$ growing linearly in $n$ |
These properties summarize both the classical and non-standard regimes and their implications for stochastic modeling and asymptotic analysis.
References:
- Compound Poisson representations: (Dolera et al., 2021)
- Large deviation/concentration: (Bercu et al., 9 Mar 2025, Favaro et al., 2014, Favaro et al., 2016)
- Asymptotic CLT/LIL regimes: (Contardi et al., 2024, Bercu et al., 2024)
- Moments and fluctuations: (Tsukuda, 2020)
- Representation theory and Riordan arrays: (Greve, 6 Mar 2025)
- Microclustering and scalable inference: (Beraha et al., 24 Jul 2025)
- Fragmentation/coagulation: (Ho et al., 2018)
- Extreme block sizes: (Mano, 2013)
- Birth–death–immigration embedding: (Giordano et al., 2019)