Minimum Wasserstein Distance Estimation
- The framework leverages the Wasserstein distance as a loss function to measure discrepancies between probability distributions under minimal moment assumptions.
- It integrates geometric complexity via metric entropy and moment conditions to establish precise minimax rates of convergence.
- The empirical estimator is proven minimax optimal, highlighting the inherent limits of transport-based statistical methods.
The minimum Wasserstein distance estimation framework centers on the use of the Wasserstein metric from optimal transport theory as a statistical loss for estimating probability distributions, parameters, or mixing measures. This approach quantifies the discrepancy between an empirical distribution (observed data) and an estimator, model, or class, and establishes minimax rates of convergence, general lower and upper bounds, and principled algorithmic solutions for a broad array of settings. The framework applies to arbitrary metric sample spaces, requires only weak moment conditions, and leverages geometric and metric-entropy properties to characterize statistical difficulty and estimator performance.
1. Mathematical Framework and Minimax Risk
Let $(\mathcal{X}, \rho)$ be a complete separable metric space and $\mathcal{P}$ a model class of Borel probability measures on $\mathcal{X}$. The $p$-Wasserstein distance between $P, Q \in \mathcal{P}$ is defined as

$$W_p(P, Q) = \left( \inf_{\pi \in \Pi(P, Q)} \int_{\mathcal{X} \times \mathcal{X}} \rho(x, y)^p \, \mathrm{d}\pi(x, y) \right)^{1/p},$$

where $\Pi(P, Q)$ is the set of all couplings (joint distributions on $\mathcal{X} \times \mathcal{X}$) with $P$ and $Q$ as marginals. An estimator $\hat{P}_n$ is any measurable map from samples $X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} P$ to $\mathcal{P}$. The $W_p$ risk is

$$R_n(\hat{P}_n, P) = \mathbb{E}\big[ W_p^p(\hat{P}_n, P) \big],$$

and the minimax risk over $\mathcal{P}$ is

$$\mathcal{M}_n(\mathcal{P}) = \inf_{\hat{P}_n} \sup_{P \in \mathcal{P}} \mathbb{E}\big[ W_p^p(\hat{P}_n, P) \big].$$
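To make the coupling definition concrete, the sketch below computes $W_p$ exactly for two finitely supported measures by solving the coupling linear program directly. This is our own minimal illustration using `scipy.optimize.linprog`; the helper name `wasserstein_p` is ours, not from the paper, and for real workloads a dedicated OT solver would be preferable.

```python
# Exact W_p between finitely supported measures via the coupling LP:
# minimize <cost, pi> subject to pi having the prescribed marginals.
import numpy as np
from scipy.optimize import linprog

def wasserstein_p(x, y, a, b, p=1):
    """W_p between sum_i a_i delta_{x_i} and sum_j b_j delta_{y_j} on the
    real line (here rho(x, y) = |x - y|; any metric would do)."""
    n, m = len(x), len(y)
    cost = np.abs(x[:, None] - y[None, :]) ** p   # rho(x_i, y_j)^p
    # Equality constraints: each row of pi sums to a_i, each column to b_j.
    A_eq = []
    for i in range(n):
        row = np.zeros((n, m)); row[i, :] = 1.0
        A_eq.append(row.ravel())
    for j in range(m):
        col = np.zeros((n, m)); col[:, j] = 1.0
        A_eq.append(col.ravel())
    res = linprog(cost.ravel(), A_eq=np.array(A_eq),
                  b_eq=np.concatenate([a, b]), bounds=(0, None),
                  method="highs")
    return res.fun ** (1.0 / p)

x = np.array([0.0, 1.0]); a = np.array([0.5, 0.5])
y = np.array([0.0, 1.0]); b = np.array([1.0, 0.0])
print(wasserstein_p(x, y, a, b, p=1))  # half the mass moves distance 1 -> 0.5
```

The LP has one redundant constraint (both marginals sum to one), which the HiGHS solver handles without special treatment.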
Analysis of $\mathcal{M}_n(\mathcal{P})$ combines geometric complexity (metric entropy) properties of $\mathcal{X}$ with weak tail (moment) conditions on the model class. The core complexity measures are the $\varepsilon$-covering number $N(\mathcal{X}, \varepsilon)$ and the $\varepsilon$-packing number $M(\mathcal{X}, \varepsilon)$:
- $N(\mathcal{X}, \varepsilon)$ is the minimal cardinality of a partition of $\mathcal{X}$ into sets of diameter at most $\varepsilon$;
- $M(\mathcal{X}, \varepsilon)$ is the maximal number of points in $\mathcal{X}$ with pairwise distances at least $\varepsilon$.
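For a finite point cloud the packing number is easy to approximate: a single greedy pass that keeps any point at distance at least $\varepsilon$ from all previously kept points produces a maximal $\varepsilon$-separated set. A minimal sketch (the helper `packing_number` is ours; greedy maximality yields a valid $\varepsilon$-packing, not necessarily the maximum-cardinality one):

```python
# Greedy lower estimate of the packing number M(X, eps) for a point cloud.
import numpy as np

def packing_number(points, eps):
    """Size of a greedily built maximal eps-separated subset of `points`."""
    centers = []
    for x in points:
        if all(np.linalg.norm(x - c) >= eps for c in centers):
            centers.append(x)
    return len(centers)

rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 1.0, size=(500, 2))   # samples from the unit square
for eps in (0.5, 0.25, 0.125):
    print(eps, packing_number(pts, eps))     # grows roughly like eps^{-2}
```

The observed growth as $\varepsilon$ shrinks reflects the effective dimension $d = 2$ of the square, which is exactly the quantity the entropy terms below track.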
For unbounded spaces, finite-moment conditions are required. Define, for $q > p$ and a reference point $x_0 \in \mathcal{X}$, the $q$-th moment

$$m_q(P) = \mathbb{E}_{X \sim P}\big[ \rho(X, x_0)^q \big].$$
2. Upper and Lower Bounds on Estimation Error
Upper Bound: The empirical measure $\hat{P}_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i}$ achieves

$$\mathbb{E}\big[ W_p^p(P, \hat{P}_n) \big] \le C \left( m_q(P)\, r^{p-q} + \sum_{k=0}^{K} \varepsilon_k^p \sqrt{\frac{N(B_r(x_0), \varepsilon_k)}{n}} \right)$$

for any radius $r > 0$, any decreasing sequence $\varepsilon_0 \ge \cdots \ge \varepsilon_K > 0$, and any integer $K$, where $C$ depends only on $p$ and $q$. The first term ("tail term") addresses mass outside the bounded region $B_r(x_0)$, controlled by Markov's inequality; the second ("entropy term") quantifies the complexity of the support at scales down to $\varepsilon_K$.
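The decay of the empirical measure's risk can be checked numerically. The simulation below is our own sketch: it uses SciPy's closed-form one-dimensional $W_1$ to estimate $\mathbb{E}[W_1(P, \hat{P}_n)]$ for $P = \mathrm{Uniform}[0, 1]$, a case with $d = 1 < 2p$ where the parametric rate $n^{-1/2}$ applies (the helper `w1_risk` and the grid discretization of $P$ are ours).

```python
# Monte Carlo estimate of E[W_1(P, P_n)] for P = Uniform[0, 1].
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)

def w1_risk(n, reps=200):
    """Average W_1 between n-sample empirical measures and a fine
    discretization of Uniform[0, 1] (grid error ~ 2.5e-4, negligible here)."""
    grid = np.linspace(0.0, 1.0, 2001)
    return np.mean([wasserstein_distance(rng.uniform(0.0, 1.0, n), grid)
                    for _ in range(reps)])

for n in (10, 100, 1000):
    print(n, w1_risk(n))   # shrinks roughly like n^{-1/2}
```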
Lower Bound (packing): For every $\varepsilon > 0$, the minimax risk satisfies

$$\mathcal{M}_n(\mathcal{P}) \ge c\, \varepsilon^p \min\left\{ 1, \sqrt{\frac{M(\mathcal{X}, \varepsilon)}{n}} \right\},$$

which is sharp in the metric-entropy regime. The lower bound is constructed by reduction to discrete distribution estimation under $\ell_1$ loss and Fano-type arguments, showing that no estimator can beat the rate dictated by the largest packing sets.
Lower Bound (heavy tails): If, in addition to the moment constraint $m_q(P) \le m_q$, the space $\mathcal{X}$ contains a point at distance $r = (m_q n)^{1/q}$ from $x_0$, then

$$\mathcal{M}_n(\mathcal{P}) \ge c\, m_q^{p/q}\, n^{-(q-p)/q},$$

demonstrating the optimality of the tail term in full generality.
Special Cases:
- For finite support of size $k$: $\mathcal{M}_n \asymp \operatorname{diam}(\mathcal{X})^p \sqrt{k/n}$.
- For $\mathcal{X} = [0, 1]^d$: $\mathcal{M}_n \asymp n^{-p/d}$, provided $d > 2p$ (for $d < 2p$, the parametric rate $n^{-1/2}$ takes over).
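The finite-support rate is easy to verify directly. With $P = \mathrm{Bernoulli}(1/2)$ on $\{0, 1\}$ (so $k = 2$), the empirical measure satisfies $W_1(P, \hat{P}_n) = |\hat{p} - 1/2|$, and the expected error should halve each time $n$ quadruples. A quick Monte Carlo sketch (helper names are ours):

```python
# Finite-support sanity check: P = Bernoulli(1/2) on {0, 1}.
# On a two-point space, W_1(P, P_n) = |p_hat - 1/2| exactly.
import numpy as np

rng = np.random.default_rng(2)

def bernoulli_w1_risk(n, reps=5000):
    p_hat = rng.binomial(n, 0.5, size=reps) / n
    return np.mean(np.abs(p_hat - 0.5))   # = E[W_1(P, P_n)]

for n in (25, 100, 400):
    print(n, bernoulli_w1_risk(n))   # quadrupling n roughly halves the risk
```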
The minimax-optimal estimator is always the empirical measure; no smoothing, kernel method, or other refinement can improve the rate under these minimal assumptions (Singh et al., 2018).
3. Proof Techniques and Structural Insights
- Upper bounds rely on nested multi-scale partitions of $\mathcal{X}$, controlling fluctuations of the empirical mass in partition cells via multinomial tail bounds. The transport cost within a cell at scale $k$ is at most $\varepsilon_k^p$, and the overall discrepancy telescopes through the scales, combining metric entropy with multinomial deviations (Han & Weissman, 2015).
- The tail bound exploits Markov's inequality to control the empirical mass far from the reference point $x_0$, naturally yielding the $m_q\, r^{p-q}$ decay via moment bounds.
- Lower bounds project candidate estimators onto $\varepsilon$-packing sets, showing that the transport cost cannot be reduced below $\varepsilon^p$ times the $\ell_1$ estimation error on the packing set. Reduction to discrete estimation and standard information-theoretic (Fano-type) lower bounds yield tight rates.
- The heavy-tail regime is realized by constructing two-point measures separated by a distance $r$ growing with $n$, balancing KL divergence against the moment constraint to ensure minimax sharpness.
These arguments demonstrate that both geometric complexity and moment control are essential: without both, convergence can be arbitrarily slow.
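The two-point construction can be written out explicitly. For $P = (1 - \delta)\,\delta_{x_0} + \delta\,\delta_{x_r}$ and $Q = \delta_{x_0}$, one has $W_p^p(P, Q) = \delta r^p$ and $m_q(P) = \delta r^q$ exactly; choosing $r = (m_q n)^{1/q}$ and $\delta = 1/n$ respects the moment budget while keeping the outlier mass essentially invisible in $n$ samples. A sketch of the resulting gap (function names are ours):

```python
# Two-point heavy-tail construction: P = (1 - delta) delta_{x0} + delta delta_{x_r}
# versus Q = delta_{x0}. Exactly: W_p(P, Q)^p = delta * r^p and m_q(P) = delta * r^q.
def two_point_wp(delta, r, p):
    return (delta * r ** p) ** (1.0 / p)

m_q, p, q = 1.0, 1, 2
for n in (10, 100, 1000):
    r = (m_q * n) ** (1.0 / q)   # outlier distance saturating the moment budget
    delta = 1.0 / n              # O(1/n) mass: unseen with constant probability
    # The gap equals m_q^{1/q} * n^{-(q-p)/(pq)}; for (p, q) = (1, 2) this is n^{-1/2}.
    print(n, two_point_wp(delta, r, p))
```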
4. Algorithmic and Practical Properties
- The empirical measure is provably minimax rate-optimal across all regimes considered. No smoothing, kernel, or regularization improvement is possible under the class of assumptions imposed.
- The sample complexity for achieving $W_p$ risk $\varepsilon$ is

$$n(\varepsilon) \asymp \varepsilon^{-\max(d^*,\, 2p)},$$

where $d^*$ is the effective dimension determined by the metric entropy, i.e., $N(\mathcal{X}, \varepsilon) \asymp \varepsilon^{-d^*}$.
- These results sharply delineate when transport-based estimators (including Wasserstein GANs, Sinkhorn barycenters, and robust estimators minimizing over Wasserstein balls) can generalize: their sample complexity scales polynomially in $1/\varepsilon$, with an exponent dictated by the ambient geometry and the moment tail.
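As a back-of-the-envelope illustration of this scaling, the helper below (ours; entropy term only, with constants and the heavy-tail term dropped) evaluates $n(\varepsilon) \approx \varepsilon^{-\max(d, 2p)}$ and shows how the requirement blows up once $d$ exceeds $2p$:

```python
# Rough sample complexity n(eps) ~ eps^{-max(d, 2p)} for W_p risk eps on [0, 1]^d.
# Illustrative only: constants and the heavy-tail contribution are dropped.
def n_required(eps, d, p):
    return round(eps ** -max(d, 2 * p))

for d in (1, 2, 10):
    print(d, n_required(0.1, d, p=1))   # curse of dimensionality past d = 2p
```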
Table: Summary of regime-dependent minimax rates
| Space | Model Class | Rate ($\mathbb{E}[W_p^p]$) | Empirical Estimator Optimal? |
|---|---|---|---|
| Finite, $k$ points | All Borel measures | $\sqrt{k/n}$ | Yes |
| $[0, 1]^d$, $d < 2p$ | All Borel measures | $n^{-1/2}$ | Yes |
| $[0, 1]^d$, $d = 2p$ | All Borel measures | $n^{-1/2} \log n$ | Yes |
| $[0, 1]^d$, $d > 2p$ | All Borel measures | $n^{-p/d}$ | Yes |
| Unbounded, heavy tail | $m_q(P) \le C$, $q > p$ | $n^{-(q-p)/q}$ | Yes |
5. Implications for Model Classes and Statistical Theory
- The framework is agnostic to the specifics of $\mathcal{X}$ beyond its metric and entropy structure, accommodating highly general and possibly infinite-dimensional settings.
- In finite-support or low-dimensional Euclidean spaces, the empirical measure achieves parametric rates whenever $d < 2p$; beyond that threshold the rate $n^{-p/d}$ makes explicit the difficulty supplied by geometry and high dimension.
- These bounds are recognized as fundamental for understanding the learning-theoretic properties and sample complexity of modern optimal-transport-based algorithms, including generative adversarial approaches employing Wasserstein metrics.
- No improvement is possible without imposing stricter conditions on the model class (e.g., bounded densities, additional smoothness, etc.), thus setting clear limits for what can be achieved with solely minimal moment and entropy controls.
6. Connections and Generalization to Contemporary Estimation
This minimax framework grounds the design and theoretical evaluation of any transport-based statistical estimator, providing the information-theoretic limits that any algorithm must respect. It directly informs:
- The analysis of Wasserstein-based density estimators and their optimality;
- Robustness and generalization of transport-based inference under distributional shift;
- Algorithmic solutions that must account for both tail constraints and the geometric complexity of $\mathcal{X}$.
Relevant references include the original development by Singh and Póczos (Singh et al., 2018), with supporting asymptotics and practical analyses by Fournier & Guillin, Weed & Bach, and Han & Weissman.
References:
- "Minimax Distribution Estimation in Wasserstein Distance" (Singh et al., 2018)