Minimum Wasserstein Distance Estimation
- The framework leverages the Wasserstein distance as a loss function to measure discrepancies between probability distributions under minimal moment assumptions.
- It integrates geometric complexity via metric entropy and moment conditions to establish precise minimax rates of convergence.
- The empirical estimator is proven minimax optimal, highlighting the inherent limits of transport-based statistical methods.
The minimum Wasserstein distance estimation framework centers on the use of the Wasserstein metric from optimal transport theory as a statistical loss for estimating probability distributions, parameters, or mixing measures. This approach quantifies the discrepancy between an empirical distribution (observed data) and an estimator, model, or class, and establishes minimax rates of convergence, general lower and upper bounds, and principled algorithmic solutions for a broad array of settings. The framework applies to arbitrary metric sample spaces, requires only weak moment conditions, and leverages geometric and metric-entropy properties to characterize statistical difficulty and estimator performance.
1. Mathematical Framework and Minimax Risk
Let $(\mathcal{X}, \rho)$ be a complete separable metric space and $\mathcal{P}$ a model class of Borel probability measures on $\mathcal{X}$. The $p$-Wasserstein distance between $P, Q \in \mathcal{P}$ is defined as

$$W_p(P, Q) = \left( \inf_{\pi \in \Pi(P, Q)} \int_{\mathcal{X} \times \mathcal{X}} \rho(x, y)^p \, \mathrm{d}\pi(x, y) \right)^{1/p},$$

where $\Pi(P, Q)$ is the set of all couplings (joint distributions on $\mathcal{X} \times \mathcal{X}$) with $P$ and $Q$ as marginals. An estimator $\hat{P}_n$ is any measurable map from samples $X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} P$ to $\mathcal{P}$. The $W_p$ risk is

$$R_n(\hat{P}_n, P) = \mathbb{E}\big[ W_p^p(\hat{P}_n, P) \big],$$

and the minimax risk over $\mathcal{P}$ is

$$\mathcal{M}_n(\mathcal{P}) = \inf_{\hat{P}_n} \sup_{P \in \mathcal{P}} \mathbb{E}\big[ W_p^p(\hat{P}_n, P) \big].$$
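To make the coupling definition concrete, the sketch below computes $W_p$ exactly for two finitely supported measures by solving the coupling linear program directly. This is our own minimal illustration using `scipy.optimize.linprog`; the helper name `wasserstein_p` is ours, not from the paper, and for real workloads a dedicated OT solver would be preferable.

```python
# Exact W_p between finitely supported measures via the coupling LP:
# minimize <cost, pi> subject to pi having the prescribed marginals.
import numpy as np
from scipy.optimize import linprog

def wasserstein_p(x, y, a, b, p=1):
    """W_p between sum_i a_i delta_{x_i} and sum_j b_j delta_{y_j} on the
    real line (here rho(x, y) = |x - y|; any metric would do)."""
    n, m = len(x), len(y)
    cost = np.abs(x[:, None] - y[None, :]) ** p   # rho(x_i, y_j)^p
    # Equality constraints: each row of pi sums to a_i, each column to b_j.
    A_eq = []
    for i in range(n):
        row = np.zeros((n, m)); row[i, :] = 1.0
        A_eq.append(row.ravel())
    for j in range(m):
        col = np.zeros((n, m)); col[:, j] = 1.0
        A_eq.append(col.ravel())
    res = linprog(cost.ravel(), A_eq=np.array(A_eq),
                  b_eq=np.concatenate([a, b]), bounds=(0, None),
                  method="highs")
    return res.fun ** (1.0 / p)

x = np.array([0.0, 1.0]); a = np.array([0.5, 0.5])
y = np.array([0.0, 1.0]); b = np.array([1.0, 0.0])
print(wasserstein_p(x, y, a, b, p=1))  # half the mass moves distance 1 -> 0.5
```

The LP has one redundant constraint (both marginals sum to one), which the HiGHS solver handles without special treatment.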
Analysis of $\mathcal{M}_n(\mathcal{P})$ combines geometric complexity (metric entropy) properties of $\mathcal{X}$ with weak tail (moment) conditions on the model class. The core complexity measures are the $\varepsilon$-covering number $N(\mathcal{X}, \varepsilon)$ and the $\varepsilon$-packing number $M(\mathcal{X}, \varepsilon)$:
- $N(\mathcal{X}, \varepsilon)$ is the minimal cardinality of a partition of $\mathcal{X}$ into sets of diameter at most $\varepsilon$;
- $M(\mathcal{X}, \varepsilon)$ is the maximal number of points in $\mathcal{X}$ with pairwise distances at least $\varepsilon$.
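For a finite point cloud the packing number is easy to approximate: a single greedy pass that keeps any point at distance at least $\varepsilon$ from all previously kept points produces a maximal $\varepsilon$-separated set. A minimal sketch (the helper `packing_number` is ours; greedy maximality yields a valid $\varepsilon$-packing, not necessarily the maximum-cardinality one):

```python
# Greedy lower estimate of the packing number M(X, eps) for a point cloud.
import numpy as np

def packing_number(points, eps):
    """Size of a greedily built maximal eps-separated subset of `points`."""
    centers = []
    for x in points:
        if all(np.linalg.norm(x - c) >= eps for c in centers):
            centers.append(x)
    return len(centers)

rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 1.0, size=(500, 2))   # samples from the unit square
for eps in (0.5, 0.25, 0.125):
    print(eps, packing_number(pts, eps))     # grows roughly like eps^{-2}
```

The observed growth as $\varepsilon$ shrinks reflects the effective dimension $d = 2$ of the square, which is exactly the quantity the entropy terms below track.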
For unbounded spaces, finite-moment conditions are required. Define, for $q > p$ and a reference point $x_0 \in \mathcal{X}$, the $q$-th moment

$$m_q(P) = \mathbb{E}_{X \sim P}\big[ \rho(X, x_0)^q \big].$$
2. Upper and Lower Bounds on Estimation Error
Upper Bound: The empirical measure $\hat{P}_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i}$ achieves

$$\mathbb{E}\big[ W_p^p(P, \hat{P}_n) \big] \le C \left( m_q(P)\, r^{p-q} + \sum_{k=0}^{K} \varepsilon_k^p \sqrt{\frac{N(B_r(x_0), \varepsilon_k)}{n}} \right)$$

for any radius $r > 0$, any decreasing sequence $\varepsilon_0 \ge \cdots \ge \varepsilon_K > 0$, and any integer $K$, where $C$ depends only on $p$ and $q$. The first term ("tail term") addresses mass outside the bounded region $B_r(x_0)$, controlled by Markov's inequality; the second ("entropy term") quantifies the complexity of the support at scales down to $\varepsilon_K$.
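The decay of the empirical measure's risk can be checked numerically. The simulation below is our own sketch: it uses SciPy's closed-form one-dimensional $W_1$ to estimate $\mathbb{E}[W_1(P, \hat{P}_n)]$ for $P = \mathrm{Uniform}[0, 1]$, a case with $d = 1 < 2p$ where the parametric rate $n^{-1/2}$ applies (the helper `w1_risk` and the grid discretization of $P$ are ours).

```python
# Monte Carlo estimate of E[W_1(P, P_n)] for P = Uniform[0, 1].
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)

def w1_risk(n, reps=200):
    """Average W_1 between n-sample empirical measures and a fine
    discretization of Uniform[0, 1] (grid error ~ 2.5e-4, negligible here)."""
    grid = np.linspace(0.0, 1.0, 2001)
    return np.mean([wasserstein_distance(rng.uniform(0.0, 1.0, n), grid)
                    for _ in range(reps)])

for n in (10, 100, 1000):
    print(n, w1_risk(n))   # shrinks roughly like n^{-1/2}
```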
Lower Bound (packing): For every $\varepsilon > 0$, the minimax risk satisfies

$$\mathcal{M}_n(\mathcal{P}) \ge c\, \varepsilon^p \min\left\{ 1, \sqrt{\frac{M(\mathcal{X}, \varepsilon)}{n}} \right\},$$

which is sharp in the metric-entropy regime. The lower bound is constructed by reduction to discrete distribution estimation under $\ell_1$ loss and Fano-type arguments, showing that no estimator can beat the rate dictated by the largest packing sets.
Lower Bound (heavy tails): If, in addition to the moment constraint $m_q(P) \le m_q$, the space $\mathcal{X}$ contains a point at distance $r = (m_q n)^{1/q}$ from $x_0$, then

$$\mathcal{M}_n(\mathcal{P}) \ge c\, m_q^{p/q}\, n^{-(q-p)/q},$$

demonstrating the optimality of the tail term in full generality.
Special Cases:
- For finite support of size $k$: $\mathcal{M}_n \asymp \operatorname{diam}(\mathcal{X})^p \sqrt{k/n}$.
- For $\mathcal{X} = [0, 1]^d$: $\mathcal{M}_n \asymp n^{-p/d}$, provided $d > 2p$ (for $d < 2p$, the parametric rate $n^{-1/2}$ takes over).
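The finite-support rate is easy to verify directly. With $P = \mathrm{Bernoulli}(1/2)$ on $\{0, 1\}$ (so $k = 2$), the empirical measure satisfies $W_1(P, \hat{P}_n) = |\hat{p} - 1/2|$, and the expected error should halve each time $n$ quadruples. A quick Monte Carlo sketch (helper names are ours):

```python
# Finite-support sanity check: P = Bernoulli(1/2) on {0, 1}.
# On a two-point space, W_1(P, P_n) = |p_hat - 1/2| exactly.
import numpy as np

rng = np.random.default_rng(2)

def bernoulli_w1_risk(n, reps=5000):
    p_hat = rng.binomial(n, 0.5, size=reps) / n
    return np.mean(np.abs(p_hat - 0.5))   # = E[W_1(P, P_n)]

for n in (25, 100, 400):
    print(n, bernoulli_w1_risk(n))   # quadrupling n roughly halves the risk
```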
The minimax-optimal estimator is always the empirical measure; no smoothing, kernel method, or other refinement can improve the rate under these minimal assumptions (Singh et al., 2018).
3. Proof Techniques and Structural Insights
- Upper bounds rely on nested multi-scale partitions of $\mathcal{X}$, controlling fluctuations of the empirical mass in partition cells via multinomial tail bounds. The transport cost within a cell at scale $k$ is at most $\varepsilon_k^p$, and the overall discrepancy telescopes through the scales, combining metric entropy with multinomial deviations (Han & Weissman, 2015).
- The tail bound exploits Markov's inequality to control the empirical mass far from the reference point $x_0$, naturally yielding the $m_q\, r^{p-q}$ decay via moment bounds.
- Lower bounds project candidate estimators onto $\varepsilon$-packing sets, showing that the transport cost cannot be reduced below $\varepsilon^p$ times the $\ell_1$ estimation error on the packing set. Reduction to discrete estimation and standard information-theoretic (Fano-type) lower bounds yield tight rates.
- The heavy-tail regime is realized by constructing two-point measures separated by a distance $r$ growing with $n$, balancing KL divergence against the moment constraint to ensure minimax sharpness.
These arguments demonstrate that both geometric complexity and moment control are essential: without both, convergence can be arbitrarily slow.
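The two-point construction can be written out explicitly. For $P = (1 - \delta)\,\delta_{x_0} + \delta\,\delta_{x_r}$ and $Q = \delta_{x_0}$, one has $W_p^p(P, Q) = \delta r^p$ and $m_q(P) = \delta r^q$ exactly; choosing $r = (m_q n)^{1/q}$ and $\delta = 1/n$ respects the moment budget while keeping the outlier mass essentially invisible in $n$ samples. A sketch of the resulting gap (function names are ours):

```python
# Two-point heavy-tail construction: P = (1 - delta) delta_{x0} + delta delta_{x_r}
# versus Q = delta_{x0}. Exactly: W_p(P, Q)^p = delta * r^p and m_q(P) = delta * r^q.
def two_point_wp(delta, r, p):
    return (delta * r ** p) ** (1.0 / p)

m_q, p, q = 1.0, 1, 2
for n in (10, 100, 1000):
    r = (m_q * n) ** (1.0 / q)   # outlier distance saturating the moment budget
    delta = 1.0 / n              # O(1/n) mass: unseen with constant probability
    # The gap equals m_q^{1/q} * n^{-(q-p)/(pq)}; for (p, q) = (1, 2) this is n^{-1/2}.
    print(n, two_point_wp(delta, r, p))
```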
4. Algorithmic and Practical Properties
- The empirical measure is provably minimax rate-optimal across all regimes considered. No smoothing, kernel, or regularization improvement is possible under the class of assumptions imposed.
- The sample complexity for achieving $W_p$ risk $\varepsilon$ is

$$n(\varepsilon) \asymp \varepsilon^{-\max(d^*,\, 2p)},$$

where $d^*$ is the effective dimension determined by the metric entropy, i.e., $N(\mathcal{X}, \varepsilon) \asymp \varepsilon^{-d^*}$.
- These results sharply delineate when transport-based estimators (including Wasserstein GANs, Sinkhorn barycenters, and robust estimators minimizing over Wasserstein balls) can generalize: their sample complexity scales polynomially in $1/\varepsilon$, with an exponent dictated by the ambient geometry and the moment tail.
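As a back-of-the-envelope illustration of this scaling, the helper below (ours; entropy term only, with constants and the heavy-tail term dropped) evaluates $n(\varepsilon) \approx \varepsilon^{-\max(d, 2p)}$ and shows how the requirement blows up once $d$ exceeds $2p$:

```python
# Rough sample complexity n(eps) ~ eps^{-max(d, 2p)} for W_p risk eps on [0, 1]^d.
# Illustrative only: constants and the heavy-tail contribution are dropped.
def n_required(eps, d, p):
    return round(eps ** -max(d, 2 * p))

for d in (1, 2, 10):
    print(d, n_required(0.1, d, p=1))   # curse of dimensionality past d = 2p
```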
Table: Summary of regime-dependent minimax rates
| Space | Model Class | Rate ($\mathbb{E}[W_p^p]$) | Empirical Estimator Optimal? |
|---|---|---|---|
| Finite, $k$ points | All Borel measures | $\sqrt{k/n}$ | Yes |
| $[0, 1]^d$, $d < 2p$ | All Borel measures | $n^{-1/2}$ | Yes |
| $[0, 1]^d$, $d = 2p$ | All Borel measures | $n^{-1/2} \log n$ | Yes |
| $[0, 1]^d$, $d > 2p$ | All Borel measures | $n^{-p/d}$ | Yes |
| Unbounded, heavy tail | $m_q(P) \le C$, $q > p$ | $n^{-(q-p)/q}$ | Yes |
5. Implications for Model Classes and Statistical Theory
- The framework is agnostic to the specifics of $\mathcal{X}$ beyond its metric and entropy structure, accommodating highly general and possibly infinite-dimensional settings.
- In finite-support or low-dimensional Euclidean spaces, the empirical measure achieves parametric rates whenever $d < 2p$; beyond that threshold the rate $n^{-p/d}$ makes explicit the difficulty supplied by geometry and high dimension.
- These bounds are recognized as fundamental for understanding the learning-theoretic properties and sample complexity of modern optimal-transport-based algorithms, including generative adversarial approaches employing Wasserstein metrics.
- No improvement is possible without imposing stricter conditions on the model class (e.g., bounded densities, additional smoothness, etc.), thus setting clear limits for what can be achieved with solely minimal moment and entropy controls.
6. Connections and Generalization to Contemporary Estimation
This minimax framework grounds the design and theoretical evaluation of any transport-based statistical estimator, providing the information-theoretic limits that any algorithm must respect. It directly informs:
- The analysis of Wasserstein-based density estimators and their optimality;
- Robustness and generalization of transport-based inference under distributional shift;
- Algorithmic solutions that must account for both tail constraints and the geometric complexity of $\mathcal{X}$.
Relevant references include the original development by Singh and Póczos (Singh et al., 2018), with supporting asymptotics and practical analyses by Fournier & Guillin, Weed & Bach, and Han & Weissman.
References:
- "Minimax Distribution Estimation in Wasserstein Distance" (Singh et al., 2018)