
Wasserstein Distance: Theory & Applications

Updated 8 December 2025
  • Wasserstein distance is a metric that quantifies discrepancies between probability measures by solving an optimal mass transport problem.
  • It encompasses both deterministic (Monge) and relaxed (Kantorovich) formulations, providing a geometric framework for comparing distributions.
  • Its applications span imaging, machine learning, PDE analysis, and data science, addressing issues like mass fluctuations and stability.

The Wasserstein distance, also known as the optimal transport (OT) distance or Earth Mover's Distance (EMD) in specific cases, is a fundamental metric that quantifies the discrepancy between probability measures by solving a mass transportation problem. It provides a geometric framework for comparing distributions and has found wide application across mathematics, probability theory, statistics, machine learning, computer vision, and the analysis of partial differential equations.

1. Mathematical Definition and Foundational Principles

Let $(\mathcal{X}, d)$ be a complete separable metric space, and let $\mathcal{P}_p(\mathcal{X})$ denote the set of Borel probability measures on $\mathcal{X}$ with finite $p$-th moment. For $\mu, \nu \in \mathcal{P}_p(\mathcal{X})$, the $p$-Wasserstein distance is defined as

$$W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{\mathcal{X}\times\mathcal{X}} d(x, y)^p\, d\gamma(x, y) \right)^{1/p}$$

where $\Gamma(\mu, \nu)$ is the set of all transport plans (couplings) with marginals $\mu$ and $\nu$.
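On the real line the infimum in this definition can be evaluated without any optimization: for two empirical measures with the same number of equally weighted atoms, the monotone (sorted) pairing is an optimal coupling for every $p \ge 1$. The following is a minimal sketch under exactly those assumptions; the function name `wasserstein_1d` and the sample points are illustrative, not taken from the cited papers.

```python
def wasserstein_1d(xs, ys, p=1.0):
    """W_p between two empirical measures on the real line.

    Assumes both measures have the same number of atoms, each with
    weight 1/n. In one dimension the sorted pairing is optimal for p >= 1.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty samples of equal size")
    if p < 1:
        raise ValueError("p must be at least 1")
    pairs = zip(sorted(xs), sorted(ys))
    mean_cost = sum(abs(x - y) ** p for x, y in pairs) / len(xs)
    return mean_cost ** (1.0 / p)


mu = [0.0, 1.0, 3.0]
nu = [5.0, 2.0, 4.0]  # order does not matter, the atoms are sorted internally

w1 = wasserstein_1d(mu, nu, p=1)  # (2 + 3 + 2) / 3
w2 = wasserstein_1d(mu, nu, p=2)  # sqrt((4 + 9 + 4) / 3)
```

For weighted atoms or higher dimensions the sorted pairing no longer applies and a linear-programming or entropic solver is needed.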

The $p = 1$ case admits the Kantorovich–Rubinstein dual representation: $W_1(\mu, \nu) = \sup_{\|f\|_{\mathrm{Lip}} \le 1} \int_{\mathcal{X}} f \, d(\mu - \nu)$, where $\|f\|_{\mathrm{Lip}}$ denotes the Lipschitz constant of $f$ (Piccoli et al., 2013).
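The duality can be checked by hand for histograms on the integer grid $0, 1, \dots, n-1$. The sketch below assumes unit grid spacing and equal total mass; it uses the one-dimensional identity $W_1 = \sum_k |F_k - G_k|$ (with $F$, $G$ the cumulative sums) for the primal value, then builds a 1-Lipschitz potential $f$ whose dual objective $\sum_k f_k (\mu_k - \nu_k)$ attains the same number. All function names are hypothetical.

```python
from itertools import accumulate


def w1_primal_cdf(mu, nu):
    """Primal W_1 on the grid 0..n-1 via the CDF difference (unit spacing)."""
    F, G = list(accumulate(mu)), list(accumulate(nu))
    return sum(abs(f - g) for f, g in zip(F[:-1], G[:-1]))


def kantorovich_potential(mu, nu):
    """A 1-Lipschitz f on the grid that maximizes sum_k f_k (mu_k - nu_k).

    f steps down by 1 where mu has sent more mass to the left than nu
    (F_k > G_k) and up by 1 where it has sent less.
    """
    F, G = list(accumulate(mu)), list(accumulate(nu))
    f = [0.0]
    for Fk, Gk in zip(F[:-1], G[:-1]):
        diff = Fk - Gk
        step = -1.0 if diff > 0 else (1.0 if diff < 0 else 0.0)
        f.append(f[-1] + step)
    return f


def dual_value(f, mu, nu):
    """Dual objective: integral of f against (mu - nu)."""
    return sum(fk * (m - n) for fk, m, n in zip(f, mu, nu))


mu = [0.5, 0.5, 0.0, 0.0]
nu = [0.0, 0.25, 0.25, 0.5]

primal = w1_primal_cdf(mu, nu)
potential = kantorovich_potential(mu, nu)
dual = dual_value(potential, mu, nu)
```

Here both `primal` and `dual` evaluate to 1.75, and the potential `[0, -1, -2, -3]` has slope of magnitude at most 1 everywhere, as the constraint requires.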

The Wasserstein distance is a bona fide metric on $\mathcal{P}_p(\mathcal{X})$, satisfying nonnegativity, symmetry, identity of indiscernibles, and the triangle inequality (Panaretos et al., 2018). Finiteness of $W_p(\mu, \nu)$ is guaranteed when both measures have finite $p$-th moment.

2. Interpretation, Variants, and Duality

Monge and Kantorovich Formulations

The OT formulation seeks the least-cost way of transporting one distribution to another. The Monge problem requires a deterministic transport map $T : \mathcal{X} \to \mathcal{X}$ with $T_\# \mu = \nu$ minimizing $\int_{\mathcal{X}} d(x, T(x))^p \, d\mu(x)$, while Kantorovich's relaxation allows for couplings $\gamma \in \Gamma(\mu, \nu)$ and always achieves a minimum (Piccoli et al., 2013).
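A standard small example shows why the relaxation matters: a single Dirac mass cannot be split by a map, so the Monge problem from $\delta_0$ to $\tfrac12\delta_{-1} + \tfrac12\delta_{1}$ has no admissible map, while the Kantorovich problem still has a coupling. The sketch below brute-forces all maps between the finite supports; the dictionary representation and helper names are illustrative assumptions.

```python
from itertools import product


def pushforward(mu, T):
    """Push the discrete measure mu (dict point -> mass) through the map T."""
    out = {}
    for x, mass in mu.items():
        out[T[x]] = out.get(T[x], 0.0) + mass
    return out


def monge_maps(mu, nu, tol=1e-12):
    """All maps T: supp(mu) -> supp(nu) with T_# mu = nu, by enumeration."""
    xs, ys = list(mu), list(nu)
    feasible = []
    for images in product(ys, repeat=len(xs)):
        T = dict(zip(xs, images))
        pushed = pushforward(mu, T)
        if all(abs(pushed.get(y, 0.0) - nu[y]) < tol for y in ys):
            feasible.append(T)
    return feasible


mu = {0.0: 1.0}
nu = {-1.0: 0.5, 1.0: 0.5}

maps_forward = monge_maps(mu, nu)   # no map can split the atom at 0
maps_backward = monge_maps(nu, mu)  # both atoms can be sent to 0

# When mu is a single Dirac mass, the product measure is the only coupling.
gamma = {(x, y): mu[x] * nu[y] for x in mu for y in nu}
kantorovich_cost = sum(abs(x - y) * g for (x, y), g in gamma.items())
```

The forward Monge problem is infeasible (`maps_forward` is empty), the backward one has exactly one map, and the Kantorovich cost for $p = 1$ is 1 in both directions.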

Benamou–Brenier Dynamical Characterization

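For $p = 2$, there is a dynamic fluid-mechanical representation: $W_2^2(\mu, \nu) = \inf_{(\mu_t, v_t)} \int_0^1 \int_{\mathbb{R}^d} |v_t(x)|^2 \, d\mu_t(x) \, dt$ subject to the continuity equation

$$\partial_t \mu_t + \nabla \cdot (v_t \mu_t) = 0, \qquad \mu_0 = \mu, \quad \mu_1 = \nu$$

(Piccoli et al., 2013).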
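In Lagrangian terms, the dynamic formulation says that moving particles along straight lines at constant speed minimizes the kinetic action, and that the minimal action equals $W_2^2$. The sketch below illustrates this for equally weighted point clouds on the real line, assuming particle paths of the form $x_i(t) = x_i + s(t)(y_i - x_i)$ with the sorted pairing; `kinetic_action` and the two time profiles are hypothetical choices made for illustration.

```python
def kinetic_action(xs, ys, speed_profile, steps=1000):
    """Midpoint-rule value of int_0^1 (1/n) sum_i |v_i(t)|^2 dt.

    Particles follow x_i(t) = x_i + s(t) * (y_i - x_i) along the sorted
    pairing, so v_i(t) = s'(t) * (y_i - x_i); speed_profile(t) is s'(t).
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty samples of equal size")
    pairs = list(zip(sorted(xs), sorted(ys)))
    mean_sq_disp = sum((y - x) ** 2 for x, y in pairs) / len(pairs)
    dt = 1.0 / steps
    return sum(
        speed_profile((k + 0.5) * dt) ** 2 * mean_sq_disp * dt
        for k in range(steps)
    )


xs = [0.0, 1.0, 3.0]
ys = [5.0, 2.0, 4.0]

# W_2^2 from the static (sorted-pairing) formula in one dimension.
w2_squared = sum((y - x) ** 2 for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

action_constant = kinetic_action(xs, ys, lambda t: 1.0)       # s(t) = t
action_quadratic = kinetic_action(xs, ys, lambda t: 2.0 * t)  # s(t) = t^2
```

The constant-speed path reproduces `w2_squared`, while the accelerating path $s(t) = t^2$ reaches the same endpoint with an action larger by the factor $\int_0^1 4t^2\,dt = 4/3$.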

The Flat Metric

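For arbitrary (possibly unequal mass) Radon measures, the generalized Wasserstein distance $W_1^{1,1}$ coincides with the flat (bounded-Lipschitz) metric: $W_1^{1,1}(\mu, \nu) = \sup \left\{ \int f \, d(\mu - \nu) : \|f\|_\infty \le 1, \ \|f\|_{\mathrm{Lip}} \le 1 \right\}$ (Piccoli et al., 2013).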
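For two unit Dirac masses at distance $d$, the supremum defining the flat metric only depends on the two values $u = f(x)$ and $v = f(y)$, constrained by $|u|, |v| \le 1$ and $|u - v| \le d$. A brute-force search over a grid of such pairs, sketched below under that reduction, recovers the value $\min\{d, 2\}$; the function name and grid size are illustrative.

```python
def flat_metric_two_diracs(dist, grid=201):
    """Grid search for sup of f(x) - f(y) with |f| <= 1 and Lip(f) <= 1.

    Only the two values u = f(x), v = f(y) matter for delta_x - delta_y,
    so the search runs over pairs (u, v) in [-1, 1]^2 with |u - v| <= dist.
    """
    values = [-1.0 + 2.0 * i / (grid - 1) for i in range(grid)]
    best = 0.0
    for u in values:
        for v in values:
            if abs(u - v) <= dist + 1e-12:  # Lipschitz constraint
                best = max(best, u - v)
    return best


near = flat_metric_two_diracs(0.5)  # Lipschitz bound is active
far = flat_metric_two_diracs(3.0)   # sup-norm bound is active
```

When the points are close the Lipschitz constraint binds and the value equals the distance; when they are far apart the sup-norm constraint binds and the value saturates at 2, the cost of deleting one mass and creating the other.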

3. Extensions and Computational Methods

Generalized Wasserstein Distance

For measures $\mu, \nu$ of possibly differing total mass and parameters $a, b > 0$, the generalized Wasserstein distance is defined as

$$W_p^{a,b}(\mu, \nu) = \left( \inf_{\tilde\mu, \tilde\nu, \ |\tilde\mu| = |\tilde\nu|} a^p \left( |\mu - \tilde\mu| + |\nu - \tilde\nu| \right)^p + b^p \, W_p^p(\tilde\mu, \tilde\nu) \right)^{1/p}$$

where $|\mu - \tilde\mu|$ is the total variation of the "removed" mass, and the infimum is over pairs $(\tilde\mu, \tilde\nu)$ with equal total mass. The $a$ term penalizes creation/removal, $b$ the transport, and $p$ controls aggregation (Piccoli et al., 2013).

Generalized Benamou–Brenier Formula

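A dynamic formulation extends to $W_2^{a,b}$: $\left( W_2^{a,b}(\mu_0, \mu_1) \right)^2 = \inf \left\{ a^2 \int_0^1 \left( \int d|h_t| \right)^2 dt + b^2 \int_0^1 \int |v_t|^2 \, d\mu_t \, dt \right\}$ where $h_t$ encodes sources/sinks and the continuity equation has a source term, $\partial_t \mu_t + \nabla \cdot (v_t \mu_t) = h_t$. This subsumes pure mass transport and allows for creation/removal (Piccoli et al., 2013).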

Existence and Homogeneity

$W_p^{a,b}$ is a metric on the cone of nonnegative Radon measures and attains its infimum for each pair of measures; for $p = 1$ it is homogeneous, $W_1^{a,b}(k\mu, k\nu) = k \, W_1^{a,b}(\mu, \nu)$ for any $k \ge 0$ (Piccoli et al., 2013).

4. Analytical and Practical Properties

Mass Mismatch and Total Variation

When $|\mu| = |\nu|$ and $a \to \infty$, $W_p^{a,b}$ reduces to the pure $b \, W_p$, and when $b \to \infty$, it reduces to the total variation norm (scaled by $a$). For $p = 1$, $a = b = 1$, the equality $W_1^{1,1}(\mu, \nu) = \sup \left\{ \int f \, d(\mu - \nu) : \|f\|_\infty \le 1, \ \|f\|_{\mathrm{Lip}} \le 1 \right\}$ (flat metric) holds (Piccoli et al., 2013). Explicitly, for $\mu = \delta_x$, $\nu = \delta_y$,

$$W_1^{1,1}(\delta_x, \delta_y) = \min \{ d(x, y), \ 2 \}$$

exhibiting the tradeoff between removal/addition and transportation costs.

Connection to Partial Differential Equations

Wasserstein distances and their generalizations are especially relevant for evolution equations such as the continuity equation with source, where one typically needs to compare measures of variable mass. The $W_p^{a,b}$ framework is adapted to these contexts and yields contraction or stability estimates even for solutions that do not preserve total mass (Piccoli et al., 2013).

Limits and Interpolations

$W_p^{a,b}$ provides a continuous interpolation between the total variation distance (as $b \to \infty$, penalizing all transport) and the classical Wasserstein distance (as $a \to \infty$ for measures of equal mass, when creation/removal becomes prohibitively expensive). This is particularly valuable in applications such as comparing histograms of unequal mass, which is common in imaging and statistical data analysis.

5. Theoretical and Algorithmic Framework

Fenchel–Legendre Duality

The proof of the equivalence between $W_1^{1,1}$ and the flat metric relies on convex analysis and Fenchel–Legendre duality: the sum of convex indicators for $\|f\|_\infty \le 1$ and $\|f\|_{\mathrm{Lip}} \le 1$ leads, via a theorem of Rockafellar, to a dual representation that exactly matches the primal $W_1^{1,1}$ definition (Piccoli et al., 2013).

Algorithmic Considerations

For $W_2^{a,b}$, the dynamic Benamou–Brenier approach yields an explicit minimization over velocity fields and source terms. The infimum is realized, and the action can be constructed explicitly through "sample-and-hold" schemes that alternate between mass removal, transport, and creation in small time intervals. Convexity and stability under flow are key technical lemmas supporting these constructions.

Examples of Computation

For measures concentrated on points with different masses, optimal decomposition may entail only mass removal/addition, only transport, or a mixture, determined (for $p = 1$) by the ratio $b \, d(x, y) / (2a)$ of the per-unit transport cost to the per-unit removal-plus-creation cost. If $b \, d(x, y) \ge 2a$, it's optimal to remove/add all; otherwise, it pays to transport part of the mass.
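The case of two weighted Dirac masses, $m_1\delta_x$ and $m_2\delta_y$, can be worked out directly for $p = 1$. The sketch below assumes that the kept mass $m$ is a sub-measure of both ($0 \le m \le \min\{m_1, m_2\}$), that removal and creation cost $a$ per unit, and that transport costs $b$ times distance per unit, so the total cost $a(m_1 + m_2 - 2m) + b\,m\,d$ is affine in $m$ and is minimized at an endpoint. The function name and return convention are illustrative.

```python
def generalized_w1_two_diracs(m1, m2, dist, a=1.0, b=1.0):
    """Generalized W_1 between m1*delta_x and m2*delta_y (assumed p = 1).

    Returns (cost, transported_mass). The cost
        a * (m1 + m2 - 2*m) + b * m * dist
    is affine in the transported mass m, so the optimum is m = 0
    or m = min(m1, m2), depending on the sign of b*dist - 2*a.
    """
    if min(m1, m2, dist, a, b) < 0:
        raise ValueError("masses, distance and weights must be nonnegative")
    if b * dist >= 2.0 * a:
        transported = 0.0          # cheaper to delete and recreate everything
    else:
        transported = min(m1, m2)  # move the common mass, fix the excess
    cost = a * (m1 + m2 - 2.0 * transported) + b * transported * dist
    return cost, transported


close_cost, close_mass = generalized_w1_two_diracs(1.0, 2.0, dist=0.5)
far_cost, far_mass = generalized_w1_two_diracs(1.0, 2.0, dist=3.0)
```

With $a = b = 1$, the nearby pair transports the common unit of mass and creates the missing unit (cost 1.5), while the distant pair is handled purely by removal and creation (cost 3).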

6. Applications and Implications

Imaging, Data Analysis, and Beyond

The generalized Wasserstein metric $W_p^{a,b}$ allows meaningful comparison of data distributions (histograms, point clouds) with mass fluctuations. This is essential in image processing and vision, where illumination or occlusion can alter total mass, and in statistical analysis of data sets with missing data or over-sampling.

PDE Theory and Contractivity

$W_p^{a,b}$ has enabled new existence and stability results for evolution equations with source terms, accommodating solutions where total mass is not preserved, and guaranteeing meaningful contractivity in this extended framework (Piccoli et al., 2013).

Hierarchical Relation to $W_p$

$W_p^{a,b}$ recovers $W_p$ and the total variation metric in limits and thus underlies a unifying theory for purely geometric transport and purely mass error terms.


References:

  • Piccoli, B. & Rossi, F. "On properties of the Generalized Wasserstein distance" (Piccoli et al., 2013)
  • Benamou, J.-D. & Brenier, Y. "A computational fluid mechanics solution to the Monge–Kantorovich mass transfer problem"
  • Villani, C. "Optimal Transport: Old and New," Springer

This summary covers the structure, properties, dualities, analytical formulations, and key application domains of the classical and generalized Wasserstein distances as developed in (Piccoli et al., 2013).
