Wasserstein Distance: Theory & Applications

Updated 8 December 2025

Wasserstein distance is a metric that quantifies discrepancies between probability measures by solving an optimal mass transport problem.
It encompasses both deterministic (Monge) and relaxed (Kantorovich) formulations, providing a geometric framework for comparing distributions.
Its applications span imaging, machine learning, PDE analysis, and data science, addressing issues like mass fluctuations and stability.

The Wasserstein distance, also known as the optimal transport (OT) distance or Earth Mover's Distance (EMD) in specific cases, is a fundamental metric that quantifies the discrepancy between probability measures by solving a mass transportation problem. It provides a geometric framework for comparing distributions and has found wide application across mathematics, probability theory, statistics, machine learning, computer vision, and the analysis of partial differential equations.

1. Mathematical Definition and Foundational Principles

Let $(\mathcal{X}, d)$ be a complete separable metric space, and let $\mathcal{P}_p(\mathcal{X})$ denote the set of Borel probability measures on $\mathcal{X}$ with finite $p^{\text{th}}$ moment. For $\mu, \nu \in \mathcal{P}_p(\mathcal{X})$, the $p$-Wasserstein distance is defined as

$W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{\mathcal{X}\times\mathcal{X}} d(x, y)^p\, d\gamma(x, y) \right)^{1/p}$

where $\Gamma(\mu, \nu)$ is the set of all transport plans (couplings) with marginals $\mu$ and $\nu$.
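On the real line the definition can be evaluated directly for small examples. The following is a minimal sketch, assuming two empirical measures with the same number of equally weighted sample points; in that special case pairing the sorted samples is an optimal coupling for $p \ge 1$. The function name `wasserstein_1d` and the sample values are illustrative and do not come from the cited references.

```python
def wasserstein_1d(xs, ys, p=1.0):
    """W_p between two equal-size, equal-weight empirical measures on the real line."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    # In 1D, the monotone (sorted) pairing is an optimal coupling for p >= 1.
    pairs = zip(sorted(xs), sorted(ys))
    # Average transport cost d(x, y)^p under that coupling.
    cost = sum(abs(x - y) ** p for x, y in pairs) / len(xs)
    return cost ** (1.0 / p)


mu = [0.0, 1.0, 3.0]   # hypothetical sample points for mu
nu = [5.0, 4.0, 8.0]   # hypothetical sample points for nu

print(wasserstein_1d(mu, nu, p=1))
print(wasserstein_1d(mu, nu, p=2))
```

For general discrete measures in higher dimensions the infimum is a linear program and needs a dedicated solver; the sorting shortcut applies only to this one-dimensional setting.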

The $p = 1$ case admits the Kantorovich–Rubinstein dual representation: $W_1(\mu, \nu) = \sup_{\|f\|_{\mathrm{Lip}} \le 1} \int_{\mathcal{X}} f \, d(\mu - \nu)$, where $\|f\|_{\mathrm{Lip}}$ denotes the Lipschitz constant of $f$ (Piccoli et al., 2013).
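As a sanity check on the Kantorovich–Rubinstein duality, the sketch below evaluates $\int f \, d(\mu - \nu)$ for a few 1-Lipschitz test functions and compares each value with the primal $W_1$ of two small empirical measures on the real line. The helper names, sample points, and test functions are illustrative assumptions; the primal value uses the sorted-pairing shortcut that is valid for equal-size, equal-weight samples in one dimension.

```python
def w1_sorted(xs, ys):
    """Primal W_1 for equal-size, equal-weight samples on the real line."""
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


def dual_value(f, xs, ys):
    """Integral of f against mu - nu for equal-weight empirical measures."""
    return sum(f(x) for x in xs) / len(xs) - sum(f(y) for y in ys) / len(ys)


mu = [0.0, 1.0, 3.0]   # hypothetical sample points for mu
nu = [5.0, 4.0, 8.0]   # hypothetical sample points for nu

# Each candidate has Lipschitz constant at most 1.
candidates = {
    "f(x) = -x": lambda x: -x,
    "f(x) = |x - 2|": lambda x: abs(x - 2.0),
    "f(x) = min(x, 4)": lambda x: min(x, 4.0),
}

primal = w1_sorted(mu, nu)
for name, f in candidates.items():
    # Every 1-Lipschitz function gives a lower bound on W_1.
    print(name, dual_value(f, mu, nu), "<=", primal)
```

Here every point of `mu` lies to the left of every point of `nu`, so the candidate $f(x) = -x$ attains the supremum; the other candidates give strictly smaller values.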

The Wasserstein distance is a bona fide metric on $\mathcal{P}_p(\mathcal{X})$, satisfying nonnegativity, symmetry, identity of indiscernibles, and the triangle inequality (Panaretos et al., 2018). Finite $p^{\text{th}}$ moments of both measures guarantee that $W_p(\mu, \nu)$ is finite.

2. Interpretation, Variants, and Duality

Monge and Kantorovich Formulations

The OT formulation seeks the least-cost way of transporting one distribution to another. The Monge problem requires a deterministic transport map $T$ with $T_\# \mu = \nu$ minimizing $\int_{\mathcal{X}} d(x, T(x))^p \, d\mu(x)$, while Kantorovich's relaxation allows for couplings $\gamma \in \Gamma(\mu, \nu)$ and always achieves a minimum (Piccoli et al., 2013).

Benamou–Brenier Dynamical Characterization

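For $p = 2$ on $\mathcal{X} = \mathbb{R}^d$, there is a dynamic fluid-mechanical representation: $W_2^2(\mu, \nu) = \inf_{(\mu_t, v_t)} \int_0^1 \int_{\mathbb{R}^d} |v_t(x)|^2 \, d\mu_t(x) \, dt$ subject to the continuity equation

$\partial_t \mu_t + \nabla \cdot (v_t \mu_t) = 0, \quad \mu_0 = \mu, \quad \mu_1 = \nu$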

(Piccoli et al., 2013).
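To illustrate the dynamic formulation in the simplest case, the following sketch moves a unit Dirac mass from `x0` to `x1` along a path $x(t) = x_0 + s(t)(x_1 - x_0)$ with $s(0) = 0$ and $s(1) = 1$, so that $\mu_t = \delta_{x(t)}$ and the action reduces to $\int_0^1 |x'(t)|^2 \, dt$. The function name `kinetic_action`, the two time schedules, and the finite-difference discretization are illustrative assumptions, not taken from the cited papers. The constant-speed schedule should recover $W_2^2(\delta_{x_0}, \delta_{x_1}) = |x_1 - x_0|^2$, while any other schedule costs more.

```python
def kinetic_action(schedule, x0, x1, steps=10_000):
    """Approximate the integral over [0, 1] of |x'(t)|^2 for
    x(t) = x0 + schedule(t) * (x1 - x0), with schedule(0) = 0 and schedule(1) = 1."""
    dt = 1.0 / steps
    total = 0.0
    for k in range(steps):
        # Finite-difference velocity of the moving Dirac mass on [t_k, t_{k+1}].
        s0, s1 = schedule(k * dt), schedule((k + 1) * dt)
        velocity = (s1 - s0) * (x1 - x0) / dt
        total += velocity ** 2 * dt
    return total


x0, x1 = 0.0, 3.0   # hypothetical start and end positions

constant_speed = kinetic_action(lambda t: t, x0, x1)
accelerating = kinetic_action(lambda t: t * t, x0, x1)

print("constant speed:", constant_speed)   # close to |x1 - x0|^2
print("accelerating:  ", accelerating)     # strictly larger
```

The gap between the two values reflects Jensen's inequality: among all schedules with the same endpoints, the one with constant derivative minimizes $\int_0^1 s'(t)^2 \, dt$.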

The Flat Metric

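For arbitrary (possibly unequal mass) Radon measures, the generalized Wasserstein distance $W_1^{1,1}$ (defined in Section 3) coincides with the flat (bounded-Lipschitz) metric: $W_1^{1,1}(\mu, \nu) = \sup \left\{ \int_{\mathcal{X}} f \, d(\mu - \nu) : \|f\|_\infty \le 1, \ \|f\|_{\mathrm{Lip}} \le 1 \right\}$ (Piccoli et al., 2013).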

3. Extensions and Computational Methods

Generalized Wasserstein Distance

For measures $\mu, \nu$ of possibly differing total mass and parameters $a, b > 0$, $p \ge 1$, the generalized Wasserstein distance is defined as

$W_p^{a,b}(\mu, \nu) = \left( \inf_{\tilde\mu, \tilde\nu, \ |\tilde\mu| = |\tilde\nu|} a^p \left( |\mu - \tilde\mu| + |\nu - \tilde\nu| \right)^p + b^p \, W_p^p(\tilde\mu, \tilde\nu) \right)^{1/p}$

where $|\mu - \tilde\mu|$ is the total variation of the "removed" mass, and the infimum is over pairs $(\tilde\mu, \tilde\nu)$ with equal total mass. The $a$ term penalizes creation/removal, $b$ the transport, and $p$ controls aggregation (Piccoli et al., 2013).

Generalized Benamou–Brenier Formula

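A dynamic formulation extends to $W_2^{a,b}$: $\left( W_2^{a,b}(\mu, \nu) \right)^2 = \inf_{(\mu_t, v_t, h_t)} \left\{ a^2 \left( \int_0^1 |h_t| \, dt \right)^2 + b^2 \int_0^1 \int_{\mathbb{R}^d} |v_t|^2 \, d\mu_t \, dt \right\}$, where $h_t$ encodes sources/sinks and the continuity equation has a source term: $\partial_t \mu_t + \nabla \cdot (v_t \mu_t) = h_t$. This subsumes pure mass transport and allows for creation/removal (Piccoli et al., 2013).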

Existence and Homogeneity

$W_p^{a,b}$ is a metric on the cone of nonnegative Radon measures, behaves homogeneously under rescaling of both measures by a common nonnegative factor, and attains its infimum for each pair of measures (Piccoli et al., 2013).

4. Analytical and Practical Properties

Mass Mismatch and Total Variation

When $|\mu| = |\nu|$ and $a \to \infty$, $W_p^{a,b}$ reduces to the pure transport cost $b \, W_p$, and when $b \to \infty$, it reduces to the total variation norm $a \, |\mu - \nu|$. For $p = 1$, $a = b = 1$, the equality of $W_1^{1,1}$ with the flat metric holds (Piccoli et al., 2013). Explicitly, for $\mu = \delta_x$, $\nu = \delta_y$ and $p = 1$,

$W_1^{a,b}(\delta_x, \delta_y) = \min\{ 2a, \ b \, d(x, y) \}$

exhibiting the tradeoff between removal/addition and transportation costs.

Connection to Partial Differential Equations

Wasserstein distances and their generalizations are especially relevant for evolution equations such as the continuity equation with source, where one typically needs to compare measures of variable mass. The $W_p^{a,b}$ framework is adapted to these contexts and yields contraction or stability estimates even for solutions that do not preserve total mass (Piccoli et al., 2013).

Limits and Interpolations

$W_p^{a,b}$ provides a continuous interpolation between the total variation ($L^1$) distance (as $b \to \infty$, penalizing all transport) and, for measures of equal mass, the classical Wasserstein distance (as $a \to \infty$, ruling out creation/removal). This is particularly valuable in applications such as comparing histograms of unequal mass, which are common in imaging and statistical data analysis.

5. Theoretical and Algorithmic Framework

Fenchel–Legendre Duality

The proof of the equivalence between $W_1^{1,1}$ and the flat metric relies on convex analysis and Fenchel–Legendre duality: the sum of convex indicators for the constraints $\|f\|_\infty \le 1$ and $\|f\|_{\mathrm{Lip}} \le 1$ leads, via a theorem of Rockafellar, to a dual representation that exactly matches the primal $W_1^{1,1}$ definition (Piccoli et al., 2013).

Algorithmic Considerations

For $W_2^{a,b}$, the dynamic Benamou–Brenier approach yields an explicit minimization over velocity fields and source terms. The infimum is realized, and the action can be constructed explicitly through "sample-and-hold" schemes that alternate between mass removal, transport, and creation in small time intervals. Convexity and stability under flow are key technical lemmas supporting these constructions.

Examples of Computation

For measures concentrated on points with different masses, the optimal decomposition may entail only mass removal/addition, only transport, or a mixture, determined by the ratio $b \, d(x, y) / (2a)$ of the transport cost to the removal/addition cost per unit mass. For $p = 1$, if $b \, d(x, y) \ge 2a$, it is optimal to remove/add all of the mass; otherwise, it pays to transport part of the mass.
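The trade-off for two point masses can be made concrete for $p = 1$. The sketch below is a minimal illustration under stated assumptions: the measures are $m_1 \delta_x$ and $m_2 \delta_y$ on the real line, mass is only removed from (never added to) each measure before transport, and moving mass $m$ over distance $d$ costs $m \, d$. The function name `generalized_w1_two_diracs` and the parameter names are hypothetical, not from the cited paper.

```python
def generalized_w1_two_diracs(m1, x, m2, y, a=1.0, b=1.0):
    """W_1^{a,b} between m1 * delta_x and m2 * delta_y on the real line (p = 1 only)."""
    dist = abs(x - y)
    m_max = min(m1, m2)  # at most this much mass can be transported
    # Cost of transporting mass m and removing/adding the rest:
    #   a * (m1 - m) + a * (m2 - m) + b * m * dist,   0 <= m <= m_max.
    # The cost is linear in m, so the minimum sits at m = 0 or m = m_max.
    remove_all = a * (m1 + m2)
    transport_max = a * (m1 + m2 - 2.0 * m_max) + b * m_max * dist
    return min(remove_all, transport_max)


# Close points: transport is cheaper than removal/addition.
print(generalized_w1_two_diracs(1.0, 0.0, 1.0, 1.0))
# Distant points: removing and re-creating the mass is cheaper.
print(generalized_w1_two_diracs(1.0, 0.0, 1.0, 5.0))
# Unequal masses: transport the common part, pay for the excess.
print(generalized_w1_two_diracs(1.0, 0.0, 2.0, 1.0))
```

With the default weights, the switch between the two regimes happens at distance $2a / b = 2$, matching the threshold $b \, d(x, y) = 2a$.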

6. Applications and Implications

Imaging, Data Analysis, and Beyond

The generalized Wasserstein metric $W_p^{a,b}$ allows meaningful comparison of data distributions (histograms, point clouds) with mass fluctuations. This is essential in image processing and vision, where illumination or occlusion can alter total mass, and in statistical analysis of data sets with missing data or over-sampling.

PDE Theory and Contractivity

$W_p^{a,b}$ has enabled new existence and stability results for evolution equations with source terms, accommodating solutions where total mass is not preserved, and guaranteeing meaningful contractivity in this extended framework (Piccoli et al., 2013).

Hierarchical Relation to $W_p$

$W_p^{a,b}$ recovers $W_p$ and the total variation metric in limits and thus underlies a unifying theory for purely geometric transport and purely mass error terms.

References:

Piccoli, B. & Rossi, F. "On properties of the Generalized Wasserstein distance" (Piccoli et al., 2013)
Panaretos, V. M. & Zemel, Y. "Statistical Aspects of Wasserstein Distances" (Panaretos et al., 2018)
Benamou, J.-D. & Brenier, Y. "A computational fluid mechanics solution to the Monge–Kantorovich mass transfer problem"
Villani, C. "Optimal Transport: Old and New," Springer

This summary covers the structure, properties, dualities, analytical formulations, and key application domains of the classical and generalized Wasserstein distances as presented in (Piccoli et al., 2013).
