
Wasserstein Gradient Flows (WGF)

Updated 7 February 2026
  • Wasserstein Gradient Flows are continuous-time dynamical systems that characterize the steepest descent evolution of functionals over probability measures via the 2-Wasserstein metric.
  • Discrete-time schemes like the JKO and forward–backward splitting methods provide practical approximations with provable convergence rates under convexity assumptions.
  • This framework bridges optimal transport, PDE analysis, and machine learning, enabling rigorous and scalable optimization in infinite-dimensional spaces.

Wasserstein Gradient Flows (WGF) are continuous-time dynamical systems that characterize the steepest descent evolution of a functional over the space of probability measures endowed with the 2-Wasserstein metric. The WGF framework provides a rigorous, geometrically-intrinsic generalization of gradient descent to infinite-dimensional spaces, with foundational relevance across optimal transport, partial differential equations, and probabilistic machine learning.

1. The 2-Wasserstein Space: Metric, Geometry, and Geodesics

The space of Borel probability measures on ℝᵈ with finite second moments,

$$\mathcal{P}_2(\mathbb{R}^d) := \left\{ \mu \text{ probability measure on } \mathbb{R}^d : \int \|x\|^2\,d\mu(x) < \infty \right\},$$

equipped with the 2-Wasserstein distance,

$$W_2^2(\mu,\nu) := \inf_{\pi\in\Gamma(\mu,\nu)} \int_{\mathbb{R}^d\times \mathbb{R}^d} \|x-y\|^2\,d\pi(x,y),$$

becomes a geodesic metric space, where $\Gamma(\mu,\nu)$ is the set of couplings of $\mu$ and $\nu$. When $\mu$ is absolutely continuous, the optimal transport map is the gradient of a convex function (Brenier's theorem), and constant-speed geodesics can be constructed as pushforwards via interpolated maps: $\mu_t = ((1-t)\mathrm{Id} + t\,T)_{\#}\mu$ for $t\in[0,1]$, where $T$ is the optimal transport map. The geodesic structure is central for defining “steepest descent” in this space (Salim et al., 2020).
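In one dimension the optimal map between atomless measures is the monotone rearrangement, so for equal-size empirical measures both the map and the geodesic act on sorted atoms. A minimal sketch of displacement interpolation under these assumptions (illustrative, not taken from the cited paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two 1D empirical measures with the same number of atoms.
x = np.sort(rng.normal(-2.0, 1.0, size=1000))   # sample of mu
y = np.sort(rng.normal(3.0, 0.5, size=1000))    # sample of nu

# In 1D the optimal map is the monotone rearrangement: after sorting,
# it sends the i-th atom of mu to the i-th atom of nu.
def geodesic(t):
    """Constant-speed W2 geodesic mu_t = ((1-t) Id + t T)_# mu."""
    return (1.0 - t) * x + t * y

def w2(a, b):
    """W2 distance between sorted, equal-size empirical measures."""
    return np.sqrt(np.mean((a - b) ** 2))

# Constant speed: W2(mu_0, mu_t) = t * W2(mu_0, mu_1).
d = w2(x, y)
for t in (0.25, 0.5, 0.75):
    assert abs(w2(x, geodesic(t)) - t * d) < 1e-9
```

The constant-speed property distinguishes the W2 geodesic from, say, the mixture interpolation $(1-t)\mu + t\nu$, which moves no mass at all.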

2. Continuous-Time Formulation: Evolution Equation and Variational Characterization

For a given functional $\mathcal{F} : \mathcal{P}_2 \to (-\infty, +\infty]$, the curve $\mu(t)$ solving the Wasserstein gradient flow is characterized by the Evolution Variational Inequality (EVI):

$$\forall \nu \in \mathcal{P}_2, \quad \frac{d}{dt} W_2^2(\mu(t), \nu) \leq -2 \big[ \mathcal{F}(\mu(t)) - \mathcal{F}(\nu) \big].$$

Under regularity conditions, this is equivalent to a PDE for the density $\rho(t,x)$:

$$\partial_t \rho + \nabla \cdot (\rho v) = 0, \quad v = -\nabla_x \left(\frac{\delta\mathcal{F}}{\delta \rho} \right),$$

where $\frac{\delta\mathcal{F}}{\delta \rho}$ denotes the first variation of $\mathcal{F}$. For example, if $\mathcal{F}(\mu)=\int V\,d\mu + \int \rho\log\rho$, the gradient flow yields the Fokker–Planck equation, a prototypical diffusive evolution (Salim et al., 2020).
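At the particle level, the Fokker–Planck gradient flow corresponds to the Langevin SDE $dX_t = -\nabla V(X_t)\,dt + \sqrt{2}\,dW_t$. A hedged sketch for the quadratic potential $V(x) = x^2/2$, whose equilibrium is $N(0,1)$, using Euler–Maruyama as a simple (slightly biased) time discretization:

```python
import numpy as np

rng = np.random.default_rng(1)

# Potential V(x) = x^2 / 2, so F(mu) = E_mu[V] + entropy is minimized
# by the standard Gaussian N(0, 1), and the WGF of F is the
# Fokker-Planck equation d_t rho = div(rho grad V) + Laplacian(rho).
grad_V = lambda x: x

# Euler-Maruyama discretization of the Langevin SDE
# dX = -grad V(X) dt + sqrt(2) dW, whose law follows the Fokker-Planck flow.
step = 0.02
X = rng.uniform(-5.0, 5.0, size=20_000)      # particles, far from equilibrium
for _ in range(1_000):
    X = X - step * grad_V(X) + np.sqrt(2 * step) * rng.normal(size=X.shape)

# Empirical moments approach those of N(0, 1) (up to O(step) bias).
print(X.mean(), X.var())
```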

3. Discrete-Time Schemes: JKO and Forward-Backward Splitting

The canonical time-discretization of WGF is the Jordan–Kinderlehrer–Otto (JKO) implicit Euler scheme:

$$\mu_{n+1} \in \arg\min_{\mu \in \mathcal{P}_2} \left\{ \mathcal{F}(\mu) + \frac{1}{2\gamma} W_2^2(\mu, \mu_n) \right\}.$$

This yields a sequence whose piecewise-constant interpolation converges to the continuous WGF as $\gamma \to 0$.
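For a pure potential energy $\mathcal{F}(\mu) = \int V\,d\mu$, the JKO step decouples across atoms: it equals the pushforward of $\mu_n$ by the Euclidean proximal map of $V$. A sketch with $V(x) = x^2/2$, where that proximal map has a closed form (an illustrative choice, not the paper's example):

```python
import numpy as np

rng = np.random.default_rng(2)
gamma = 0.5

# For F(mu) = ∫ V dmu, the JKO step acts per particle: each atom x
# solves min_y V(y) + |y - x|^2 / (2 gamma), i.e.
# mu_{n+1} = (prox_{gamma V})_# mu_n.
# For V(x) = x^2 / 2 the proximal map is x / (1 + gamma).
prox_V = lambda x: x / (1.0 + gamma)

X = rng.normal(4.0, 2.0, size=10_000)   # atoms of the initial measure mu_0
for _ in range(40):
    X = prox_V(X)                       # one implicit-Euler (JKO) step

# The flow contracts toward the minimizer of V: the Dirac mass at 0.
print(X.mean(), X.std())
```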

When the objective functional decomposes as $\mathcal{F} = \mathcal{U} + \mathcal{G}$ with $\mathcal{U}$ smooth and $\mathcal{G}$ possibly nonsmooth but geodesically convex, the Forward–Backward (FB) proximal-gradient algorithm over $\mathcal{P}_2$ is defined as:

  • Forward (gradient) step for $\mathcal{U}$: $\nu_{n+1} := (\mathrm{Id} - \gamma \nabla F)_{\#} \mu_n$, where $F$ is the potential of the smooth part, $\mathcal{U}(\mu) = \int F\,d\mu$;
  • Backward (proximal) step for $\mathcal{G}$: $\mu_{n+1} \in \arg\min_\mu \left\{ \mathcal{G}(\mu) + \frac{1}{2\gamma} W_2^2(\mu, \nu_{n+1}) \right\} =: \mathrm{Prox}_{\gamma \mathcal{G}}(\nu_{n+1})$,

mirroring the classical Euclidean proximal-point framework. Here, $\mathrm{Prox}_{\gamma \mathcal{G}}$ is a JKO step for $\mathcal{G}$ only (Salim et al., 2020).
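As a concrete (illustrative) composite instance, take $\mathcal{U}(\mu) = \int V\,d\mu$ and let $\mathcal{G}$ be the indicator of measures supported in a convex set $K$; the Wasserstein proximal map of this $\mathcal{G}$ is the pushforward by Euclidean projection onto $K$, so both FB steps act particle-wise:

```python
import numpy as np

rng = np.random.default_rng(3)
gamma = 0.2

# Illustrative composite objective (not the paper's example):
# U(mu) = E_mu[V] with V(x) = (x - 3)^2 / 2, and G(mu) the indicator of
# {mu supported in K} for K = [-1, 1]. For this G the Wasserstein
# proximal map is the pushforward by Euclidean projection onto K.
grad_V = lambda x: x - 3.0
prox_G = lambda x: np.clip(x, -1.0, 1.0)   # projection onto K = [-1, 1]

X = rng.normal(-4.0, 1.0, size=10_000)     # atoms representing mu_0
for _ in range(100):
    X = X - gamma * grad_V(X)              # forward: pushforward by Id - gamma grad V
    X = prox_G(X)                          # backward: W2-prox of the constraint

# V pulls mass toward 3, the constraint caps it at 1, so all atoms
# settle at the boundary point x = 1.
print(X.min(), X.max())
```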

4. Convergence Theory for Proximal Splitting and Rates

Suppose $F$ is $L$-smooth and $\lambda$-strongly convex, and $\mathcal{G}$ is proper, lower semicontinuous, and convex along generalized geodesics. If $\gamma < 1/L$, the FB scheme satisfies a discrete EVI:

$$W_2^2(\mu_{n+1}, \mu_*) \leq (1-\gamma\lambda)\, W_2^2(\mu_n, \mu_*) - 2\gamma \big[\mathcal{F}(\mu_{n+1}) - \mathcal{F}(\mu_*)\big].$$

  • If $\lambda=0$: $\mathcal{F}(\mu_n)-\mathcal{F}(\mu_*) = O(1/(\gamma n))$ (sublinear rate).
  • If $\lambda>0$: $W_2^2(\mu_n, \mu_*) \leq (1-\gamma\lambda)^n\, W_2^2(\mu_0, \mu_*)$ (linear convergence).
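The $\lambda > 0$ rate can be checked exactly in a toy case: with $\mathcal{G} = 0$ and the quadratic potential $F(x) = \lambda x^2/2$, the FB scheme reduces to a pointwise gradient step, and for Dirac masses $W_2(\delta_x, \delta_y) = |x - y|$ contracts by exactly $(1 - \gamma\lambda)$ per iteration, consistent with the squared-distance bound above:

```python
# Toy check of the linear rate: G = 0, quadratic potential
# F(x) = lambda_ * x^2 / 2, measures are Dirac masses, so
# W2(delta_x, delta_y) = |x - y| and one FB step is a gradient step.
lambda_, gamma = 2.0, 0.25          # gamma < 1/L with L = lambda_
step = lambda x: x - gamma * lambda_ * x

x, y = 5.0, -3.0                    # positions of two Dirac initial measures
w2 = abs(x - y)
for n in range(10):
    x, y = step(x), step(y)
    w2 *= (1.0 - gamma * lambda_)   # predicted contraction per step
    assert abs(abs(x - y) - w2) < 1e-12
```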

This result establishes WGF-FB as an infinite-dimensional analog of the proximal gradient method, retaining convergence guarantees familiar from convex Euclidean optimization (Salim et al., 2020).

5. Practical Implementation, Computational Aspects, and Examples

Continuous-time WGF enjoys exact decay rates, while discrete-time schemes (JKO, FB) match these rates up to step-size constraints. The main numerical challenge is evaluating the proximal map (the JKO subproblem), which, depending on $\mathcal{G}$, may admit:

  • Closed-form solutions (e.g., negative entropy/heat flow),
  • PDE-based solvers (for more complex energies),
  • Entropic regularization or Sinkhorn algorithms for approximation.

FB splitting reduces the implicit computation to the $\mathcal{G}$ part only, with the $\mathcal{U}$ part handled by a simple pushforward. In the canonical quadratic-plus-entropy example (sampling from a Gaussian), each FB step maintains Gaussianity, and closed-form recursions for mean and covariance yield linear $W_2$-convergence. Particle-based (sample-wise) pushforward strategies with optional heat flow accurately reflect continuous-time contraction, even in high dimensions (Salim et al., 2020).
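A 1D sketch of the quadratic-plus-entropy recursion (the closed forms below are derived for this toy instance using the 1D identity $W_2(N(m,v), N(m,v')) = |\sqrt{v} - \sqrt{v'}|$, not quoted from the paper): with $V(x) = (x-b)^2/(2\sigma^2)$ and $\mathcal{G}$ the negative entropy, the FB iterates stay Gaussian and converge to the target $N(b, \sigma^2)$.

```python
import numpy as np

# Target N(b, sigma^2), i.e. F(mu) = E_mu[V] + ∫ rho log rho with
# V(x) = (x - b)^2 / (2 sigma^2). Track mean m and std s of the iterate.
b, sigma, gamma = 1.0, 2.0, 0.5

def fb_step(m, s):
    # Forward: pushforward by Id - gamma V', which is affine in x.
    m = m - gamma * (m - b) / sigma**2
    s = abs(1.0 - gamma / sigma**2) * s
    # Backward: JKO step of the negative entropy, which keeps the mean
    # and solves min_{s'} -log s' + (s' - s)^2 / (2 gamma), giving the
    # positive root of s'^2 - s s' - gamma = 0.
    s = 0.5 * (s + np.sqrt(s**2 + 4.0 * gamma))
    return m, s

m, s = -3.0, 0.1
for _ in range(200):
    m, s = fb_step(m, s)

print(m, s)   # approaches the target mean b = 1.0 and std sigma = 2.0
```

One can verify that $(m, s) = (b, \sigma)$ is a fixed point of `fb_step`, matching the unbiasedness of the FB scheme in this Gaussian setting.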

6. Extensions, Applications, and Open Directions

The FB splitting framework for Wasserstein gradient flows enables:

  • Handling composite objectives with both smooth and nonsmooth contributions,
  • Direct generalization from Euclidean optimization,
  • Provable convergence under geodesic convexity,
  • Scalability to high dimensions when approximate or closed-form JKO operators are available.

Ongoing research targets efficient algorithms for more general energy landscapes (including non-convex energies, non-Euclidean underlying domains), adaptive schemes, high-dimensional and large-scale applications, and connections to stochastic optimization and sampling (Salim et al., 2020).

Table: Summary of Classical vs. Proximal-Splitting WGF Schemes

| Method | Iteration Definition | Complexity per Step |
|---|---|---|
| JKO | $\mu_{n+1} \leftarrow \arg\min_{\mu} \mathcal{F}(\mu) + \frac{1}{2\gamma}W_2^2(\mu, \mu_n)$ | Full proximal step (often hard/expensive) |
| FB-Splitting | Pushforward by $\mathrm{Id} - \gamma\nabla F$, then $\mathrm{Prox}_{\gamma\mathcal{G}}$ | Cheaper: implicit step for $\mathcal{G}$ only |

The Wasserstein Proximal Gradient framework thus defines and analyzes an efficient and theoretically well-founded approach to composite optimization over the space of measures, with direct applicability to variational inference, sampling, and PDE evolution models (Salim et al., 2020).

References

  • Salim, A., Korba, A., and Luise, G. (2020). "The Wasserstein Proximal Gradient Algorithm." Advances in Neural Information Processing Systems 33 (NeurIPS 2020).
