
Wasserstein Gradient Flows (WGF)

Updated 7 February 2026
  • Wasserstein Gradient Flows are continuous-time dynamical systems that characterize the steepest descent evolution of functionals over probability measures via the 2-Wasserstein metric.
  • Discrete-time schemes like the JKO and forward–backward splitting methods provide practical approximations with provable convergence rates under convexity assumptions.
  • This framework bridges optimal transport, PDE analysis, and machine learning, enabling rigorous and scalable optimization in infinite-dimensional spaces.

Wasserstein Gradient Flows (WGF) are continuous-time dynamical systems that characterize the steepest descent evolution of a functional over the space of probability measures endowed with the 2-Wasserstein metric. The WGF framework provides a rigorous, geometrically-intrinsic generalization of gradient descent to infinite-dimensional spaces, with foundational relevance across optimal transport, partial differential equations, and probabilistic machine learning.

1. The 2-Wasserstein Space: Metric, Geometry, and Geodesics

The space of Borel probability measures on ℝᵈ with finite second moments,

$$\mathcal{P}_2(\mathbb{R}^d) := \left\{ \mu \text{ probability measure on } \mathbb{R}^d : \int \|x\|^2\,d\mu(x) < \infty \right\},$$

equipped with the 2-Wasserstein distance,

$$W_2^2(\mu,\nu) := \inf_{\pi\in\Gamma(\mu,\nu)} \int_{\mathbb{R}^d\times \mathbb{R}^d} \|x-y\|^2\,d\pi(x,y),$$

becomes a geodesic metric space, where $\Gamma(\mu,\nu)$ is the set of couplings of $\mu$ and $\nu$. When $\mu$ is absolutely continuous, the optimal transport map is the gradient of a convex function (Brenier's theorem), and constant-speed geodesics can be constructed as pushforwards via interpolated maps: $\mu_t = ((1-t)\mathrm{Id} + t\,T)_{\#}\mu$ for $t\in[0,1]$, where $T$ is the optimal transport map. The geodesic structure is central for defining “steepest descent” in this space (Salim et al., 2020).
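In one dimension the optimal map between atomless measures is the monotone rearrangement, so for equal-size empirical measures both the map and the geodesic act on sorted atoms. A minimal sketch of displacement interpolation under these assumptions (illustrative, not taken from the cited paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two 1D empirical measures with the same number of atoms.
x = np.sort(rng.normal(-2.0, 1.0, size=1000))   # sample of mu
y = np.sort(rng.normal(3.0, 0.5, size=1000))    # sample of nu

# In 1D the optimal map is the monotone rearrangement: after sorting,
# it sends the i-th atom of mu to the i-th atom of nu.
def geodesic(t):
    """Constant-speed W2 geodesic mu_t = ((1-t) Id + t T)_# mu."""
    return (1.0 - t) * x + t * y

def w2(a, b):
    """W2 distance between sorted, equal-size empirical measures."""
    return np.sqrt(np.mean((a - b) ** 2))

# Constant speed: W2(mu_0, mu_t) = t * W2(mu_0, mu_1).
d = w2(x, y)
for t in (0.25, 0.5, 0.75):
    assert abs(w2(x, geodesic(t)) - t * d) < 1e-9
```

The constant-speed property distinguishes the W2 geodesic from, say, the mixture interpolation $(1-t)\mu + t\nu$, which moves no mass at all.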

2. Continuous-Time Formulation: Evolution Equation and Variational Characterization

For a given functional $\mathcal{F} : \mathcal{P}_2 \to (-\infty, +\infty]$, the curve $\mu(t)$ solving the Wasserstein gradient flow is characterized by the Evolution Variational Inequality (EVI):

$$\forall \nu \in \mathcal{P}_2, \quad \frac{d}{dt} W_2^2(\mu(t), \nu) \leq -2 \big[ \mathcal{F}(\mu(t)) - \mathcal{F}(\nu) \big].$$

Under regularity conditions, this is equivalent to a PDE for the density $\rho(t,x)$:

$$\partial_t \rho + \nabla \cdot (\rho v) = 0, \quad v = -\nabla_x \left(\frac{\delta\mathcal{F}}{\delta \rho} \right),$$

where $\frac{\delta\mathcal{F}}{\delta \rho}$ denotes the first variation of $\mathcal{F}$. For example, if $\mathcal{F}(\mu)=\int V\,d\mu + \int \rho\log\rho$, the gradient flow yields the Fokker–Planck equation, a prototypical diffusive evolution (Salim et al., 2020).
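At the particle level, the Fokker–Planck gradient flow corresponds to the Langevin SDE $dX_t = -\nabla V(X_t)\,dt + \sqrt{2}\,dW_t$. A hedged sketch for the quadratic potential $V(x) = x^2/2$, whose equilibrium is $N(0,1)$, using Euler–Maruyama as a simple (slightly biased) time discretization:

```python
import numpy as np

rng = np.random.default_rng(1)

# Potential V(x) = x^2 / 2, so F(mu) = E_mu[V] + entropy is minimized
# by the standard Gaussian N(0, 1), and the WGF of F is the
# Fokker-Planck equation d_t rho = div(rho grad V) + Laplacian(rho).
grad_V = lambda x: x

# Euler-Maruyama discretization of the Langevin SDE
# dX = -grad V(X) dt + sqrt(2) dW, whose law follows the Fokker-Planck flow.
step = 0.02
X = rng.uniform(-5.0, 5.0, size=20_000)      # particles, far from equilibrium
for _ in range(1_000):
    X = X - step * grad_V(X) + np.sqrt(2 * step) * rng.normal(size=X.shape)

# Empirical moments approach those of N(0, 1) (up to O(step) bias).
print(X.mean(), X.var())
```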

3. Discrete-Time Schemes: JKO and Forward-Backward Splitting

The canonical time-discretization of WGF is the Jordan–Kinderlehrer–Otto (JKO) implicit Euler scheme:

$$\mu_{n+1} \in \arg\min_{\mu \in \mathcal{P}_2} \left\{ \mathcal{F}(\mu) + \frac{1}{2\gamma} W_2^2(\mu, \mu_n) \right\}.$$

This yields a sequence whose piecewise-constant interpolation converges to the continuous WGF as $\gamma \to 0$.
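For a pure potential energy $\mathcal{F}(\mu) = \int V\,d\mu$, the JKO step decouples across atoms: it equals the pushforward of $\mu_n$ by the Euclidean proximal map of $V$. A sketch with $V(x) = x^2/2$, where that proximal map has a closed form (an illustrative choice, not the paper's example):

```python
import numpy as np

rng = np.random.default_rng(2)
gamma = 0.5

# For F(mu) = ∫ V dmu, the JKO step acts per particle: each atom x
# solves min_y V(y) + |y - x|^2 / (2 gamma), i.e.
# mu_{n+1} = (prox_{gamma V})_# mu_n.
# For V(x) = x^2 / 2 the proximal map is x / (1 + gamma).
prox_V = lambda x: x / (1.0 + gamma)

X = rng.normal(4.0, 2.0, size=10_000)   # atoms of the initial measure mu_0
for _ in range(40):
    X = prox_V(X)                       # one implicit-Euler (JKO) step

# The flow contracts toward the minimizer of V: the Dirac mass at 0.
print(X.mean(), X.std())
```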

When the objective functional decomposes as $\mathcal{F} = \mathcal{U} + \mathcal{G}$ with $\mathcal{U}$ smooth and $\mathcal{G}$ possibly nonsmooth but geodesically convex, the Forward–Backward (FB) proximal-gradient algorithm over $\mathcal{P}_2$ is defined as:

  • Forward (gradient) step for $\mathcal{U}$: $\nu_{n+1} := (\mathrm{Id} - \gamma \nabla F)_{\#} \mu_n$, where $F$ is the potential of the smooth part, $\mathcal{U}(\mu) = \int F\,d\mu$;
  • Backward (proximal) step for $\mathcal{G}$: $\mu_{n+1} \in \arg\min_\mu \left\{ \mathcal{G}(\mu) + \frac{1}{2\gamma} W_2^2(\mu, \nu_{n+1}) \right\} =: \mathrm{Prox}_{\gamma \mathcal{G}}(\nu_{n+1})$,

mirroring the classical Euclidean proximal-point framework. Here, $\mathrm{Prox}_{\gamma \mathcal{G}}$ is a JKO step for $\mathcal{G}$ only (Salim et al., 2020).
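As a concrete (illustrative) composite instance, take $\mathcal{U}(\mu) = \int V\,d\mu$ and let $\mathcal{G}$ be the indicator of measures supported in a convex set $K$; the Wasserstein proximal map of this $\mathcal{G}$ is the pushforward by Euclidean projection onto $K$, so both FB steps act particle-wise:

```python
import numpy as np

rng = np.random.default_rng(3)
gamma = 0.2

# Illustrative composite objective (not the paper's example):
# U(mu) = E_mu[V] with V(x) = (x - 3)^2 / 2, and G(mu) the indicator of
# {mu supported in K} for K = [-1, 1]. For this G the Wasserstein
# proximal map is the pushforward by Euclidean projection onto K.
grad_V = lambda x: x - 3.0
prox_G = lambda x: np.clip(x, -1.0, 1.0)   # projection onto K = [-1, 1]

X = rng.normal(-4.0, 1.0, size=10_000)     # atoms representing mu_0
for _ in range(100):
    X = X - gamma * grad_V(X)              # forward: pushforward by Id - gamma grad V
    X = prox_G(X)                          # backward: W2-prox of the constraint

# V pulls mass toward 3, the constraint caps it at 1, so all atoms
# settle at the boundary point x = 1.
print(X.min(), X.max())
```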

4. Convergence Theory for Proximal Splitting and Rates

Suppose $F$ is $L$-smooth and $\lambda$-strongly convex, and $\mathcal{G}$ is proper, lower semicontinuous, and convex along generalized geodesics. If $\gamma < 1/L$, the FB scheme satisfies a discrete EVI:

$$W_2^2(\mu_{n+1}, \mu_*) \leq (1-\gamma\lambda)\, W_2^2(\mu_n, \mu_*) - 2\gamma \big[\mathcal{F}(\mu_{n+1}) - \mathcal{F}(\mu_*)\big].$$

  • If $\lambda=0$: $\mathcal{F}(\mu_n)-\mathcal{F}(\mu_*) = O(1/(\gamma n))$ (sublinear rate).
  • If $\lambda>0$: $W_2^2(\mu_n, \mu_*) \leq (1-\gamma\lambda)^n\, W_2^2(\mu_0, \mu_*)$ (linear convergence).
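The $\lambda > 0$ rate can be checked exactly in a toy case: with $\mathcal{G} = 0$ and the quadratic potential $F(x) = \lambda x^2/2$, the FB scheme reduces to a pointwise gradient step, and for Dirac masses $W_2(\delta_x, \delta_y) = |x - y|$ contracts by exactly $(1 - \gamma\lambda)$ per iteration, consistent with the squared-distance bound above:

```python
# Toy check of the linear rate: G = 0, quadratic potential
# F(x) = lambda_ * x^2 / 2, measures are Dirac masses, so
# W2(delta_x, delta_y) = |x - y| and one FB step is a gradient step.
lambda_, gamma = 2.0, 0.25          # gamma < 1/L with L = lambda_
step = lambda x: x - gamma * lambda_ * x

x, y = 5.0, -3.0                    # positions of two Dirac initial measures
w2 = abs(x - y)
for n in range(10):
    x, y = step(x), step(y)
    w2 *= (1.0 - gamma * lambda_)   # predicted contraction per step
    assert abs(abs(x - y) - w2) < 1e-12
```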

This result establishes WGF-FB as an infinite-dimensional analog of the proximal gradient method, retaining convergence guarantees familiar from convex Euclidean optimization (Salim et al., 2020).

5. Practical Implementation, Computational Aspects, and Examples

Continuous-time WGF enjoys exact decay rates, while discrete-time schemes (JKO, FB) match these rates up to step-size constraints. The main numerical challenge is evaluating the proximal map (the JKO subproblem), which, depending on $\mathcal{G}$, may admit:

  • Closed-form solutions (e.g., negative entropy/heat flow),
  • PDE-based solvers (for more complex energies),
  • Entropic regularization or Sinkhorn algorithms for approximation.

FB splitting reduces the implicit computation to the $\mathcal{G}$ part only, with the $\mathcal{U}$ part handled by a simple pushforward. In the canonical quadratic-plus-entropy example (sampling from a Gaussian), each FB step maintains Gaussianity, and closed-form recursions for mean and covariance yield linear $W_2$-convergence. Particle-based (sample-wise) pushforward strategies with optional heat flow accurately reflect continuous-time contraction, even in high dimensions (Salim et al., 2020).
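A 1D sketch of the quadratic-plus-entropy recursion (the closed forms below are derived for this toy instance using the 1D identity $W_2(N(m,v), N(m,v')) = |\sqrt{v} - \sqrt{v'}|$, not quoted from the paper): with $V(x) = (x-b)^2/(2\sigma^2)$ and $\mathcal{G}$ the negative entropy, the FB iterates stay Gaussian and converge to the target $N(b, \sigma^2)$.

```python
import numpy as np

# Target N(b, sigma^2), i.e. F(mu) = E_mu[V] + ∫ rho log rho with
# V(x) = (x - b)^2 / (2 sigma^2). Track mean m and std s of the iterate.
b, sigma, gamma = 1.0, 2.0, 0.5

def fb_step(m, s):
    # Forward: pushforward by Id - gamma V', which is affine in x.
    m = m - gamma * (m - b) / sigma**2
    s = abs(1.0 - gamma / sigma**2) * s
    # Backward: JKO step of the negative entropy, which keeps the mean
    # and solves min_{s'} -log s' + (s' - s)^2 / (2 gamma), giving the
    # positive root of s'^2 - s s' - gamma = 0.
    s = 0.5 * (s + np.sqrt(s**2 + 4.0 * gamma))
    return m, s

m, s = -3.0, 0.1
for _ in range(200):
    m, s = fb_step(m, s)

print(m, s)   # approaches the target mean b = 1.0 and std sigma = 2.0
```

One can verify that $(m, s) = (b, \sigma)$ is a fixed point of `fb_step`, matching the unbiasedness of the FB scheme in this Gaussian setting.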

6. Extensions, Applications, and Open Directions

The FB splitting framework for Wasserstein gradient flows enables:

  • Handling composite objectives with both smooth and nonsmooth contributions,
  • Direct generalization from Euclidean optimization,
  • Provable convergence under geodesic convexity,
  • Scalability to high dimensions when approximate or closed-form JKO operators are available.

Ongoing research targets efficient algorithms for more general energy landscapes (including non-convex energies, non-Euclidean underlying domains), adaptive schemes, high-dimensional and large-scale applications, and connections to stochastic optimization and sampling (Salim et al., 2020).

Table: Summary of Classical vs. Proximal-Splitting WGF Schemes

| Method | Iteration Definition | Complexity per Step |
|---|---|---|
| JKO | $\mu_{n+1} \leftarrow \arg\min_{\mu} \mathcal{F}(\mu) + \frac{1}{2\gamma}W_2^2(\mu, \mu_n)$ | Full proximal step (often hard/expensive) |
| FB-Splitting | Pushforward by $\mathrm{Id} - \gamma\nabla F$, then $\mathrm{Prox}_{\gamma\mathcal{G}}$ | Cheaper: implicit step for $\mathcal{G}$ only |

The Wasserstein Proximal Gradient framework thus defines and analyzes an efficient and theoretically well-founded approach to composite optimization over the space of measures, with direct applicability to variational inference, sampling, and PDE evolution models (Salim et al., 2020).

References

  • Salim, A., Korba, A., and Luise, G. (2020). "The Wasserstein Proximal Gradient Algorithm." Advances in Neural Information Processing Systems 33 (NeurIPS 2020).
