Wasserstein-Fisher-Rao Gradient Flow
- Wasserstein-Fisher-Rao gradient flow is a method that integrates mass transport (Wasserstein) and mass reweighting (Fisher-Rao) to optimize sampling and inference in complex, multimodal distributions.
- It employs PDE representations, weighted SDE schemes, and operator splitting methods to provide rigorous convergence guarantees and scalable algorithmic implementations.
- Practical algorithms using particle methods and splitting strategies overcome limitations of classical diffusion-based samplers by efficiently navigating nonconvex probability landscapes, with exponential contraction guarantees in log-concave settings.
The Wasserstein-Fisher-Rao (WFR) Gradient Flow Algorithm is a modern approach to generative modeling, sampling, and inference that generalizes classical Wasserstein gradient flows by incorporating both mass transport and mass variation dynamics. This framework extends the underlying geometry from the balanced optimal transport setting (Wasserstein only) to an unbalanced regime where the Fisher–Rao geometry governs creation and destruction of mass, enhancing the capacity to efficiently sample and optimize in multimodal or nonconvex probability landscapes. The WFR gradient flow admits rigorous partial differential equation (PDE) and stochastic process representations, sharp convergence guarantees, and several practical operator-splitting and Monte Carlo algorithms suitable for both theoretical investigation and scalable implementation.
1. Mathematical Structure of the WFR Gradient Flow
WFR geometry equips the space of positive measures on $\mathbb{R}^d$ with a dynamic Riemannian metric blending the 2-Wasserstein (mass transport) and Fisher–Rao (mass reweighting) contributions. The squared WFR distance between two measures $\mu_0, \mu_1$ is defined by

$$
\mathrm{WFR}^2(\mu_0, \mu_1) \;=\; \inf_{(\mu_t, v_t, \alpha_t)} \int_0^1 \int_{\mathbb{R}^d} \left( \|v_t(x)\|^2 + \alpha_t(x)^2 \right) \mathrm{d}\mu_t(x)\,\mathrm{d}t,
$$

subject to the continuity-reaction equation:

$$
\partial_t \mu_t + \nabla \cdot (\mu_t v_t) \;=\; \mu_t \alpha_t,
$$

where $v_t, \alpha_t$ are time-dependent velocity and reaction fields, and $(\mu_t)_{t \in [0,1]}$ is a path from $\mu_0$ to $\mu_1$ (Rahimi, 19 Dec 2025, Crucinio et al., 6 Jun 2025).
The induced Riemannian metric on the tangent space at $\mu$ is

$$
\langle (v, \alpha), (v', \alpha') \rangle_\mu \;=\; \int_{\mathbb{R}^d} \left( v \cdot v' + \alpha\,\alpha' \right) \mathrm{d}\mu,
$$

which is an $L^2(\mu)$-type metric.
The gradient flow of the Kullback–Leibler divergence functional $\mathrm{KL}(\mu\,\|\,\pi)$ in the WFR geometry is governed by the PDE

$$
\partial_t \mu_t \;=\; \nabla \cdot \left( \mu_t \nabla \log \frac{\mu_t}{\pi} \right) \;-\; \mu_t \left( \log \frac{\mu_t}{\pi} - \mathbb{E}_{\mu_t}\!\left[ \log \frac{\mu_t}{\pi} \right] \right),
$$

where $\pi \propto e^{-V}$ is the target distribution, and the centering term $\mathbb{E}_{\mu_t}[\log(\mu_t/\pi)]$ preserves unit mass when $\mu_t$ is a probability measure (Rahimi, 19 Dec 2025, Crucinio et al., 22 Nov 2025).
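Both the transport and reaction terms can be read off from the first variation of the objective; the following short computation (a standard calculation, included here for orientation) makes the correspondence explicit:

$$
F(\mu) \;=\; \mathrm{KL}(\mu\,\|\,\pi) \;=\; \int \log\frac{\mu}{\pi}\,\mathrm{d}\mu, \qquad \frac{\delta F}{\delta \mu} \;=\; \log\frac{\mu}{\pi} + 1.
$$

The Wasserstein part transports mass along $-\nabla\frac{\delta F}{\delta\mu} = -\nabla\log\frac{\mu}{\pi}$ (the additive constant vanishes under the gradient), while the Fisher–Rao part reacts with rate $-\bigl(\frac{\delta F}{\delta\mu} - \mathbb{E}_\mu[\frac{\delta F}{\delta\mu}]\bigr) = -\bigl(\log\frac{\mu}{\pi} - \mathbb{E}_\mu[\log\frac{\mu}{\pi}]\bigr)$ (the constant vanishes under centering).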
2. Algorithmic Realizations: Weighted SDE and Operator Splitting
The WFR gradient flow admits both weighted stochastic differential equation (SDE) and operator-splitting (Trotter) discretizations.
Weighted SDE Scheme
The Feynman–Kac representation describes the evolution via $N$ interacting weighted particles $(X_t^i, w_t^i)$:
- $X_t^i$: time evolution via the SDE $\mathrm{d}X_t^i = -\nabla V(X_t^i)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t^i$, with $X_0^i \sim \mu_0$,
- $w_t^i$: weight evolution $\mathrm{d}w_t^i = -w_t^i \bigl( g_t(X_t^i) - \sum_j \bar w_t^j\, g_t(X_t^j) \bigr)\,\mathrm{d}t$, where $g_t = \log(\mu_t/\pi)$ and $\bar w_t^j = w_t^j / \sum_k w_t^k$ are the normalized weights,
- Empirical observables are approximated via $\mathbb{E}_{\mu_t}[f] \approx \sum_i \bar w_t^i f(X_t^i)$,

with convergence to the WFR flow as $N \to \infty$ (Rahimi, 19 Dec 2025).
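As a concrete illustration, the scheme above can be sketched as a self-normalized particle system in one dimension; the Gaussian kernel density estimate of $\log\mu_t$, the bandwidth, and the step sizes below are illustrative choices, not prescribed by the cited papers:

```python
import numpy as np

def wfr_weighted_sde(grad_V, log_pi, x0, n_steps=200, dt=0.01, bw=0.3, rng=None):
    """Weighted-SDE WFR sketch (1D): Langevin transport plus replicator weights.

    log(mu_t) is estimated with a Gaussian KDE of bandwidth bw; additive
    constants in the estimate cancel in the centered reaction term.
    """
    rng = np.random.default_rng(rng)
    x = x0.astype(float).copy()     # particle positions X_t^i
    logw = np.zeros(len(x))         # log-weights log w_t^i
    for _ in range(n_steps):
        # transport: Euler-Maruyama step of dX = -grad_V(X) dt + sqrt(2) dW
        x += -grad_V(x) * dt + np.sqrt(2 * dt) * rng.standard_normal(len(x))
        # reaction: g = log(mu_t / pi), with mu_t estimated by KDE
        d2 = (x[:, None] - x[None, :]) ** 2
        log_mu = np.log(np.exp(-d2 / (2 * bw**2)).mean(axis=1) + 1e-300)
        g = log_mu - log_pi(x)
        w = np.exp(logw - logw.max()); w /= w.sum()
        logw += -dt * (g - np.sum(w * g))   # centered replicator update
    w = np.exp(logw - logw.max()); w /= w.sum()
    return x, w

# usage: standard normal target (V(x) = x^2/2), particles started at N(2, 1)
rng = np.random.default_rng(0)
x, w = wfr_weighted_sde(grad_V=lambda x: x,
                        log_pi=lambda x: -0.5 * x**2,
                        x0=rng.normal(2.0, 1.0, 500), rng=1)
```

The weighted empirical moments `np.sum(w * x)` then approximate the target moments as the flow equilibrates.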
Operator Splitting (W-FR and FR-W)
Time discretization via splitting alternates pure-Wasserstein (W) and pure-Fisher-Rao (FR) updates:
- W step: $\partial_t \mu_t = \nabla \cdot \bigl( \mu_t \nabla \log \frac{\mu_t}{\pi} \bigr)$ (pure transport: a Fokker–Planck/Langevin flow),
- FR step: $\partial_t \mu_t = -\mu_t \bigl( \log \frac{\mu_t}{\pi} - \mathbb{E}_{\mu_t}\bigl[\log \frac{\mu_t}{\pi}\bigr] \bigr)$ (pure reaction: a replicator flow).
The two possible orderings (W–FR, FR–W) have distinctive splitting errors and practical implications: W–FR disperses mass before reweighting, FR–W sharpens/weights before transporting. Convergence rates and practical behavior can depend significantly on operator ordering and step size (Crucinio et al., 22 Nov 2025, Crucinio et al., 6 Jun 2025, Halder et al., 2017).
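A useful fact about the splitting is that the FR substep can be integrated exactly over a step of length $h$: writing the replicator equation in log-density form and solving (a short computation, with normalization absorbed into the proportionality) gives the geometric interpolation

$$
\mu_{t+h} \;\propto\; \mu_t^{\,e^{-h}}\, \pi^{\,1-e^{-h}},
$$

so on a weighted particle system the FR substep amounts to the power-posterior reweighting $w^i \leftarrow w^i \bigl(\pi(X^i)/\hat\mu_t(X^i)\bigr)^{1-e^{-h}}$, where $\hat\mu_t$ is an estimate of the current density.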
3. Theoretical Properties and Convergence Guarantees
When the target $\pi \propto e^{-V}$ is strongly log-concave (i.e., $V$ is $\lambda$-convex), the KL objective is $\lambda$-geodesically convex in the WFR metric, yielding exponential contraction of the flow:

$$
\mathrm{WFR}(\mu_t, \pi) \;\le\; e^{-\lambda t}\, \mathrm{WFR}(\mu_0, \pi)
$$

(Rahimi, 19 Dec 2025, Crucinio et al., 22 Nov 2025).
A sharp decay estimate holds for the symmetrized Kullback–Leibler (Jeffreys) divergence $J(\mu\,\|\,\pi) = \mathrm{KL}(\mu\,\|\,\pi) + \mathrm{KL}(\pi\,\|\,\mu)$:

$$
J(\mu_t\,\|\,\pi) \;\le\; e^{-t}\, J(\mu_0\,\|\,\pi)
$$

(Crucinio et al., 22 Nov 2025).
For more general, potentially non-log-concave or multimodal targets, the reaction (mass reweighting) term facilitates efficient exploration and mixing by redistributing mass across potential barriers, addressing the well-known metastability limitations of classical Langevin diffusions (Rahimi, 19 Dec 2025, Crucinio et al., 6 Jun 2025). Empirical results show superior convergence in Gaussian and certain multimodal scenarios, and formal discrete- and continuous-time bounds establish rates matching or exceeding those of standard diffusion-based algorithms.
4. Practical Algorithms: SMC, Particle Methods, and Filtering
Particle-based discretizations (Sequential Monte Carlo, SMC) and interacting particle systems provide scalable numerical implementations:
- SMC-WFR alternates Langevin moves (for W) and importance-weighting/reweighting (for FR), with empirical weights reflecting the relative likelihood under the target $\pi$ and the current particle approximation of $\mu_t$.
- At each iteration, the W-step (e.g., an unadjusted Langevin update) is followed by an FR-step via power-posterior reweighting and resampling, maintaining particle diversity and controlling weight degeneracy.
- Complexity is $O(N^2)$ per time step due to pairwise weight calculations, but empirical performance in controlled multimodal settings demonstrates clear gains over pure diffusion or replicator-only flows (Crucinio et al., 6 Jun 2025, Yan et al., 2023).
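A minimal SMC-WFR loop along these lines might look as follows; the KDE density estimate, bandwidth, and ESS threshold are illustrative assumptions rather than the exact algorithm of the cited papers:

```python
import numpy as np

def systematic_resample(w, rng):
    """Systematic resampling: returns ancestor indices for normalized weights w."""
    n = len(w)
    u = (rng.random() + np.arange(n)) / n
    return np.searchsorted(np.cumsum(w), u)

def smc_wfr(grad_V, log_pi, x0, n_iters=100, dt=0.05, bw=0.3, ess_frac=0.5, rng=None):
    """SMC-WFR sketch (1D): ULA move (W step), then an exact power-posterior
    reweight (FR step) using a weighted KDE for mu_t, with adaptive resampling."""
    rng = np.random.default_rng(rng)
    x = x0.astype(float).copy()
    n = len(x)
    w = np.full(n, 1.0 / n)
    eta = 1.0 - np.exp(-dt)          # exact FR exponent over an interval dt
    for _ in range(n_iters):
        # W step: one unadjusted Langevin (ULA) move per particle
        x += -grad_V(x) * dt + np.sqrt(2 * dt) * rng.standard_normal(n)
        # FR step: reweight by (pi / mu_t)^eta; O(N^2) pairwise KDE evaluation
        d2 = (x[:, None] - x[None, :]) ** 2
        log_mu = np.log((w * np.exp(-d2 / (2 * bw**2))).sum(axis=1) + 1e-300)
        logw = np.log(w + 1e-300) + eta * (log_pi(x) - log_mu)
        w = np.exp(logw - logw.max()); w /= w.sum()
        # resample when the effective sample size degrades
        if 1.0 / np.sum(w**2) < ess_frac * n:
            idx = systematic_resample(w, rng)
            x, w = x[idx], np.full(n, 1.0 / n)
    return x, w

# usage: standard normal target, particles started far off at N(3, 0.25)
rng = np.random.default_rng(0)
x, w = smc_wfr(lambda x: x, lambda x: -0.5 * x**2,
               rng.normal(3.0, 0.5, 400), rng=1)
```

The ESS-triggered resampling implements the "maintaining particle diversity and controlling weight degeneracy" step above; the `eta` exponent is the closed-form Fisher–Rao step size.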
In filtering, the WFR splitting generalizes the JKO (Wasserstein) filter and replicator (Fisher–Rao) update:
- In the Gaussian-linear case, both steps are analytic and guarantee convergence to the true continuous filter as the step size vanishes.
- Extension to mixtures and nonlinear settings is feasible using geodesic distance approximations and proximal maps (Halder et al., 2017).
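In one dimension the linear-Gaussian case is easy to verify directly; the following sketch (an illustration assuming a zero-mean Gaussian target, not the full filter of Halder et al.) iterates the two analytic substeps and recovers the target moments:

```python
from math import exp

def w_step(m, s2, s2_pi, h):
    """Pure Wasserstein (Langevin/OU) substep toward the Gaussian target
    N(0, s2_pi): mean and variance evolve analytically, Gaussians stay Gaussian."""
    a = exp(-h / s2_pi)                      # OU contraction factor
    return m * a, s2 * a**2 + s2_pi * (1 - a**2)

def fr_step(m, s2, m_pi, s2_pi, h):
    """Pure Fisher-Rao (replicator) substep, solved exactly: the geometric
    mixture mu^(e^-h) * pi^(1-e^-h), renormalized, is again Gaussian."""
    eta = 1.0 - exp(-h)
    prec = (1.0 - eta) / s2 + eta / s2_pi    # precisions combine linearly
    mean = ((1.0 - eta) * m / s2 + eta * m_pi / s2_pi) / prec
    return mean, 1.0 / prec

# W-FR splitting for an N(0, 1) target, started from N(3, 4)
m, s2 = 3.0, 4.0
for _ in range(100):
    m, s2 = w_step(m, s2, s2_pi=1.0, h=0.1)
    m, s2 = fr_step(m, s2, m_pi=0.0, s2_pi=1.0, h=0.1)
```

As the loop runs, `(m, s2)` contracts geometrically to the target moments `(0, 1)`, illustrating the vanishing-step-size convergence claim.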
5. Applications: Sampling, Inference, Learning, and Optimization
The WFR framework has been successfully applied in high-dimensional generative modeling, Bayesian inference, multi-objective optimization, nonparametric likelihood estimation, and beyond:
- Generative Models & Score-based Diffusions: Mitigating poor mixing in nonconvex, multimodal targets by enabling probability mass to teleport across low-probability regions (Rahimi, 19 Dec 2025).
- Multi-Objective Optimization: Combining Langevin transport with birth–death particle dynamics allows both movement toward the global Pareto front and aggressive elimination of dominated solutions (Ren et al., 2023).
- Gaussian Mixture Learning: The WFR gradient flow equipped with a particle system (interleaved location and weight updates) achieves globally convergent nonparametric maximum likelihood estimation, outperforming EM and single-metric gradient descent algorithms (Yan et al., 2023).
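A toy version of such an interleaved location/weight scheme for 1D unit-variance mixtures might read as follows; the step sizes, the EM-style preconditioning of the location step, and the atom initialization are all illustrative choices:

```python
import numpy as np

def wfr_gmm_fit(data, n_atoms=10, n_iters=300, lr_mu=0.5, lr_w=0.5, rng=None):
    """WFR-style nonparametric MLE sketch for a 1D Gaussian mixture with unit
    component variance: a Wasserstein step moves atom locations along the
    likelihood gradient; a Fisher-Rao step reweights atoms multiplicatively."""
    rng = np.random.default_rng(rng)
    mu = rng.choice(data, n_atoms)             # atom locations (init at data points)
    w = np.full(n_atoms, 1.0 / n_atoms)        # atom weights
    for _ in range(n_iters):
        phi = np.exp(-0.5 * (data[:, None] - mu[None, :]) ** 2)  # N(x; mu_k, 1), unnormalized
        p = phi @ w + 1e-300                   # mixture density at each data point
        r = (phi * w) / p[:, None]             # responsibilities r_ik
        # W step: damped EM-like move of each atom toward its weighted data mean
        num = (r * (data[:, None] - mu[None, :])).sum(axis=0)
        mu += lr_mu * num / np.maximum(r.sum(axis=0), 1e-12)
        # FR step: centered multiplicative (replicator) update of atom weights
        grad_w = (phi / p[:, None]).mean(axis=0)
        w = w * np.exp(lr_w * (grad_w - grad_w @ w))
        w /= w.sum()
    return mu, w

# usage: recover a two-component mixture at +/-3 from 400 samples
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])
mu, w = wfr_gmm_fit(data, rng=1)
```

The multiplicative weight update is the discrete Fisher–Rao part; because depleted components receive large likelihood gradients, the dynamics resist the weight collapse that plagues plain gradient descent on mixture parameters.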
The methodology generalizes to any setting where the optimization or sampling objective admits a PDE-driven gradient-flow interpretation in the space of probability measures, and is flexible with respect to kernel-based transport or interaction structures (Zhu, 2024).
6. Extensions: Inclusive KL, Tempering, and Generalized Metrics
Recent work extends the WFR gradient-flow to settings where the objective is the inclusive (forward) KL divergence, with strong exponential convergence guarantees that do not depend on log-concavity. The framework unifies reproducing kernel Hilbert space MMD flows, kernel Stein discrepancy flows, and birth–death accelerated flows through an Interaction-Force-Transport PDE, and is amenable to kernelization, yielding particle schemes compatible with modern machine learning architectures (Zhu, 2024).
Tempering strategies that interpolate between the initial distribution and the target do not accelerate convergence in continuous time, and may introduce bias; directly optimizing the WFR-flow remains preferable (Crucinio et al., 6 Jun 2025).
7. Limitations, Operator-Splitting Trade-Offs, and Future Directions
Operator splitting introduces nontrivial trade-offs in discretization error versus acceleration. In particular, judicious tuning of the time step and operator order can lead to faster convergence than the exact flow, especially far from equilibrium, but excessively large steps increase splitting error and may degrade performance (Crucinio et al., 22 Nov 2025, Crucinio et al., 6 Jun 2025, Halder et al., 2017).
Computationally, the main bottleneck is the cost of weight and kernel evaluations in the particle algorithms, which scales quadratically with particle number in typical SMC realizations, although optimizations via fast transforms or kernel sparsification are possible (Yan et al., 2023, Crucinio et al., 6 Jun 2025).
The WFR framework is expected to guide future developments in scalable geometry-aware sampling, robust optimization in complex (unbalanced and multimodal) spaces, and theory-driven algorithm design at the interface of optimal transport, information geometry, and stochastic analysis.