
Fisher–Rao Regularization

Updated 1 February 2026
  • Fisher–Rao regularization is an information-geometric method that uses the Fisher–Rao metric to enforce parametrization-invariant updates in statistical models.
  • It employs squared geodesic distances, trace penalties, and birth–death flows to enhance model robustness, convergence, and overall generalization.
  • Applications span adversarial defense, optimal transport, and infinite-dimensional inference, offering both theoretical guarantees and empirical performance gains.

Fisher–Rao regularization is an information-geometric technique for machine learning and statistical inference that incorporates the Fisher–Rao Riemannian metric into regularization objectives. This approach penalizes deviations from a reference or prior via the intrinsic geometry of probability distributions, yielding parametrization-invariant and optimally dissipative updates. The Fisher–Rao regularizer can take multiple concrete forms depending on model class, ranging from squared geodesic distances on statistical manifolds to trace penalties on Fisher information and birth–death flows in mean-field games. It is uniquely characterized by Čencov’s theorem for invariance under sufficient statistics, and has demonstrated theoretical and empirical advantages in generalization, robustness, and optimization convergence.

1. Mathematical Definition of the Fisher–Rao Regularizer

The Fisher–Rao metric arises from the Fisher information matrix, which for a parametric statistical manifold $\mathcal{S} = \{p(x;\theta)\}$ with $\theta \in \mathbb{R}^d$ is given by

$$G_{ij}(\theta) = \mathbb{E}_{p(x;\theta)} \bigl[ \partial_{\theta_i}\log p(x;\theta) \, \partial_{\theta_j}\log p(x;\theta) \bigr].$$

This induces a Riemannian metric on the parameter space: $ds^2 = d\theta^\top G(\theta)\, d\theta$.
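As a quick numerical check of this definition (not drawn from the cited works), the Fisher information of a Bernoulli model can be estimated by Monte Carlo averaging of squared scores and compared against its exact value $1/(\theta(1-\theta))$:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_fisher_bernoulli(theta, n=200_000):
    """Monte Carlo estimate of the Fisher information of Bernoulli(theta).

    G(theta) = E[(d/dtheta log p(x; theta))^2]; the exact value is
    1 / (theta * (1 - theta)).
    """
    x = rng.random(n) < theta  # samples from p(x; theta)
    # Score d/dtheta log p: 1/theta when x = 1, -1/(1 - theta) when x = 0.
    score = np.where(x, 1.0 / theta, -1.0 / (1.0 - theta))
    return np.mean(score ** 2)

theta = 0.3
g_hat = empirical_fisher_bernoulli(theta)
g_exact = 1.0 / (theta * (1.0 - theta))
```

With 200,000 samples the estimate typically lands within a few percent of the exact value.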

The Fisher–Rao (FR) geodesic distance between two parameter points $\theta_1, \theta_2$ is defined as the minimal path length in this metric,

$$d_{\mathrm{FR}}(\theta_1, \theta_2) = \inf_\gamma \int_0^1 \sqrt{ \dot{\gamma}(t)^\top G(\gamma(t))\, \dot{\gamma}(t) } \, dt,$$

over paths with $\gamma(0) = \theta_1$ and $\gamma(1) = \theta_2$.

The Fisher–Rao regularization penalty is constructed by adding a term proportional to $d_{\mathrm{FR}}^2(\theta, \theta_0)$ to the objective, where $\theta_0$ is a reference configuration:

$$J(\theta) = L(\theta) + \lambda \, d_{\mathrm{FR}}^2(\theta, \theta_0).$$

Closed-form FR distances exist for several exponential families and for multinomial distributions (Miyamoto et al., 2023); for discrete probabilities $p, q$,

$$d_{\mathrm{FR}}(p,q) = 2 \arccos \sum_{i=1}^n \sqrt{p_i q_i}.$$
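The closed-form simplex distance and the regularized objective above can be sketched in a few lines of NumPy (function names here are illustrative, not from the cited papers):

```python
import numpy as np

def fisher_rao_discrete(p, q):
    """Fisher-Rao geodesic distance between two discrete distributions.

    Closed form: d_FR(p, q) = 2 * arccos(sum_i sqrt(p_i * q_i)),
    i.e. twice the angle between sqrt(p) and sqrt(q) on the unit sphere.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Bhattacharyya coefficient; clip guards against round-off outside [-1, 1].
    bc = np.clip(np.sum(np.sqrt(p * q)), -1.0, 1.0)
    return 2.0 * np.arccos(bc)

def fr_regularized_objective(loss, p, p0, lam):
    """J = L + lambda * d_FR^2(p, p0) for a categorical output p."""
    return loss + lam * fisher_rao_discrete(p, p0) ** 2

# Identical distributions are at distance 0; distributions with disjoint
# support sit at the maximal distance pi.
d_same = fisher_rao_discrete([0.5, 0.5], [0.5, 0.5])
d_far  = fisher_rao_discrete([1.0, 0.0], [0.0, 1.0])
```

Note that the distance is bounded by $\pi$, unlike the KL divergence, which diverges on disjoint supports.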

2. Birth–Death Gradient Flows and Mean-Field Min–Max Games

Fisher–Rao regularization manifests in continuous-time gradient flows relevant to mean-field games and optimization. In the context of entropic convex–concave min–max problems (Lascu et al., 2024), consider the objective

$$V^\sigma(\nu, \mu) = F(\nu, \mu) + \frac{\sigma^2}{2} \left[ D_{\mathrm{KL}}(\nu \,\|\, \pi) - D_{\mathrm{KL}}(\mu \,\|\, \rho) \right],$$

where $F$ is convex–concave and $\pi, \rho$ are reference measures. The Fisher–Rao gradient flow is given by coupled birth–death partial differential equations

$$\partial_t \nu_t(x) = -a(\nu_t, \mu_t, x) \, \nu_t(x), \qquad \partial_t \mu_t(y) = b(\nu_t, \mu_t, y) \, \mu_t(y),$$

where $a(\nu, \mu, x)$ and $b(\nu, \mu, y)$ are functional derivatives of $V^\sigma$.
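To illustrate how such birth–death dynamics behave, the following toy single-player analogue runs a mass-conserving Fisher–Rao flow on a three-point space toward a reference measure. This is a minimal sketch under simplifying assumptions, not the coupled min–max system of Lascu et al.:

```python
import numpy as np

def birth_death_step(nu, pi, dt):
    """One explicit Euler step of the mass-conserving Fisher-Rao birth-death
    flow  d/dt nu = -(a - <a, nu>) * nu  with rate a = log(nu / pi), the
    first variation of KL(nu | pi).  The multiplicative update keeps nu > 0.
    """
    a = np.log(nu / pi)
    a = a - np.dot(a, nu)      # center the rate so total mass is preserved
    nu = nu * np.exp(-dt * a)
    return nu / nu.sum()       # renormalize against discretization drift

pi = np.array([0.5, 0.3, 0.2])     # reference measure (the flow's fixed point)
nu = np.full(3, 1.0 / 3.0)         # uniform initialization
for _ in range(200):
    nu = birth_death_step(nu, pi, dt=0.1)
```

In line with the exponential-convergence bound quoted below, the iterate contracts toward the reference measure geometrically in the number of steps.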

A Lyapunov analysis demonstrates exponential convergence to the unique mixed Nash equilibrium under strong convexity/concavity, with the error decay bound

$$D_{\mathrm{KL}}(\nu^*_\sigma \,\|\, \nu_t) + D_{\mathrm{KL}}(\mu^*_\sigma \,\|\, \mu_t) \leq e^{-\sigma^2 t / 2} \left[ D_{\mathrm{KL}}(\nu^*_\sigma \,\|\, \nu_0) + D_{\mathrm{KL}}(\mu^*_\sigma \,\|\, \mu_0) \right].$$

This framework supports entropy-based birth–death dynamics with mass transfer and a strictly stabilizing geometric structure (Lascu et al., 2024).

3. Implications for Adversarial Robustness and Generalization

Fisher–Rao regularization underlies several recent advances in adversarial and robust classification. FIRE (Fisher–Rao Information-geometric Regularization) employs a geodesic penalty on the probability simplex between natural and adversarial outputs:

$$L_{\mathrm{FIRE}}(\theta) = \mathbb{E}_{(x, y) \sim p} \left[ \max_{\|x' - x\| \leq \varepsilon} \left\{ -\ln q_\theta(y \mid x') + \lambda \, d_{\mathrm{FR}}\bigl(q_\theta(\cdot \mid x), q_\theta(\cdot \mid x')\bigr) \right\} \right],$$

with closed-form expressions for binary and multiclass softmax outputs (Picot et al., 2021).
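Using the discrete closed form $d_{\mathrm{FR}}(p,q) = 2\arccos\sum_i \sqrt{p_i q_i}$, the geodesic penalty between natural and adversarial softmax outputs can be sketched as follows (a minimal illustration; `fire_penalty` is an assumed name, and the inner adversarial maximization over $x'$ is omitted):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def fire_penalty(logits_nat, logits_adv, lam=1.0):
    """Fisher-Rao geodesic penalty between natural and adversarial
    softmax outputs: lam * 2 * arccos(sum_k sqrt(q_k * q'_k)).
    """
    q, qp = softmax(logits_nat), softmax(logits_adv)
    bc = np.clip(np.sum(np.sqrt(q * qp)), -1.0, 1.0)
    return lam * 2.0 * np.arccos(bc)

# Unperturbed logits incur no penalty; a perturbation that flips the top
# two classes incurs a large one.
pen_same = fire_penalty(np.array([2.0, -1.0, 0.5]), np.array([2.0, -1.0, 0.5]))
pen_diff = fire_penalty(np.array([2.0, -1.0, 0.5]), np.array([-1.0, 2.0, 0.5]))
```

The penalty directly measures how far the perturbation moves the output distribution along the simplex geodesic, which is the quantity FIRE trades off against the cross-entropy term.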

Empirical work demonstrates Pareto-optimal trade-offs in robustness versus accuracy, outperforming KL-regularized baselines such as TRADES. Fisher–Rao regularization provides direct control over how output distributions change under adversarial perturbation, yielding consistent improvements in both clean and robust accuracy, and reducing computational cost (Yin et al., 2024, Picot et al., 2021).

4. Optimal Transport, Interpolating Metrics, and Dynamical Formulations

The Fisher–Rao metric interpolates between classical optimal transport and purely entropic distances. The $\mathrm{WF}_\delta$ metric is defined over non-negative measures by the variational principle (Chizat et al., 2015)

$$\inf_{(\rho, v, g)} \int_0^1 \int_\Omega \left( |v|^2 \rho + \delta^2 g^2 \rho \right) dx \, dt$$

subject to the continuity equation with source, $\partial_t \rho + \nabla \cdot (\rho v) = g \rho$.

Limiting cases recover the Wasserstein-2 metric as $\delta \to \infty$ and the Fisher–Rao metric as $\delta \to 0$. Applications include image interpolation, with practical solvers via proximal splitting. Geodesics in $\mathrm{WF}_\delta$ interpolate between mass transport and creation/annihilation, yielding robust computation and faithful modeling of flows in biological and physical systems (Elkin et al., 2019, Chizat et al., 2015).

5. Infinite-Dimensional and Non-Parametric Extensions

For non-parametric models, Fisher–Rao regularization faces analytical and computational challenges due to the infinite-dimensional nature of score functions. An orthogonal decomposition of the tangent space, $T_f \mathcal{M} = S \oplus S^\perp$, leads to the covariate Fisher information matrix $\mathcal{G}_f$ and the trace penalty

$$H_G(f) = \mathrm{Tr}(\mathcal{G}_f),$$

which acts as a tractable regularizer and a geometric invariant for the "explainable" statistical information (Cheng et al., 25 Dec 2025). These constructions yield generalized Cramér–Rao lower bounds and allow consistent estimation of the intrinsic dimension underlying manifold hypotheses in high-dimensional data.
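A generic Monte Carlo sketch of a trace-of-Fisher penalty, using the identity $\mathrm{Tr}(G) = \mathbb{E}[\|\text{score}\|^2]$, is shown below for a standard Gaussian whose score is known in closed form; the covariate decomposition of Cheng et al. is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(4)

def trace_fisher_mc(score_fn, samples):
    """Monte Carlo estimate of Tr(G) = E[ || score(x) ||^2 ]."""
    s = np.array([score_fn(x) for x in samples])
    return np.mean(np.sum(s ** 2, axis=1))

d = 3
samples = rng.normal(size=(100_000, d))          # x ~ N(0, I_d)
tr_hat = trace_fisher_mc(lambda x: -x, samples)  # score of N(0, I) is -x
# exact value: Tr(G) = E[||x||^2] = d
```

For the standard Gaussian the estimate should concentrate around the dimension $d$, which is consistent with using such trace penalties for intrinsic-dimension estimation.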

6. Computational Implementation and Algorithmic Variants

Fisher–Rao regularization is realized in practice via closed-form geodesic expressions, local quadratic (Mahalanobis) approximations, or empirical score-norm surrogates (Miyamoto et al., 2023, Jia et al., 2019). For deep neural networks, regularization terms can include log-determinant or trace penalties on Fisher matrices computed per mini-batch, e.g. the finite-difference surrogate

$$\mathcal{L}_{\mathrm{reg}}(\theta) = \mathcal{L}(\theta) + \beta \, \frac{1}{M} \sum_{i=1}^M \bigl[\mathcal{L}(\mathcal{B}_i;\theta) - \mathcal{L}(\mathcal{B}_i;\theta - \alpha g_i)\bigr].$$

Diagonal or low-rank approximations in high dimensions, finite-difference schemes, and automatic differentiation streamline computation.
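The per-mini-batch finite-difference penalty above can be sketched on a toy scalar least-squares model (all names and the model are illustrative; to first order the bracketed difference approximates the score-norm surrogate $\alpha \|g_i\|^2$):

```python
import numpy as np

def loss(theta, batch):
    """Toy squared-error loss on a mini-batch of (x, y) arrays."""
    x, y = batch
    return np.mean((x * theta - y) ** 2)

def grad(theta, batch):
    """Gradient of the toy loss with respect to the scalar theta."""
    x, y = batch
    return np.mean(2 * x * (x * theta - y))

def regularized_loss(theta, batches, alpha=0.1, beta=0.5):
    """L(theta) + beta * (1/M) * sum_i [L(B_i; theta) - L(B_i; theta - alpha*g_i)].

    The finite difference needs only two forward passes per batch, avoiding
    explicit Fisher matrices.
    """
    base = np.mean([loss(theta, b) for b in batches])
    penalty = np.mean([loss(theta, b) - loss(theta - alpha * grad(theta, b), b)
                       for b in batches])
    return base + beta * penalty

rng = np.random.default_rng(2)
batches = [(rng.normal(size=8), rng.normal(size=8)) for _ in range(4)]
val = regularized_loss(1.0, batches)
```

Since each gradient step decreases the convex toy loss (for a sufficiently small step size), the penalty is nonnegative and the regularized value dominates the plain average loss.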

In streaming and incremental learning, Covariate Shift Correction (C²A) applies Fisher–Rao quadratic penalties batch-by-batch to prevent catastrophic forgetting and adapt to density shifts, maintaining only two batches and per-batch FIMs in memory (Khan et al., 18 Feb 2025).
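A minimal sketch of such a batch-wise quadratic Fisher–Rao penalty, using a diagonal empirical FIM in the spirit of these methods (a generic illustration under assumed names, not the C²A algorithm itself):

```python
import numpy as np

def diagonal_fim(scores):
    """Diagonal empirical Fisher: mean of squared per-sample score vectors."""
    return np.mean(scores ** 2, axis=0)

def quadratic_fr_penalty(theta, theta_prev, fim_diag, lam=1.0):
    """Local quadratic (Mahalanobis) approximation of the squared
    Fisher-Rao distance: lam * sum_j F_jj * (theta_j - theta_prev_j)^2.

    Directions the previous batch deemed informative (large F_jj) are
    penalized more, discouraging forgetting along them.
    """
    return lam * np.sum(fim_diag * (theta - theta_prev) ** 2)

rng = np.random.default_rng(3)
scores = rng.normal(size=(1000, 4))   # per-sample score vectors from the old batch
fim = diagonal_fim(scores)
pen_zero = quadratic_fr_penalty(np.ones(4), np.ones(4), fim)        # no drift
pen_move = quadratic_fr_penalty(np.ones(4) + 0.1, np.ones(4), fim)  # drifted
```

Only the previous parameters and the diagonal FIM must be retained between batches, matching the small memory footprint described above.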

7. Thermodynamic Optimality and Fundamental Limits

Under explicit information-geometric and physical assumptions, Fisher–Rao regularization is shown to be thermodynamically optimal (Caraffa, 24 Jan 2026). It uniquely minimizes dissipated energy under quasi-static belief-state updates via the intrinsic metric. For exponential families, the induced geometries are hyperbolic (Gaussian, half-plane) or von Mises (circular), yielding closed-form geodesics. Euclidean regularizers and other heuristics are proven suboptimal in general, incapable of capturing the invariant structure of statistical manifolds. The thermodynamic efficiency of learning is strictly increased under Fisher–Rao regularization, as quantified by the ratio $\eta = E_{\text{Landauer}}/E_{\text{actual}}$. Experimental predictions link Fisher–Rao efficiency to learning phase transitions and mode collapse in unsupervised setups.


In summary, Fisher–Rao regularization operationalizes geometric information into machine learning objectives, conferring strictly invariant, theoretically optimal, and empirically effective control over model complexity, robustness, convergence, and statistical efficiency across a broad range of contexts including adversarial defense, optimal transport, mean-field games, and infinite-dimensional settings (Lascu et al., 2024, Picot et al., 2021, Chizat et al., 2015, Cheng et al., 25 Dec 2025, Caraffa, 24 Jan 2026).
