Natural Gradient Descent (NGD) Overview
- Natural Gradient Descent (NGD) is a Riemannian optimization method that leverages the Fisher Information Matrix to precondition parameter updates for faster convergence.
- NGD employs scalable approximations like K-FAC and block-diagonal methods, enabling effective application in deep learning, variational inference, and scientific computing.
- Recent advances in NGD, including structured and component-wise approaches, have improved both convergence speed and generalization in complex, high-dimensional models.
Natural Gradient Descent (NGD) is a Riemannian second-order optimization method that uses the local geometry of the statistical model to precondition parameter updates. Rather than operating in the Euclidean metric, NGD leverages the Fisher Information Matrix (FIM) as a metric tensor, yielding updates invariant under smooth reparameterizations and, in ideal cases, faster convergence than conventional gradient descent. Over the past decade, NGD has seen a rapid evolution from foundational theory in information geometry to scalable implementations in deep learning, variational inference, and structured models. This article offers a comprehensive technical overview, including the mathematical foundations, algorithmic forms, approximation schemes, recent empirical findings, and crucial limitations.
1. Mathematical Foundations and Derivation
Let $p(y \mid x, \theta)$ denote a parametric probabilistic model with parameters $\theta \in \mathbb{R}^P$ and negative log-likelihood loss $L(\theta)$. Standard gradient descent updates via $\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$, seeking the steepest descent direction with respect to the Euclidean norm. NGD arises by recognizing that the Euclidean gradient is not affine-invariant and that the parameter space of statistical models is more appropriately endowed with the Riemannian metric given by the FIM:

$$F(\theta) = \mathbb{E}_{p(y \mid x, \theta)}\!\left[\nabla_\theta \log p(y \mid x, \theta)\, \nabla_\theta \log p(y \mid x, \theta)^\top\right].$$
This FIM is equivalently the negative expected Hessian of the log-likelihood, $F(\theta) = -\mathbb{E}\!\left[\nabla_\theta^2 \log p(y \mid x, \theta)\right]$. The update direction minimizing the first-order decrease in $L$ under a KL-divergence constraint,

$$\delta^\star = \arg\min_{\delta}\; \nabla L(\theta)^\top \delta \quad \text{s.t.} \quad \mathrm{KL}\!\left(p_\theta \,\|\, p_{\theta+\delta}\right) \le \epsilon,$$

is given by $\delta^\star \propto -F(\theta)^{-1} \nabla L(\theta)$,
leading to the canonical NGD update:

$$\theta_{t+1} = \theta_t - \eta\, F(\theta_t)^{-1} \nabla L(\theta_t).$$
The metric induced by $F(\theta)$ makes NGD the steepest descent in distribution space rather than parameter space (Shrestha, 2023).
For conditional exponential-family models and classical deep learning losses (squared error, cross-entropy), the FIM coincides with the Generalized Gauss–Newton (GGN) approximation to the Hessian, a fact exploited for scalable computation.
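As a concrete illustration, the canonical update above can be run on a toy logistic-regression problem, using the empirical Fisher (the mean outer product of per-example gradients) with a small damping term for invertibility. The data, model, iteration count, and damping value below are all illustrative, not taken from the cited works:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy logistic-regression problem (synthetic; purely illustrative).
X = rng.normal(size=(64, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = (rng.uniform(size=64) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

def per_example_grads(w):
    """Per-example gradients of the negative log-likelihood w.r.t. w."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return (p - y)[:, None] * X

def empirical_fisher(G):
    """Empirical Fisher: mean outer product of per-example gradients."""
    return G.T @ G / G.shape[0]

w = np.zeros(3)
for _ in range(15):
    G = per_example_grads(w)
    g = G.mean(axis=0)
    F = empirical_fisher(G) + 1e-4 * np.eye(3)  # damping for invertibility
    w -= np.linalg.solve(F, g)                  # theta <- theta - F^{-1} grad
```

For this convex toy loss the Fisher-preconditioned step behaves much like a (Gauss–)Newton step, which is why a unit learning rate suffices here.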
2. Algorithmic Forms and Practical Computation
Direct computation of full NGD is infeasible in large models with $P$ parameters:
- Forming $F \in \mathbb{R}^{P \times P}$ explicitly requires $O(P^2)$ memory.
- Direct inversion costs $O(P^3)$ flops.
The following table summarizes major scalable NGD approximations, all yielding updates of the form $\Delta\theta = -\eta\, \tilde{F}^{-1} \nabla L(\theta)$:
| Approximation Class | Storage/Time per Step | Description |
|---|---|---|
| Diagonal Fisher | $O(P)$ | Keeps only the diagonal of $F$; the same second-moment statistic used by RMSProp/Adam. |
| Block-Diagonal (Layerwise) | $O(\sum_l P_l^2)$ storage, $O(\sum_l P_l^3)$ time | Partition $F$ into blocks per layer. Still costly for wide layers. |
| Kronecker-Factored (K-FAC) | $O(d_{\text{in}}^2 + d_{\text{out}}^2)$ per layer | Approximates each block as a Kronecker product of input and output factors. Efficient for FC/conv layers. |
| Woodbury/Exact Gram (TENGraD) | $O(NP + N^3)$ | Invert in the $N$-dimensional (batch-size) dual space. Memory-efficient when $N \ll P$. |
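For the simplest row of the table, a diagonal-Fisher step takes only a few lines; the helper name and hyperparameter defaults below are illustrative:

```python
import numpy as np

def diag_fisher_step(theta, G, lr=0.1, damping=1e-8):
    """One diagonal-Fisher NGD step.

    G: (N, P) array of per-example gradients. The diagonal of the
    empirical Fisher is the mean of squared per-example gradients,
    the same second-moment statistic RMSProp/Adam-style methods track.
    """
    g_mean = G.mean(axis=0)
    fisher_diag = (G ** 2).mean(axis=0)
    return theta - lr * g_mean / (fisher_diag + damping)
```

Storage is $O(P)$, since only the elementwise squared-gradient average is kept.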
Newer approaches deploy network reconstruction and block reparameterization to circumvent explicit inversions (Liu et al., 2021, Liu et al., 2024), or exploit hardware innovations such as analog thermodynamic solvers (Donatella et al., 2024).
3. Structured Natural Gradient and Network Reconstruction
Structured Natural Gradient Descent (SNGD) reframes NGD via layerwise reparameterizations. For a layer with weight $W_l$ and local Fisher block $F_l$ (built from the layer's input activations and an activation-slope diagonal), one reparameterizes $W_l = F_l^{-1/2} \widetilde{W}_l$. Standard GD on $\widetilde{W}_l$ in the reconstructed network then yields updates in the original parameters matching block-diagonal NGD. The local Fisher layers, inserted between standard layers, serve as normalization transformations that realize curvature correction at modest per-layer compute and memory cost, dramatically reducing total cost when channel counts are moderate (Liu et al., 2021, Liu et al., 2024). Denman–Beavers or Newton matrix square-root iterations are the preferred methods for computing $F_l^{-1/2}$.
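The Denman–Beavers iteration mentioned above is simple to implement; a minimal sketch for symmetric positive-definite inputs (the iteration count is illustrative):

```python
import numpy as np

def denman_beavers_sqrt(A, iters=25):
    """Denman-Beavers iteration for an SPD matrix A.

    Y_{k+1} = (Y_k + Z_k^{-1}) / 2,  Z_{k+1} = (Z_k + Y_k^{-1}) / 2,
    with Y_0 = A, Z_0 = I. Then Y_k -> A^{1/2} and Z_k -> A^{-1/2}.
    """
    Y, Z = A.astype(float), np.eye(A.shape[0])
    for _ in range(iters):
        # Tuple assignment evaluates both right-hand sides with the old Y, Z.
        Y, Z = 0.5 * (Y + np.linalg.inv(Z)), 0.5 * (Z + np.linalg.inv(Y))
    return Y, Z
```

Conveniently, the iteration produces the inverse square root $Z \approx F_l^{-1/2}$ needed for the reparameterization at the same time, with no separate inversion.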
These constructions have achieved small runtime overhead over SGD in large networks, outperforming K-FAC and blockwise NGD on convergence and test accuracy, particularly in deep architectures and in settings with an ill-conditioned Fisher (Liu et al., 2024).
4. Theoretical Guarantees, Limitations, and Fast Convergence Results
NGD is locally invariant under smooth reparameterizations and, for locally quadratic loss functions, can effect convergence in a single step (matching Newton's method). In overparameterized wide networks, layerwise block-diagonal and K-FAC approximations are theoretically justified to yield exponential decay of the training objective at rates independent of minimal NTK eigenvalues, provided the approximate Fisher is “isotropic” in function space (Karakida et al., 2020). For smooth, analytic activations in two-layer Physics-Informed Neural Networks (PINNs), NGD can achieve quadratic convergence up to the statistical limit, independent of the NTK spectrum, using properly adjusted step sizes (Xu et al., 2024).
However, global convergence for general nonconvex losses is unproven; only local guarantees exist, under strong convexity in the Fisher-induced KL metric. All practical approximations (diagonal, block-diagonal, K-FAC) introduce curvature-estimator bias, requiring careful adjustment of the learning rate and damping, especially as batch size decreases: updates that are stable at large batch sizes can become unstable at small ones (Shrestha, 2023).
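The instability can be seen directly on a toy example: with a near-singular Fisher estimate, the undamped natural-gradient step blows up along the flat direction, while Tikhonov damping $F + \lambda I$ keeps it bounded (all numbers below are illustrative):

```python
import numpy as np

F = np.diag([1.0, 1e-12])   # near-rank-deficient curvature estimate
g = np.array([1.0, 1.0])

# Undamped solve explodes along the near-null direction (~1e12).
raw_step = np.linalg.solve(F, g)

# Tikhonov damping bounds the step by roughly 1/lambda in flat directions.
damped_step = np.linalg.solve(F + 1e-4 * np.eye(2), g)
```

Choosing $\lambda$ trades bias in well-curved directions against stability in flat ones, which is why automated damping selection remains delicate.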
5. Empirical Performance and Practical Implications
NGD and its scalable variants (notably TENGraD, SNGD, D-NGD) have demonstrated substantial acceleration in convergence over first-order methods in both deep learning (e.g., ResNets on ImageNet and CIFAR-10, LSTMs on Penn-Treebank, MLPs on MNIST) and scientific ML (PINNs for PDEs) (Liu et al., 2024, Jnini et al., 27 May 2025). For small to moderate batch sizes, TENGraD matches or outperforms Adam/K-FAC in wall-clock efficiency and typically exhibits comparable or improved generalization (Shrestha, 2023).
Large-scale PDE and PINN applications have seen dramatic reductions in wall-clock time and final errors (one to three orders of magnitude) versus both classical optimizers and quasi-Newton methods (Jnini et al., 27 May 2025, Bioli et al., 16 May 2025). Recent advances, such as randomized Nyström preconditioning, have addressed the slow convergence of CG-based NGD when the Gramian is ill-conditioned, enabling near-optimal performance in a small number of iterations and restoring the practicality of NGD in high-dimensional regimes (Bioli et al., 16 May 2025).
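The CG-based solvers that such preconditioning accelerates are matrix-free: they need only Fisher-vector products, never the explicit matrix. A minimal sketch, in which the Jacobian, damping, and dimensions are illustrative:

```python
import numpy as np

def cg_solve(mvp, b, max_iters=100, tol=1e-12):
    """Conjugate gradients for an SPD system, given only a mat-vec product."""
    x = np.zeros_like(b)
    r = b - mvp(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iters):
        Ap = mvp(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Fisher-vector product from per-example gradients J (N x P): F v = J^T (J v) / N.
rng = np.random.default_rng(1)
J = rng.normal(size=(32, 8))
damping = 1e-3
fvp = lambda v: J.T @ (J @ v) / J.shape[0] + damping * v

g = rng.normal(size=8)
step = cg_solve(fvp, g)   # approximates (F + damping * I)^{-1} g
```

When the Gramian is ill-conditioned, plain CG needs many iterations; this is exactly the regime the Nyström preconditioning above targets.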
6. Structured, Component-Wise, and Problem-Specific Approximations
Further practical advances in NGD have focused on problem structure:
- Component-Wise NGD (CW-NGD): Decomposes each FIM block into smaller approximately independent segments (e.g., per-output-unit), yielding highly efficient block-diagonal updates well-suited for dense and convolutional network layers (Sang et al., 2022).
- Network reconstruction SNGD: Uses explicit local Fisher normalization layers to transform ill-conditioned loss surfaces into well-conditioned, approximately spherical ones in parameter space, systematically accelerating convergence even for small batch sizes and ill-conditioned training scenarios (Liu et al., 2021, Liu et al., 2024).
- Dual Formulations for PINNs: Solves for NGD steps in the residual space, which can be much lower-dimensional than parameter space, enabling scalable training of PINNs with tens of millions of parameters at feasible time and memory costs (Jnini et al., 27 May 2025).
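The dual trick underlying such formulations is a Woodbury identity: with per-example gradient matrix $J \in \mathbb{R}^{N \times P}$, the damped Fisher solve reduces to an $N \times N$ Gram system. A sketch under that assumption (the function name and damping value are illustrative, not the cited implementation):

```python
import numpy as np

def woodbury_ngd_step(J, g, damping=1e-3):
    """Solve (J^T J / N + damping * I) x = g via the N x N Gram matrix.

    Woodbury identity:
        (lam*I + J^T J / N)^{-1} g
      = (g - J^T (N*lam*I + J J^T)^{-1} J g) / lam,
    costing O(N^2 P + N^3) instead of O(P^3); pays off when N << P.
    """
    N = J.shape[0]
    gram = J @ J.T                                            # N x N
    inner = np.linalg.solve(damping * N * np.eye(N) + gram, J @ g)
    return (g - J.T @ inner) / damping
```

Because only the $N \times N$ Gram matrix is ever factored, memory stays linear in the parameter count even for tens of millions of parameters.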
7. Limitations, Open Questions, and Future Directions
Despite substantial progress, several limitations persist:
- Curvature Estimator Bias: All scalable approximations (K-FAC, block-diagonal, EF, etc.) trade estimator fidelity for tractability; the impact on generalization and solution selection remains incompletely understood, especially in finite width or classification regimes with rank-deficient Fisher (Shrestha, 2023, Karakida et al., 2020).
- Ill-conditioning and Damping: Small FIM eigenvalues (common with small batch sizes or deep nets) still cause instability; robust, automated selection of damping is an open challenge.
- Extremely Wide Layers/Embeddings: Matrix square-root operations in SNGD and related constructions can be costly for transformers, wide ResNets, etc., suggesting the need for new low-rank or structured approximations (Liu et al., 2024).
- Integration with first-order frameworks: Although per-epoch overhead is small for many NGD approximations, implementations must carefully integrate Fisher-preconditioning with momentum, weight decay, and distributed training requirements, which can present engineering complexity.
- Generalization of the metric: Problem-specific, data-adaptive, and reference-manifold-based natural gradients have been theorized or implemented (e.g., for structured variational distributions, Wasserstein or Sobolev metrics in scientific computing), but rigorous analysis and best-practice guidelines are nascent (Nurbekyan et al., 2022, Bao et al., 12 Dec 2025, Dong et al., 2022).
NGD has shaped scalable, geometry-aware optimization across statistical learning, variational modeling, and scientific computation, with the landscape evolving continuously as algorithmic, hardware, and theoretical innovations progress.