
Natural Gradient Descent (NGD) Overview

Updated 28 January 2026
  • Natural Gradient Descent (NGD) is a Riemannian optimization method that leverages the Fisher Information Matrix to precondition parameter updates for faster convergence.
  • NGD employs scalable approximations like K-FAC and block-diagonal methods, enabling effective application in deep learning, variational inference, and scientific computing.
  • Recent advances in NGD, including structured and component-wise approaches, have improved both convergence speed and generalization in complex, high-dimensional models.

Natural Gradient Descent (NGD) is a Riemannian second-order optimization method that uses the local geometry of the statistical model to precondition parameter updates. Rather than operating in the Euclidean metric, NGD leverages the Fisher Information Matrix (FIM) as a metric tensor, yielding updates invariant under smooth reparameterizations and, in ideal cases, faster convergence than conventional gradient descent. Over the past decade, NGD has seen a rapid evolution from foundational theory in information geometry to scalable implementations in deep learning, variational inference, and structured models. This article offers a comprehensive technical overview, including the mathematical foundations, algorithmic forms, approximation schemes, recent empirical findings, and crucial limitations.

1. Mathematical Foundations and Derivation

Let $p(x\mid\theta)$ denote a parametric probabilistic model, with negative log-likelihood loss $L(\theta) = -\sum_{i=1}^n \log p(x_i\mid\theta)$. Standard gradient descent updates via $\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)$, seeking the steepest descent with respect to the Euclidean norm. NGD arises by recognizing that $\|\Delta\theta\|_2^2$ is not affine-invariant and that the parameter space of statistical models is more appropriately endowed with the Riemannian metric given by the FIM:

$$F(\theta) = \mathbb{E}_{p(x\mid\theta)} \left[ \nabla_\theta \log p(x\mid\theta)\, \nabla_\theta \log p(x\mid\theta)^\top \right].$$

This FIM is equivalently the negative expected Hessian of the log-likelihood. The update direction minimizing $L(\theta + \Delta\theta)$ under a KL-divergence constraint $\mathrm{KL}(p_\theta \,\|\, p_{\theta+\Delta\theta}) \le \varepsilon^2$ is given by

$$\Delta\theta \propto -F(\theta)^{-1} \nabla_\theta L(\theta),$$

leading to the canonical NGD update:

$$\theta_{t+1} = \theta_t - \eta\, F(\theta_t)^{-1} \nabla_\theta L(\theta_t).$$

The metric induced by $F(\theta)$ makes NGD the steepest descent in distribution space rather than parameter space (Shrestha, 2023).
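To make the link between the KL constraint and the $F^{-1}$ preconditioner explicit, the standard derivation expands the loss to first order, the KL divergence to second order, and solves the constrained problem with a Lagrange multiplier:

```latex
% Second-order expansion of the KL divergence around \theta:
%   KL(p_\theta \| p_{\theta+\Delta\theta}) \approx \tfrac{1}{2}\,\Delta\theta^\top F(\theta)\,\Delta\theta.
%
% Constrained steepest-descent problem:
\min_{\Delta\theta}\; \nabla_\theta L(\theta)^\top \Delta\theta
\quad \text{s.t.} \quad
\tfrac{1}{2}\,\Delta\theta^\top F(\theta)\,\Delta\theta \le \varepsilon^2 .
%
% Stationarity of the Lagrangian
%   \mathcal{L} = \nabla_\theta L^\top \Delta\theta
%               + \tfrac{\lambda}{2}\,\Delta\theta^\top F\,\Delta\theta
% gives
\nabla_\theta L + \lambda\, F\, \Delta\theta = 0
\;\;\Longrightarrow\;\;
\Delta\theta = -\tfrac{1}{\lambda}\, F(\theta)^{-1}\, \nabla_\theta L(\theta) .
```

The multiplier $1/\lambda$ is absorbed into the learning rate $\eta$, recovering the canonical update.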

For conditional exponential-family models and classical deep learning losses (squared error, cross-entropy), $F(\theta)$ coincides with the Generalized Gauss–Newton (GGN) approximation to the Hessian, a fact exploited for scalable computation.
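As a concrete illustration of the canonical update, the following minimal NumPy sketch runs damped NGD on a toy logistic-regression likelihood, where the exact FIM is available in closed form as $F(\theta) = \frac{1}{n} X^\top \mathrm{diag}(p_i(1-p_i))\, X$ and coincides with the NLL Hessian, so these steps are exactly Fisher scoring. All data sizes and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy logistic-regression problem: n samples, d features.
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

def nll_grad(w):
    """Gradient of the mean negative log-likelihood."""
    p = 1 / (1 + np.exp(-X @ w))
    return X.T @ (p - y) / n

def fisher(w):
    """Exact FIM: E_y[s s^T] = X^T diag(p(1-p)) X / n."""
    p = 1 / (1 + np.exp(-X @ w))
    return (X * (p * (1 - p))[:, None]).T @ X / n

w = np.zeros(d)
eta, damping = 1.0, 1e-4          # damping stabilises the inverse
for _ in range(20):
    F = fisher(w) + damping * np.eye(d)
    w = w - eta * np.linalg.solve(F, nll_grad(w))
```

With $\eta = 1$ this converges to the MLE in a handful of steps, whereas plain gradient descent on the same problem needs orders of magnitude more iterations.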

2. Algorithmic Forms and Practical Computation

Direct computation of full NGD is infeasible in large models:

  • Forming $F(\theta)$ explicitly requires $O(p^2)$ memory, where $p = \dim(\theta)$.
  • Direct inversion costs $O(p^3)$ flops.

The following table summarizes major scalable NGD approximations, all yielding updates of the form $\theta_{t+1} = \theta_t - \eta\, \hat{F}^{-1} \nabla_\theta L$:

| Approximation Class | Storage/Time per Step | Description |
| --- | --- | --- |
| Diagonal Fisher | $O(p)$ | $F \approx \mathrm{diag}(f_i)$. Used by RMSProp/Adam. |
| Block-Diagonal (Layerwise) | $\sum_l O(p_l^2)$ storage, $O(p_l^3)$ per-block inversion | Partition $F$ into per-layer blocks. Still costly for wide layers. |
| Kronecker-Factored (K-FAC) | $O(d_i^2 + d_o^2)$ per layer | $F \approx A \otimes G$. Efficient for fully connected and convolutional layers. |
| Woodbury/Exact Gram (TENGraD) | $O(n_{\text{layers}}\, m^2)$ | Inversion in the $m \times m$ (batch-size) dual space. Memory-efficient when $m \ll p_l$. |
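To make the K-FAC row concrete, the sketch below applies the Kronecker identity $(A \otimes G)^{-1}\mathrm{vec}(V) = \mathrm{vec}(G^{-1} V A^{-1})$ to a single dense layer, using random stand-ins for the activation and gradient statistics that a real backward pass would supply; sizes and the damping value are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, m = 8, 4, 64        # layer sizes and batch size

# Per-batch statistics for one dense layer: a = layer inputs,
# g = back-propagated output gradients (random stand-ins here).
a = rng.normal(size=(m, d_in))
g = rng.normal(size=(m, d_out))
grad_W = g.T @ a / m             # Euclidean gradient, shape (d_out, d_in)

# Damped Kronecker factors: F ≈ A ⊗ G.
lam = 1e-3
A = a.T @ a / m + lam * np.eye(d_in)
G = g.T @ g / m + lam * np.eye(d_out)

# (A ⊗ G)^{-1} vec(grad_W) == vec(G^{-1} grad_W A^{-1}):
# two small solves instead of one (d_in*d_out)^2 inverse.
nat_grad = np.linalg.solve(G, grad_W) @ np.linalg.inv(A)

# Sanity check against the explicit Kronecker inverse (column-major vec).
F = np.kron(A, G)
ref = np.linalg.solve(F, grad_W.reshape(-1, order="F"))
ref = ref.reshape(d_out, d_in, order="F")
assert np.allclose(nat_grad, ref)
```

The per-step cost drops from $O((d_i d_o)^3)$ for the explicit block inverse to $O(d_i^3 + d_o^3)$ for the two factor solves.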

Newer approaches deploy network reconstruction and block reparameterization to circumvent explicit inversions (Liu et al., 2021, Liu et al., 2024), or exploit hardware innovations such as analog thermodynamic solvers (Donatella et al., 2024).

3. Structured Natural Gradient and Network Reconstruction

Structured Natural Gradient Descent (SNGD) reframes NGD via layerwise reparameterizations. For layer $l$ with weight $W_l$ and local Fisher block $G_l \approx \mathbb{E}[V_f]\, \mathbb{E}[x x^\top]$, where $V_f$ is the diagonal matrix of activation slopes, one reparameterizes $W_l = \tilde{W}_l G_l^{-1/2}$. Standard GD on $\tilde{W}_l$ in the reconstructed network then yields updates in the original parameters matching block-diagonal NGD. The local Fisher layers, inserted between standard layers, serve as normalization transformations that realize curvature correction at $O(m_l^3)$ time and $O(m_l^2)$ memory per layer, dramatically reducing total cost when the channel counts $m_l$ are moderate (Liu et al., 2021, Liu et al., 2024). Denman–Beavers or Newton matrix square-root iterations are the preferred methods for computing $G^{\pm 1/2}$.
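A minimal NumPy version of the coupled Denman–Beavers iteration named above, which converges quadratically for symmetric positive-definite inputs and produces both the square root and its inverse in one pass, might look like this (the test matrix is an arbitrary SPD stand-in for a local Fisher block):

```python
import numpy as np

def denman_beavers_sqrt(G, iters=20):
    """Coupled Denman-Beavers iteration.

    Y_k -> G^{1/2} and Z_k -> G^{-1/2} for SPD G, with
    quadratic local convergence.
    """
    Y = G.copy()
    Z = np.eye(G.shape[0])
    for _ in range(iters):
        Y_next = 0.5 * (Y + np.linalg.inv(Z))
        Z = 0.5 * (Z + np.linalg.inv(Y))   # uses the old Y
        Y = Y_next
    return Y, Z                            # (G^{1/2}, G^{-1/2})

# Small SPD matrix standing in for a local Fisher block G_l.
rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
G = B @ B.T + 0.1 * np.eye(5)

sqrtG, inv_sqrtG = denman_beavers_sqrt(G)
assert np.allclose(sqrtG @ sqrtG, G, atol=1e-6)
assert np.allclose(sqrtG @ inv_sqrtG, np.eye(5), atol=1e-6)
```

In practice only a few iterations are run per training step, since $G_l$ changes slowly and the previous square root is a warm start.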

These constructions have achieved 5–10% runtime overhead over SGD in large networks, outperforming K-FAC and blockwise NGD in convergence and test accuracy, particularly in deep architectures and in settings with an ill-conditioned Fisher matrix (Liu et al., 2024).

4. Theoretical Guarantees, Limitations, and Fast Convergence Results

NGD is locally invariant under smooth reparameterizations and, for locally quadratic loss functions, can effect convergence in a single step (matching Newton's method). In overparameterized wide networks, layerwise block-diagonal and K-FAC approximations are theoretically justified to yield exponential decay of the training objective at rates independent of minimal NTK eigenvalues, provided the approximate Fisher is “isotropic” in function space (Karakida et al., 2020). For smooth, analytic activations in two-layer Physics-Informed Neural Networks (PINNs), NGD can achieve quadratic convergence up to the statistical limit, independent of the NTK spectrum, using properly adjusted step sizes (Xu et al., 2024).

However, global convergence for general nonconvex losses is unproven; only local guarantees exist under strong convexity in the Fisher-induced KL-metric. All practical approximations—diagonal, block, K-FAC—introduce curvature estimator bias, requiring careful adjustment of learning rate and damping, especially as batch size decreases (instability at batch size $m = 8$, stable at $m \ge 128$) (Shrestha, 2023).
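The interaction between small Fisher eigenvalues and damping can be seen in a few lines of NumPy: on an ill-conditioned diagonal Fisher block, the undamped natural-gradient step explodes along the nearly flat direction, while the damped preconditioner $F + \lambda I$ caps it (the matrix and $\lambda$ values here are illustrative):

```python
import numpy as np

# Ill-conditioned 2-D Fisher block (eigenvalues 1 and 1e-6).
F = np.diag([1.0, 1e-6])
grad = np.array([1.0, 1.0])

# Undamped NGD step: the flat direction is amplified by 1/1e-6.
step_raw = np.linalg.solve(F, grad)

# Damped steps: lambda interpolates between NGD (lambda -> 0)
# and plain gradient descent (lambda large).
for lam in (1e-4, 1e-1):
    step = np.linalg.solve(F + lam * np.eye(2), grad)
    print(lam, step)
```

The second component of the raw step is $10^6$, while with $\lambda = 0.1$ it shrinks to about $10$, which is why damping selection dominates practical stability.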

5. Empirical Performance and Practical Implications

NGD and its scalable variants (notably TENGraD, SNGD, D-NGD) have demonstrated substantial acceleration in convergence over first-order methods in both deep learning (e.g., ResNets on ImageNet and CIFAR-10, LSTMs on Penn-Treebank, MLPs on MNIST) and scientific ML (PINNs for PDEs) (Liu et al., 2024, Jnini et al., 27 May 2025). For small to moderate batch sizes, TENGraD matches or outperforms Adam/K-FAC in wall-clock efficiency and typically exhibits comparable or improved generalization (Shrestha, 2023).

Large-scale PDE and PINN applications have seen dramatic reductions in wall-clock time and final $L^2$ errors (one to three orders of magnitude) versus both classical optimizers and quasi-Newton methods (Jnini et al., 27 May 2025, Bioli et al., 16 May 2025). Recent advances, such as randomized Nyström preconditioning, have addressed the slow convergence of CG-based NGD when the Gramian is ill-conditioned, enabling near-optimal performance in a small number of iterations and restoring the practicality of NGD in high-dimensional regimes (Bioli et al., 16 May 2025).

6. Structured, Component-Wise, and Problem-Specific Approximations

Further practical advances in NGD have focused on problem structure:

  • Component-Wise NGD (CW-NGD): Decomposes each FIM block into smaller approximately independent segments (e.g., per-output-unit), yielding highly efficient block-diagonal updates well-suited for dense and convolutional network layers (Sang et al., 2022).
  • Network reconstruction SNGD: Uses explicit local Fisher normalization layers to transform ill-conditioned loss surfaces into well-spherical ones in parameter space, thus systematically accelerating convergence even for small batch sizes and ill-conditioned training scenarios (Liu et al., 2021, Liu et al., 2024).
  • Dual Formulations for PINNs: Solves for NGD steps in the residual space, which can be much lower-dimensional than parameter space, enabling scalable training of PINNs with tens of millions of parameters at feasible time and memory costs (Jnini et al., 27 May 2025).

7. Limitations, Open Questions, and Future Directions

Despite substantial progress, several limitations persist:

  • Curvature Estimator Bias: All scalable approximations (K-FAC, block-diagonal, EF, etc.) trade estimator fidelity for tractability; the impact on generalization and solution selection remains incompletely understood, especially in finite width or classification regimes with rank-deficient Fisher (Shrestha, 2023, Karakida et al., 2020).
  • Ill-conditioning and Damping: Small FIM eigenvalues (common with small batch sizes or deep nets) still cause instability; robust, automated selection of damping is an open challenge.
  • Extremely Wide Layers/Embeddings: Matrix square-root operations in SNGD and related constructions can be costly for transformers, wide ResNets, etc., suggesting the need for new low-rank or structured approximations (Liu et al., 2024).
  • Integration with first-order frameworks: Although per-epoch overhead is small for many NGD approximations, implementations must carefully integrate Fisher-preconditioning with momentum, weight decay, and distributed training requirements, which can present engineering complexity.
  • Generalization of the metric: Problem-specific, data-adaptive, and reference-manifold-based natural gradients have been theorized or implemented (e.g., for structured variational distributions, Wasserstein or Sobolev metrics in scientific computing), but rigorous analysis and best-practice guidelines are nascent (Nurbekyan et al., 2022, Bao et al., 12 Dec 2025, Dong et al., 2022).

NGD has shaped scalable, geometry-aware optimization across statistical learning, variational modeling, and scientific computation, with the landscape evolving continuously as algorithmic, hardware, and theoretical innovations progress.
