
Fisher-Orthogonal Projected Natural Gradient

Updated 26 January 2026
  • FOPNG is an optimization framework that integrates natural gradient descent with Fisher-orthogonal projections, ensuring reparameterization invariance and reducing catastrophic forgetting.
  • It employs structured Fisher approximations like EKFAC, KFAC, and diagonal methods to maintain curvature information while scaling efficiently for deep networks.
  • FOPNG is applied in continual and large-batch learning settings, where it improves convergence speed, preserves previous task performance, and enhances overall model robustness.

Fisher-Orthogonal Projected Natural Gradient Descent (FOPNG) is a class of optimization algorithms that combines natural gradient descent, Fisher-information-aware preconditioning, and orthogonal projection. Its defining feature is that the Fisher-Riemannian geometry determines both the step direction and the subspace projection. The resulting updates are invariant under model reparameterization, mitigate catastrophic forgetting on principled grounds, and improve conditioning and robustness in large-scale or sequential learning settings. FOPNG methods have been developed independently across several lines of research, including continual learning, large-batch optimization, and adaptive representation training.

1. Information Geometry and the Natural Gradient

The foundation of FOPNG is the information geometry of parameterized families of probability distributions, where a Riemannian metric is induced by the Fisher information matrix. Given a neural network model $p_\theta(y|x)$, the Fisher information matrix is defined as:

$$F(\theta) = \mathbb{E}_{x}\mathbb{E}_{y|x}\left[\nabla_\theta \log p_\theta(y|x)\, \nabla_\theta \log p_\theta(y|x)^\top\right].$$

The associated natural gradient for a loss function $L(\theta)$ is

$$\delta_{\mathrm{nat}} = F(\theta)^{-1} \nabla_\theta L(\theta),$$

whose negative is the steepest-descent direction on the manifold of output distributions under the Fisher metric. This direction is invariant under smooth, bijective reparameterization and often accelerates convergence relative to the Euclidean gradient (Garg et al., 19 Jan 2026, Yadav et al., 24 Aug 2025).
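
As a small illustration of the definitions above, the following sketch computes the Fisher matrix and the natural gradient for binary logistic regression and checks the invariance claim numerically for a linear reparameterization $\theta = A\phi$. The model, the synthetic data, and the helper names (`fisher_and_grad`, `natural_gradient`) are assumptions chosen for illustration and do not come from the cited papers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fisher_and_grad(theta, X, y):
    """Fisher matrix and mean-NLL gradient for p(y=1|x) = sigmoid(theta @ x).

    The expectation over y|x is taken analytically, which gives
    F = mean_x[ p (1 - p) x x^T ].
    """
    p = sigmoid(X @ theta)
    grad = X.T @ (p - y) / len(y)
    fisher = (X * (p * (1.0 - p))[:, None]).T @ X / len(y)
    return fisher, grad

def natural_gradient(fisher, grad, damping=0.0):
    # Solve (F + damping * I) delta = grad instead of forming F^{-1}.
    return np.linalg.solve(fisher + damping * np.eye(len(grad)), grad)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (rng.random(200) < 0.5).astype(float)
theta = rng.normal(size=3)

F, g = fisher_and_grad(theta, X, y)
delta_theta = natural_gradient(F, g)

# Linear reparameterization theta = A @ phi (A invertible, illustrative).
A = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)
F_phi = A.T @ F @ A          # Fisher transforms as a metric tensor
g_phi = A.T @ g              # gradient transforms by the chain rule
delta_phi = natural_gradient(F_phi, g_phi)

# Map both steps back to theta-space for comparison.
nat_step_back = A @ delta_phi    # matches delta_theta
euclid_step_back = A @ g_phi     # equals A A^T g, generally differs from g
```

The natural-gradient step mapped back to $\theta$-space agrees with the step computed directly in $\theta$, whereas the plain gradient step does not.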

2. Fisher-Orthogonal Projection: Algorithms and Closed-Form Updates

FOPNG augments natural gradient descent with an orthogonal projection step performed in the Fisher metric. In continual learning settings, to prevent interference with previously acquired tasks, FOPNG projects the current natural gradient onto the Fisher-orthogonal complement of stored natural gradients from previous tasks:

$$\tilde{g} = g_\mathrm{nat} - \sum_{v \in S} \frac{\langle g_\mathrm{nat}, v\rangle_F }{\langle v, v\rangle_F} v,$$

where $g_\mathrm{nat} = F^{-1}\nabla_\theta L(\theta)$, $S$ is the set of stored task gradients, and the inner product is $\langle u,v\rangle_F = u^\top F v$. The sum is the exact Fisher-orthogonal projection when the vectors in $S$ are mutually orthogonal under $\langle \cdot,\cdot\rangle_F$ (for example, after Gram-Schmidt in the Fisher metric); otherwise $\tilde{g}$ need not be Fisher-orthogonal to every stored vector.
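
The projection can be sketched in a few lines of NumPy. This is a minimal example under stated assumptions: a dense symmetric positive-definite matrix stands in for $F$, the stored directions are random vectors, and they are first orthogonalized in the Fisher metric so that the per-vector sum is an exact projection. The function names are hypothetical.

```python
import numpy as np

def fisher_inner(u, v, F):
    """Fisher inner product <u, v>_F = u^T F v."""
    return u @ F @ v

def fisher_gram_schmidt(vectors, F, tol=1e-12):
    """Orthogonalize stored directions with respect to <., .>_F."""
    basis = []
    for v in vectors:
        w = np.array(v, dtype=float)
        for b in basis:
            w = w - fisher_inner(w, b, F) / fisher_inner(b, b, F) * b
        if fisher_inner(w, w, F) > tol:   # skip numerically dependent vectors
            basis.append(w)
    return basis

def fisher_project_out(g_nat, basis, F):
    """Remove the components of g_nat along an F-orthogonal basis."""
    g_tilde = g_nat.copy()
    for v in basis:
        g_tilde = g_tilde - fisher_inner(g_nat, v, F) / fisher_inner(v, v, F) * v
    return g_tilde

rng = np.random.default_rng(0)
p = 6
M = rng.normal(size=(p, p))
F = M @ M.T + p * np.eye(p)          # SPD stand-in for the Fisher matrix
grad = rng.normal(size=p)
g_nat = np.linalg.solve(F, grad)

stored = [rng.normal(size=p) for _ in range(3)]   # illustrative stored directions
basis = fisher_gram_schmidt(stored, F)
g_tilde = fisher_project_out(g_nat, basis, F)
```

Because the orthogonalized basis spans the same subspace as the stored vectors, `g_tilde` is Fisher-orthogonal to every original stored direction as well.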

A general constrained optimization formulation, targeting both trust region and orthogonality constraints, leads to the closed-form FOPNG update (Garg et al., 19 Jan 2026):

  • Let $G$ be the matrix of previous task gradients and $F_\mathrm{old}$ the Fisher matrix of old tasks.
  • For a new task with Fisher $F_\mathrm{new}$ and loss gradient $g$:

$$v^* = \epsilon \frac{ [I - F_\mathrm{new}^{-1}G(G^\top F_\mathrm{new}^{-1}G)^{-1}G^\top ] F_\mathrm{new}^{-1}g }{ \sqrt{ [F_\mathrm{new}^{-1}g]^\top [I - F_\mathrm{new}^{-1}G(G^\top F_\mathrm{new}^{-1}G)^{-1}G^\top ] F_\mathrm{new}^{-1}g } }.$$

This projected direction both respects a trust-region constraint in the Fisher norm and removes all components along directions associated with previously learned tasks (Garg et al., 19 Jan 2026).
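
A hedged sketch of the projected direction follows. It assumes $G$ stores one previous-task gradient per column (shape $p \times k$) and uses a dense stand-in for $F_\mathrm{new}$. The projector is applied through linear solves rather than being formed explicitly. The sketch rescales the projected direction so that its $F_\mathrm{new}$-norm equals $\epsilon$, computing that norm directly. This normalization is one reading of the trust-region constraint and an assumption of this example, so it should be checked against Garg et al. before use. The function name `fopng_direction` is hypothetical.

```python
import numpy as np

def fopng_direction(g, G, F_new, eps=0.1, damping=0.0):
    """Projected natural-gradient direction, scaled to Fisher norm eps.

    g     : (p,)   loss gradient on the new task
    G     : (p, k) previous-task gradients, one per column (assumed layout)
    F_new : (p, p) symmetric positive-definite Fisher estimate
    """
    p = F_new.shape[0]
    F = F_new + damping * np.eye(p)
    Finv_g = np.linalg.solve(F, g)            # F^{-1} g
    Finv_G = np.linalg.solve(F, G)            # F^{-1} G, shape (p, k)
    gram = G.T @ Finv_G                       # G^T F^{-1} G, shape (k, k)
    # [I - F^{-1} G (G^T F^{-1} G)^{-1} G^T] F^{-1} g
    proj = Finv_g - Finv_G @ np.linalg.solve(gram, G.T @ Finv_g)
    fisher_norm = np.sqrt(proj @ F @ proj)    # assumed normalization
    return eps * proj / fisher_norm

rng = np.random.default_rng(0)
p, k = 8, 3
M = rng.normal(size=(p, p))
F_new = M @ M.T + p * np.eye(p)
G = rng.normal(size=(p, k))
g = rng.normal(size=p)

v_star = fopng_direction(g, G, F_new, eps=0.1)
```

In this example `G.T @ v_star` vanishes up to round-off, so the step does not change the previous-task losses to first order, and `v_star @ F_new @ v_star` equals `eps ** 2`.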

In large-batch settings, the key geometric operation instead combines an average gradient $g_{\mathrm{avg}}$ and a Fisher-orthogonalized variance direction $g_{\mathrm{diff}}^\perp$, constructed by splitting batches and computing (Lu et al., 19 Aug 2025):

$$g_{\mathrm{diff}}^\perp = g_{\mathrm{diff}} - s_{\mathrm{proj}}\, g_{\mathrm{avg}},$$

where $s_{\mathrm{proj}} = \frac{\langle g_{\mathrm{diff}}, g_{\mathrm{avg}}\rangle_F}{\langle g_{\mathrm{avg}}, g_{\mathrm{avg}}\rangle_F+\varepsilon}$, and the update direction is $g_{\mathrm{comb}} = g_{\mathrm{avg}} + \beta g_{\mathrm{diff}}^\perp$.
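
The combination step can be sketched as follows. The text does not spell out how $g_{\mathrm{diff}}$ is formed, so the example assumes it is the difference of two half-batch gradients and that $g_{\mathrm{avg}}$ is their mean. It also uses a fixed $\beta$ and a dense stand-in for $F$ to keep the example short. The function name is hypothetical.

```python
import numpy as np

def fisher_orthogonal_combine(g1, g2, F, beta=0.5, eps=1e-12):
    """Combine two half-batch gradients as g_avg + beta * g_diff_perp.

    Assumption of this sketch: g_avg = (g1 + g2) / 2 and g_diff = g1 - g2.
    """
    g_avg = 0.5 * (g1 + g2)
    g_diff = g1 - g2
    # Projection coefficient of g_diff onto g_avg in the Fisher metric.
    s_proj = (g_diff @ F @ g_avg) / (g_avg @ F @ g_avg + eps)
    g_diff_perp = g_diff - s_proj * g_avg
    g_comb = g_avg + beta * g_diff_perp
    return g_comb, g_avg, g_diff_perp

rng = np.random.default_rng(0)
p = 5
M = rng.normal(size=(p, p))
F = M @ M.T + p * np.eye(p)                 # SPD stand-in for the Fisher matrix
g1, g2 = rng.normal(size=p), rng.normal(size=p)

g_comb, g_avg, g_diff_perp = fisher_orthogonal_combine(g1, g2, F)
```

Since `g_diff_perp` is Fisher-orthogonal to `g_avg`, adding it leaves the Fisher inner product of the update with `g_avg` unchanged and only contributes information from the complementary directions.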

3. Efficient Fisher Preconditioning: EKFAC, KFAC, and Whitening

Direct manipulation of the full Fisher matrix is computationally intractable for modern neural networks. FOPNG implementations employ a variety of structured or diagonal approximations:

  • EKFAC/KFAC: Layerwise Kronecker-factored approximations, sometimes diagonalized in a joint eigenbasis, enable tractable inversion and preconditioning. For block $F_\ell \approx A_\ell \otimes B_\ell$, preconditioning follows

$$g_\ell^\mathrm{nat} \approx (Q_{A_\ell} \otimes Q_{B_\ell})\,(\Lambda_{A_\ell}^{-1} \otimes \Lambda_{B_\ell}^{-1})\,(Q_{A_\ell} \otimes Q_{B_\ell})^\top g_\ell,$$

typically with damping (Yadav et al., 24 Aug 2025, Lu et al., 19 Aug 2025).

  • Diagonal Fisher: In continual learning applications, the empirical diagonal Fisher

$$\widehat F_\mathrm{diag}(\theta) = \frac{1}{|D|}\sum_{(x,y) \in D} [\nabla_\theta \log p_\theta(y|x)] \odot [\nabla_\theta \log p_\theta(y|x)]$$

is leveraged for $O(p)$ memory and computation (Garg et al., 19 Jan 2026).

  • PRONG: Whitening methods orthonormalize per-layer activations under the Fisher metric, transforming coordinates so that the Fisher matrix is close to block-diagonal identity, thus making the natural gradient effectively Euclidean in the whitened parameterization (Desjardins et al., 2015).
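
The first two approximations in the list above can be sketched in NumPy. The Kronecker-factored routine applies the eigenbasis formula without ever forming $A_\ell \otimes B_\ell$. It assumes the layer gradient is stored as an $a \times b$ matrix flattened in row-major order and that damping is added to the product of eigenvalues. The diagonal routine averages squared per-example gradients. Function names and shapes are illustrative assumptions, not the API of any KFAC library.

```python
import numpy as np

def kfac_eigen_precondition(grad, A, B, damping=1e-3):
    """Apply (A kron B + damping * I)^{-1} to a layer gradient of shape (a, b).

    A is (a, a) and B is (b, b), both symmetric positive definite. With
    row-major flattening, (A kron B) vec(X) = vec(A X B^T).
    """
    lam_A, Q_A = np.linalg.eigh(A)
    lam_B, Q_B = np.linalg.eigh(B)
    grad_eig = Q_A.T @ grad @ Q_B                    # rotate into the joint eigenbasis
    grad_eig = grad_eig / (np.outer(lam_A, lam_B) + damping)
    return Q_A @ grad_eig @ Q_B.T                    # rotate back

def diagonal_empirical_fisher(per_example_grads):
    """Mean of elementwise-squared per-example log-likelihood gradients."""
    return np.mean(per_example_grads ** 2, axis=0)

rng = np.random.default_rng(0)

def random_spd(n):
    M = rng.normal(size=(n, n))
    return M @ M.T + n * np.eye(n)

A, B = random_spd(3), random_spd(4)
layer_grad = rng.normal(size=(3, 4))
nat_grad = kfac_eigen_precondition(layer_grad, A, B, damping=0.0)

# Dense reference, feasible only at toy sizes.
dense_ref = np.linalg.solve(np.kron(A, B), layer_grad.reshape(-1)).reshape(3, 4)

per_example = rng.normal(size=(50, 5))               # 50 examples, 5 parameters
fisher_diag = diagonal_empirical_fisher(per_example)
```

The eigenbasis route needs only the two small eigendecompositions, and the diagonal estimate stores one number per parameter, which is where the $O(p)$ cost comes from.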

4. Theoretical Guarantees: Orthogonality, Descent, and Invariance

FOPNG’s update directions possess a suite of desirable properties:

  • Descent Guarantee: The projected natural gradient $-\tilde{g}$ remains a valid descent direction under the Fisher metric, i.e.,

$$\langle -\tilde{g}, g_\mathrm{nat} \rangle_F = -\|\tilde{g}\|_F^2 \le 0.$$

By removing components in the span of previous gradients, performance on earlier tasks remains unchanged to first order (Yadav et al., 24 Aug 2025, Garg et al., 19 Jan 2026).

  • Reparameterization Invariance: By construction, all objectives and constraints are expressed in Fisher inner products, guaranteeing update invariance under smooth reparameterizations of the network parameters (Garg et al., 19 Jan 2026).
  • Preservation of Old Task Outputs: For continual learning, the projection in Fisher space ensures that the first-order change in KL divergence for previous tasks vanishes, maintaining past performance (Garg et al., 19 Jan 2026).
  • Curvature Information: In large-batch regimes, variance corrections by Fisher-orthogonal projectors mitigate the loss of curvature information due to extreme damping in standard KFAC or NGD, preserving second-order scaling (Lu et al., 19 Aug 2025).
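
The descent guarantee in the first item follows from a short calculation. The sketch below assumes that $\tilde{g}$ is the exact Fisher-orthogonal projection of $g_\mathrm{nat}$ onto the complement of $\mathrm{span}(S)$ and that one symmetric positive-definite $F$ is used throughout. The symbol $r$ for the removed component is introduced only for this derivation.

$$
\begin{aligned}
g_\mathrm{nat} &= \tilde{g} + r, \qquad r \in \mathrm{span}(S), \qquad \langle \tilde{g}, v\rangle_F = 0 \ \ \text{for all } v \in S,\\
\langle \tilde{g}, g_\mathrm{nat}\rangle_F &= \langle \tilde{g}, \tilde{g}\rangle_F + \langle \tilde{g}, r\rangle_F = \|\tilde{g}\|_F^2,\\
\nabla_\theta L(\theta)^\top(-\tilde{g}) &= -\tilde{g}^\top F\, F^{-1}\nabla_\theta L(\theta) = -\langle \tilde{g}, g_\mathrm{nat}\rangle_F = -\|\tilde{g}\|_F^2 \le 0.
\end{aligned}
$$

The last line says that the first-order change in the loss along $-\tilde{g}$ is non-positive. The same manipulation covers old tasks: if a stored direction has the form $v = F^{-1}\nabla_\theta L_\mathrm{old}(\theta)$, with $L_\mathrm{old}$ denoting a previous-task loss (notation assumed here), then $\nabla_\theta L_\mathrm{old}(\theta)^\top \tilde{g} = \langle \tilde{g}, v\rangle_F = 0$, so that loss is unchanged to first order.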

5. Variants and Practical Implementations

Several concrete instantiations of the FOPNG paradigm have been developed, each tailored to a specific optimization challenge:

| Variant/Method | Setting | Fisher Approximation | Projection Basis |
|---|---|---|---|
| ONG/FOPNG | Continual learning | EKFAC (block; KFAC) | Stored natural grad. |
| FOPNG (large-batch) | Large-batch optimization | KFAC | Batch-var orthogonal |
| PRONG | General | Layerwise whitening | ZCA whitening |
| FOPNG-diag | Continual learning | Diagonal empirical | Past task gradients |

Continual learning variants maintain a buffer of previous task gradients and periodically update an empirical Fisher (either retaining or exponentially averaging Fisher estimates). FOPNG in large-batch scenarios adaptively computes a layerwise mixing coefficient $\beta$ and periodically updates the KFAC-based curvature estimates. In practice, diagonal or structured Fisher approximations allow scaling to modern network sizes: per-batch complexity is $O(p + pk + k^3)$ for buffer size $k$, while Fisher/KFAC updates have the same amortized cost profile as their underlying natural gradient variants (Yadav et al., 24 Aug 2025, Garg et al., 19 Jan 2026, Lu et al., 19 Aug 2025, Desjardins et al., 2015).
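
The bookkeeping described above might look like the following sketch, which keeps an exponentially averaged diagonal Fisher and a bounded buffer of task gradients. The class `FisherBuffer`, its method names, and the eviction policy (drop the oldest gradient once the buffer is full) are hypothetical choices made for illustration, not details taken from the cited papers.

```python
from collections import deque

import numpy as np

class FisherBuffer:
    """Hypothetical helper: diagonal-Fisher EMA plus a bounded gradient buffer."""

    def __init__(self, num_params, buffer_size, ema_decay=0.9):
        self.fisher_diag = np.zeros(num_params)
        self.ema_decay = ema_decay
        self.buffer = deque(maxlen=buffer_size)   # oldest entry is evicted first
        self._initialized = False

    def update_fisher(self, per_example_grads):
        """Blend a fresh diagonal empirical Fisher into the running estimate."""
        new_estimate = np.mean(np.asarray(per_example_grads) ** 2, axis=0)
        if not self._initialized:
            self.fisher_diag = new_estimate
            self._initialized = True
        else:
            d = self.ema_decay
            self.fisher_diag = d * self.fisher_diag + (1.0 - d) * new_estimate

    def store(self, task_grad):
        self.buffer.append(np.array(task_grad, dtype=float))

    def gradient_matrix(self):
        """Stack stored gradients as columns, giving shape (p, k)."""
        return np.stack(list(self.buffer), axis=1)

buf = FisherBuffer(num_params=2, buffer_size=2, ema_decay=0.5)
buf.update_fisher([[1.0, 2.0], [1.0, 2.0]])   # first estimate: [1, 4]
buf.update_fisher([[3.0, 0.0]])               # blended with [9, 0]
for grad in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    buf.store(grad)
G = buf.gradient_matrix()
```

With a buffer of size $k$, the matrix returned by `gradient_matrix` has $k$ columns, so any subsequent $k \times k$ solve stays small regardless of the parameter count $p$.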

6. Empirical Performance and Benchmarking

Benchmarks across continual learning and large-batch regimes reveal distinctive performance trends for FOPNG variants:

  • Continual Learning (Garg et al., 19 Jan 2026, Yadav et al., 24 Aug 2025):
    • On Split-MNIST, Split-CIFAR10, and Rotated-MNIST, FOPNG and its PreFisher variant outperform EWC, OGD, and vanilla natural gradient by 3–7% in average accuracy, with superior retention of prior task performance.
    • On Permuted-MNIST, FOPNG underperforms slightly, likely due to the highly out-of-distribution character of the task sequence.
    • FOPNG exhibits robustness to hyperparameter choices, including the Fisher EMA weight and regularization $\lambda$.
  • Large-Batch Training (Lu et al., 19 Aug 2025):
    • FOPNG delivers faster convergence and improved generalization on CIFAR-10, ImageNet-100, ImageNet-1K, and long-tailed CIFAR-LT, surpassing both SGD/AdamW and conventional KFAC in time-to-target accuracy and error reduction.
    • Only FOPNG maintained effectiveness in the extreme large-batch regime (e.g., $B=50{,}000$).
  • Activation Whitening (PRONG) (Desjardins et al., 2015):
    • Dramatically improves Fisher conditioning (e.g., condition number reduction from ≈200 to ≈10), which yields up to 10× fewer required updates for convergence on autoencoder/MNIST tasks.
    • Applied to high-dimensional image classification (e.g., ImageNet), PRONG achieves or exceeds BatchNorm’s test accuracy and convergence speed despite moderate amortized overhead.

Per-task results and ablations in the cited papers support the same picture: FOPNG variants scale, retain earlier-task knowledge, and generalize well across a spectrum of data regimes.

7. Connections, Limitations, and Theoretical Context

FOPNG unifies the viewpoints of orthogonal projection (OGD) and information geometry, generalizing classical OGD from Euclidean to Fisher space. Unlike projection methods in parameter space, Fisher-orthogonality aligns with the sensitivity of a model’s output distributions, not just raw parameters, providing principled guarantees for both robustness and representational invariance (Garg et al., 19 Jan 2026, Yadav et al., 24 Aug 2025).

While theoretically sound, some instantiations (e.g., ONG on simple MNIST permutations) reveal a trade-off in practice: overly aggressive Fisher preconditioning or orthogonality enforcement can impair adaptation rate in highly variant or out-of-distribution settings (Yadav et al., 24 Aug 2025). In large-scale scenarios, the main computational bottleneck derives from Fisher estimation and matrix operations, which are alleviated through structured and amortized approaches.

Overall, Fisher-Orthogonal Projected Natural Gradient Descent provides an adaptable optimization meta-scheme for scenarios where both geometry and past knowledge constraints must be balanced, and has been substantiated by recent results across continual, large-batch, and deep network training domains (Garg et al., 19 Jan 2026, Yadav et al., 24 Aug 2025, Lu et al., 19 Aug 2025, Desjardins et al., 2015).
