Fisher-Orthogonal Projected Natural Gradient
- FOPNG is an optimization framework that integrates natural gradient descent with Fisher-orthogonal projections, ensuring reparameterization invariance and reducing catastrophic forgetting.
- It employs structured Fisher approximations like EKFAC, KFAC, and diagonal methods to maintain curvature information while scaling efficiently for deep networks.
- FOPNG is applied in continual and large-batch learning settings, where it improves convergence speed, preserves previous task performance, and enhances overall model robustness.
Fisher-Orthogonal Projected Natural Gradient Descent (FOPNG) is a class of optimization algorithms that combines natural gradient descent, Fisher information-aware preconditioning, and orthogonal projection. Its defining feature is the explicit use of the Fisher-Riemannian geometric structure to inform both step directions and subspace projections. The resulting update schemes are invariant under model reparameterization, mitigate catastrophic forgetting in a theoretically grounded way, and improve conditioning and robustness in large-scale or sequential learning settings. FOPNG methods have been developed independently across several lines of research, including continual learning, large-batch optimization, and adaptive representation training.
1. Information Geometry and the Natural Gradient
The foundation of FOPNG is the information geometry of parameterized families of probability distributions, where a Riemannian metric is induced by the Fisher information matrix. Given a neural network model $p_\theta(y \mid x)$, the Fisher information matrix is defined as:
$$F(\theta) = \mathbb{E}_{x,\; y \sim p_\theta(\cdot \mid x)}\left[\nabla_\theta \log p_\theta(y \mid x)\, \nabla_\theta \log p_\theta(y \mid x)^\top\right]$$
The associated natural gradient for a loss function $L(\theta)$ is
$$\tilde{\nabla}_\theta L = F(\theta)^{-1}\, \nabla_\theta L(\theta),$$
which gives the steepest-descent direction on the manifold of output distributions under the Fisher metric. This direction is invariant under smooth, bijective reparameterization and often accelerates convergence relative to the Euclidean gradient (Garg et al., 19 Jan 2026, Yadav et al., 24 Aug 2025).
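As a minimal sketch of the definitions above, the following NumPy example computes the exact Fisher matrix and the natural gradient for a small Bernoulli logistic-regression model, then checks the invariance claim under a linear reparameterization. The model, the synthetic data, the matrix `A`, and the small `damping` term are illustrative assumptions, not taken from the cited papers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fisher_logistic(theta, X):
    """Exact Fisher of a Bernoulli logistic model, averaged over the inputs X."""
    p = sigmoid(X @ theta)
    w = p * (1.0 - p)                      # Var[y | x] under the model
    return (X * w[:, None]).T @ X / len(X)

def nll_grad(theta, X, y):
    """Gradient of the mean negative log-likelihood."""
    return X.T @ (sigmoid(X @ theta) - y) / len(X)

def natural_gradient(theta, X, y, damping=1e-8):
    """Solve (F + damping * I) d = g instead of forming F^{-1} explicitly."""
    F = fisher_logistic(theta, X)
    g = nll_grad(theta, X, y)
    return np.linalg.solve(F + damping * np.eye(len(theta)), g)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.0, -2.0, 0.5])    # assumed ground truth for synthetic labels
y = (rng.uniform(size=200) < sigmoid(X @ true_theta)).astype(float)
theta = np.zeros(3)

# Reparameterize theta = A @ phi with A invertible (det(A) = 5).
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, -1.0],
              [1.0, 0.0, 3.0]])
phi = np.linalg.solve(A, theta)

step_theta = natural_gradient(theta, X, y)
# In phi-coordinates the logits are (X @ A) @ phi, so the same code applies.
step_phi = natural_gradient(phi, X @ A, y)
step_from_phi = A @ step_phi               # map the phi-step back to theta-space

print(np.allclose(step_theta, step_from_phi, atol=1e-5))
```

The Euclidean gradient would not pass the same check: in phi-coordinates it becomes `A.T @ g`, which maps back to `A @ A.T @ g` rather than `g`.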
2. Fisher-Orthogonal Projection: Algorithms and Closed-Form Updates
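FOPNG augments natural gradient descent with an orthogonal projection step performed in the Fisher metric. In continual learning settings, to prevent interference with previously acquired tasks, FOPNG projects the current natural gradient onto the Fisher-orthogonal complement of stored natural gradients from previous tasks:
$$\tilde g_\perp = \tilde g - \sum_{u \in S} \frac{\langle \tilde g, u \rangle_F}{\langle u, u \rangle_F}\, u,$$
where $\tilde g = F^{-1} \nabla_\theta L$, $S$ is the set of stored task gradients, and the inner product is $\langle u, v \rangle_F = u^\top F v$. The sum is an exact projection when the stored directions are mutually Fisher-orthogonal (for example after Gram-Schmidt in the Fisher metric); otherwise the general form $\tilde g_\perp = \tilde g - U (U^\top F U)^{-1} U^\top F\, \tilde g$ applies, with the elements of $S$ as the columns of $U$.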
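The projection step can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions: the stored directions are stacked as the columns of a hypothetical matrix `U`, the Fisher matrix `F` is a dense random positive-definite stand-in, and a tiny `ridge` term keeps the Gram matrix invertible. It uses the general projector, so `U` does not need to be orthogonalized first.

```python
import numpy as np

def fisher_inner(u, v, F):
    """Fisher inner product <u, v>_F = u^T F v."""
    return u @ F @ v

def fisher_project_out(v, U, F, ridge=1e-10):
    """Remove from v its Fisher-orthogonal projection onto span(U).

    U holds one stored direction per column; columns need not be orthogonal.
    """
    gram = U.T @ F @ U + ridge * np.eye(U.shape[1])   # Gram matrix in the Fisher metric
    coeffs = np.linalg.solve(gram, U.T @ F @ v)
    return v - U @ coeffs

rng = np.random.default_rng(1)
d, k = 6, 2
M = rng.normal(size=(d, d))
F = M @ M.T + 0.1 * np.eye(d)          # stand-in symmetric positive-definite Fisher
U = rng.normal(size=(d, k))            # stand-in stored directions
g = rng.normal(size=d)                 # stand-in loss gradient

nat_grad = np.linalg.solve(F, g)
nat_grad_perp = fisher_project_out(nat_grad, U, F)

# Every stored direction is now Fisher-orthogonal to the projected step.
residual = np.array([fisher_inner(U[:, i], nat_grad_perp, F) for i in range(k)])
print(np.abs(residual).max())
```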
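A general constrained optimization formulation, targeting both trust region and orthogonality constraints, leads to the closed-form FOPNG update (Garg et al., 19 Jan 2026):
- Let $G$ be the matrix of previous task gradients and $F_{\text{old}}$ the Fisher matrix of old tasks.
- For a new task with Fisher $F_{\text{new}}$ and loss gradient $g$:
$$v^\star = -\epsilon\, \frac{F_{\text{new}}^{-1} P\, g}{\sqrt{g^\top F_{\text{new}}^{-1} P\, g}}, \qquad P = I - F_{\text{old}} G \left(G^\top F_{\text{old}} F_{\text{new}}^{-1} F_{\text{old}} G\right)^{-1} G^\top F_{\text{old}} F_{\text{new}}^{-1},$$
where $\epsilon$ is the trust-region radius. This is the minimizer of $g^\top v$ subject to $v^\top F_{\text{new}} v \le \epsilon^2$ and $G^\top F_{\text{old}} v = 0$.
This projected direction both respects a trust-region constraint in the Fisher norm and removes all components in directions associated with previously learned tasks (Garg et al., 19 Jan 2026).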
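A hedged NumPy sketch of such a constrained step is shown below. It solves one natural reading of the problem, namely minimizing `g @ v` subject to a Fisher-norm trust region `v @ F_new @ v <= radius**2` and the orthogonality constraint `G.T @ F_old @ v = 0`. The function name `fopng_direction`, the `radius` and `ridge` parameters, and the random stand-in matrices are assumptions for illustration and may differ from the formulation in the cited paper.

```python
import numpy as np

def fopng_direction(g, G, F_old, F_new, radius=1.0, ridge=1e-10):
    """Minimize g^T v  s.t.  v^T F_new v <= radius^2  and  G^T F_old v = 0."""
    A = F_old @ G                                   # constraint normals: A^T v = 0
    Finv_g = np.linalg.solve(F_new, g)
    Finv_A = np.linalg.solve(F_new, A)
    gram = A.T @ Finv_A + ridge * np.eye(A.shape[1])
    mu = np.linalg.solve(gram, A.T @ Finv_g)        # Lagrange multipliers (up to scale)
    d = Finv_g - Finv_A @ mu                        # equals F_new^{-1} P g
    norm = np.sqrt(max(g @ d, 1e-30))               # g^T d equals d^T F_new d
    return -radius * d / norm

def random_spd(rng, dim):
    """Stand-in symmetric positive-definite matrix used in place of a real Fisher."""
    M = rng.normal(size=(dim, dim))
    return M @ M.T + 0.1 * np.eye(dim)

rng = np.random.default_rng(2)
dim, k = 8, 3
F_old, F_new = random_spd(rng, dim), random_spd(rng, dim)
G = rng.normal(size=(dim, k))                       # previous task gradients (columns)
g = rng.normal(size=dim)                            # new-task loss gradient

v = fopng_direction(g, G, F_old, F_new, radius=0.5)
print(np.abs(G.T @ F_old @ v).max())                # orthogonality residual
print(np.sqrt(v @ F_new @ v))                       # Fisher norm of the step
```

Using `np.linalg.solve` rather than explicit inverses keeps the sketch numerically stable; with diagonal or Kronecker-factored Fishers the same algebra applies with cheaper solves.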
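In large-batch settings, the key geometric operation instead combines an average gradient $g_{\text{avg}}$ and a Fisher-orthogonalized variance direction $g_{\text{diff}}^{\perp}$, constructed by splitting each batch into two sub-batches with gradients $g_1$ and $g_2$ and computing (Lu et al., 19 Aug 2025):
$$g_{\text{avg}} = \tfrac{1}{2}(g_1 + g_2), \qquad g_{\text{diff}} = g_1 - g_2, \qquad g_{\text{diff}}^{\perp} = g_{\text{diff}} - \frac{\langle g_{\text{diff}}, g_{\text{avg}} \rangle_F}{\langle g_{\text{avg}}, g_{\text{avg}} \rangle_F}\, g_{\text{avg}},$$
where $\langle u, v \rangle_F = u^\top F v$, and the update direction is $F^{-1}\left(g_{\text{avg}} + \alpha\, g_{\text{diff}}^{\perp}\right)$ with mixing coefficient $\alpha$.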
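The two-sub-batch construction can be sketched as follows. This is a minimal sketch with assumptions: a dense stand-in Fisher matrix `F`, a fixed scalar `alpha` (the source describes an adaptively computed, layerwise coefficient), and an added `damping` term for the final solve.

```python
import numpy as np

def fisher_orthogonal_diff(g1, g2, F):
    """Average gradient and the part of the difference Fisher-orthogonal to it."""
    g_avg = 0.5 * (g1 + g2)
    g_diff = g1 - g2
    denom = g_avg @ F @ g_avg
    coef = (g_diff @ F @ g_avg) / denom if denom > 0 else 0.0
    return g_avg, g_diff - coef * g_avg

def fop_direction(g1, g2, F, alpha=0.1, damping=1e-3):
    """Precondition the combined gradient with the damped Fisher."""
    g_avg, g_perp = fisher_orthogonal_diff(g1, g2, F)
    combined = g_avg + alpha * g_perp
    return np.linalg.solve(F + damping * np.eye(len(combined)), combined)

rng = np.random.default_rng(3)
dim = 10
M = rng.normal(size=(dim, dim))
F = M @ M.T + 0.1 * np.eye(dim)        # stand-in Fisher matrix
g1 = rng.normal(size=dim)              # gradient on sub-batch 1
g2 = rng.normal(size=dim)              # gradient on sub-batch 2

g_avg, g_perp = fisher_orthogonal_diff(g1, g2, F)
direction = fop_direction(g1, g2, F)
print(g_perp @ F @ g_avg)              # Fisher inner product, zero up to round-off
```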
3. Efficient Fisher Preconditioning: EKFAC, KFAC, and Whitening
Direct manipulation of the full Fisher matrix is computationally intractable for modern neural networks. FOPNG implementations employ a variety of structured or diagonal approximations:
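- EKFAC/KFAC: Layerwise Kronecker-factored approximations, sometimes diagonalized in a joint eigenbasis, enable tractable inversion and preconditioning. For block $\ell$, with $F_\ell \approx A_\ell \otimes G_\ell$ built from the second moments of the layer's input activations and backpropagated gradients, preconditioning follows
$$\tilde g_\ell = (F_\ell + \lambda I)^{-1} g_\ell,$$
typically with damping $\lambda > 0$ (Yadav et al., 24 Aug 2025, Lu et al., 19 Aug 2025).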
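- Diagonal Fisher: In continual learning applications, the empirical diagonal Fisher
$$\hat F_{\text{diag}} = \operatorname{diag}\!\left(\frac{1}{N} \sum_{n=1}^{N} \nabla_\theta \log p_\theta(y_n \mid x_n) \odot \nabla_\theta \log p_\theta(y_n \mid x_n)\right),$$
computed over $N$ training examples $(x_n, y_n)$, is used to reduce memory and computation (Garg et al., 19 Jan 2026).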
- PRONG: Whitening methods orthonormalize per-layer activations under the Fisher metric, transforming coordinates so that the Fisher matrix is close to block-diagonal identity, thus making the natural gradient effectively Euclidean in the whitened parameterization (Desjardins et al., 2015).
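To illustrate why the Kronecker structure makes preconditioning tractable, the sketch below applies a KFAC-style preconditioner to one linear layer using only the two small factors, and compares it with the equivalent dense solve. It assumes factored damping (damping added to each factor separately), which is one common convention and not necessarily the one used in the cited papers. The layer sizes and random data are illustrative.

```python
import numpy as np

def kfac_precondition(grad_W, A, G, damping=1e-2):
    """Return (G + damping I)^{-1} grad_W (A + damping I)^{-1} for one layer.

    grad_W: (n_out, n_in) weight gradient
    A:      (n_in, n_in)  second moment of layer inputs
    G:      (n_out, n_out) second moment of backpropagated output gradients
    """
    A_d = A + damping * np.eye(A.shape[0])
    G_d = G + damping * np.eye(G.shape[0])
    left = np.linalg.solve(G_d, grad_W)            # G_d^{-1} grad_W
    return np.linalg.solve(A_d, left.T).T          # right-multiply by A_d^{-1} (A_d symmetric)

rng = np.random.default_rng(4)
n, n_in, n_out = 64, 4, 3
acts = rng.normal(size=(n, n_in))                  # layer inputs
deltas = rng.normal(size=(n, n_out))               # backpropagated output gradients
A = acts.T @ acts / n
G = deltas.T @ deltas / n
grad_W = deltas.T @ acts / n

fast = kfac_precondition(grad_W, A, G)

# Dense reference: with column-major vec, vec(B X C) = (C^T kron B) vec(X).
damping = 1e-2
dense_F = np.kron(A + damping * np.eye(n_in), G + damping * np.eye(n_out))
slow = np.linalg.solve(dense_F, grad_W.flatten(order="F")).reshape(grad_W.shape, order="F")

print(np.allclose(fast, slow))
```

The factored route only ever solves systems of size `n_in` and `n_out`, whereas the dense reference works with an `(n_in * n_out)`-sized matrix.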
4. Theoretical Guarantees: Orthogonality, Descent, and Invariance
FOPNG’s update directions possess a suite of desirable properties:
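- Descent Guarantee: The projected natural gradient remains a valid descent direction under the Fisher metric, i.e., writing $g = \nabla_\theta L$ and $\tilde g_\perp$ for the projected natural gradient,
$$g^\top \tilde g_\perp = \lVert \tilde g_\perp \rVert_F^2 \ge 0,$$
so a step along $-\tilde g_\perp$ does not increase the loss to first order. By removing components in the span of previous gradients, performance on earlier tasks remains unchanged to first order (Yadav et al., 24 Aug 2025, Garg et al., 19 Jan 2026).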
- Reparameterization Invariance: By construction, all objectives and constraints are expressed in Fisher inner products, guaranteeing update invariance under smooth reparameterizations of the network parameters (Garg et al., 19 Jan 2026).
- Preservation of Old Task Outputs: For continual learning, the projection in Fisher space ensures that the first-order change in KL divergence for previous tasks vanishes, maintaining past performance (Garg et al., 19 Jan 2026).
- Curvature Information: In large-batch regimes, variance corrections by Fisher-orthogonal projectors mitigate the loss of curvature information due to extreme damping in standard KFAC or NGD, preserving second-order scaling (Lu et al., 19 Aug 2025).
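The descent and orthogonality properties follow from two identities of the Fisher-orthogonal projector. The short derivation below is a sketch for the single-Fisher case, where the stored directions form the columns of a matrix $U$ (a notational assumption) and $U^\top F U$ is invertible.

```latex
\begin{align*}
P &= I - U\,(U^\top F U)^{-1} U^\top F,
  \qquad \tilde g = F^{-1} g, \qquad \tilde g_\perp = P\,\tilde g \\[4pt]
P^2 &= P, \qquad F P = P^\top F
  && \text{($P$ is idempotent and self-adjoint in the Fisher metric)} \\[4pt]
U^\top F\, \tilde g_\perp
  &= \bigl(U^\top F - U^\top F U\,(U^\top F U)^{-1} U^\top F\bigr)\tilde g = 0
  && \text{(Fisher-orthogonality to stored directions)} \\[4pt]
g^\top \tilde g_\perp
  &= \tilde g^\top F P\, \tilde g
   = \tilde g^\top F P^2\, \tilde g
   = \tilde g^\top P^\top F P\, \tilde g
   = \lVert \tilde g_\perp \rVert_F^2 \;\ge\; 0
  && \text{(descent along } -\tilde g_\perp\text{)}
\end{align*}
```

Equality in the last line holds only when the natural gradient lies entirely in the span of the stored directions, in which case the projected step is zero.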
5. Variants and Practical Implementations
Several concrete instantiations of the FOPNG paradigm have been developed, each tailored to a specific optimization challenge:
| Variant/Method | Setting | Fisher Approximation | Projection Basis |
|---|---|---|---|
| ONG/FOPNG | Continual learning | EKFAC (block; KFAC) | Stored natural grad. |
| FOPNG (large-batch) | Large-batch optimization | KFAC | Batch-var orthogonal |
| PRONG | General | Layerwise whitening | ZCA whitening |
| FOPNG-diag | Continual learning | Diagonal empirical | Past task gradients |
Continual learning variants maintain a buffer of previous task gradients and periodically update an empirical Fisher (either retaining or exponentially averaging Fisher estimates). FOPNG in large-batch scenarios adaptively computes a layerwise mixing coefficient and periodically updates the KFAC-based curvature estimates. In practice, diagonal or structured Fisher approximations allow scaling to modern network sizes: the per-batch projection cost grows with the number of stored gradients in the buffer, while Fisher/KFAC updates have the same amortized cost profile as their underlying natural gradient variants (Yadav et al., 24 Aug 2025, Garg et al., 19 Jan 2026, Lu et al., 19 Aug 2025, Desjardins et al., 2015).
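The bookkeeping described above can be sketched with a small helper class. Everything here is a hypothetical illustration: the class name `DiagFisherMemory`, the EMA weight, the first-in-first-out buffer policy, and the `ridge` term are assumptions, not details confirmed by the cited papers.

```python
import numpy as np

class DiagFisherMemory:
    """Diagonal empirical Fisher (EMA) plus a bounded buffer of task gradients."""

    def __init__(self, dim, ema=0.9, max_buffer=10):
        self.fisher = np.zeros(dim)
        self.ema = ema
        self.max_buffer = max_buffer
        self.buffer = []
        self._initialized = False

    def update_fisher(self, per_example_grads):
        """per_example_grads: (batch, dim) array of per-example gradients."""
        estimate = np.mean(per_example_grads ** 2, axis=0)
        if not self._initialized:
            self.fisher, self._initialized = estimate, True
        else:
            self.fisher = self.ema * self.fisher + (1.0 - self.ema) * estimate

    def store(self, grad):
        """Keep the most recent max_buffer gradients (assumed FIFO policy)."""
        self.buffer.append(grad.copy())
        if len(self.buffer) > self.max_buffer:
            self.buffer.pop(0)

    def project(self, v, ridge=1e-10):
        """Fisher-orthogonal projection; O(d k^2 + k^3) for k stored gradients."""
        if not self.buffer:
            return v
        U = np.stack(self.buffer, axis=1)           # (dim, k)
        FU = self.fisher[:, None] * U               # diag(F) @ U without forming diag(F)
        gram = U.T @ FU + ridge * np.eye(U.shape[1])
        return v - U @ np.linalg.solve(gram, FU.T @ v)

rng = np.random.default_rng(5)
dim = 12
memory = DiagFisherMemory(dim, ema=0.9, max_buffer=3)
for task in range(5):
    per_example_grads = rng.normal(size=(32, dim))  # stand-in per-example gradients
    memory.update_fisher(per_example_grads)
    memory.store(per_example_grads.mean(axis=0))

projected = memory.project(rng.normal(size=dim))
U = np.stack(memory.buffer, axis=1)
residual = U.T @ (memory.fisher * projected)
print(len(memory.buffer), np.abs(residual).max())
```

Because the Fisher is diagonal, the projection never materializes a `dim x dim` matrix, which is what makes the buffer-based variant practical for large models.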
6. Empirical Performance and Benchmarking
Benchmarks across continual learning and large-batch regimes reveal distinctive performance trends for FOPNG variants:
- Continual Learning (Garg et al., 19 Jan 2026, Yadav et al., 24 Aug 2025):
  - On Split-MNIST, Split-CIFAR10, and Rotated-MNIST, FOPNG and its PreFisher variant outperform EWC, OGD, and vanilla natural gradient by 3–7% in average accuracy, with superior retention of prior task performance.
  - On Permuted-MNIST, FOPNG underperforms slightly, likely due to the highly out-of-distribution character of the task sequence.
  - FOPNG exhibits robustness to hyperparameter choices, including the Fisher EMA weight and the regularization strength.
- Large-Batch Training (Lu et al., 19 Aug 2025):
  - FOPNG delivers faster convergence and improved generalization on CIFAR-10, ImageNet-100, ImageNet-1K, and long-tailed CIFAR-LT, surpassing both SGD/AdamW and conventional KFAC in time-to-target accuracy and error reduction.
  - Only FOPNG maintained effectiveness in the extreme large-batch regime.
- Activation Whitening (PRONG) (Desjardins et al., 2015):
  - Dramatically improves Fisher conditioning (e.g., condition number reduction from ≈200 to ≈10), which yields up to 10× fewer required updates for convergence on autoencoder/MNIST tasks.
  - Applied to high-dimensional image classification (e.g., ImageNet), PRONG achieves or exceeds BatchNorm’s test accuracy and convergence speed despite moderate amortized overhead.
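The conditioning effect of whitening can be illustrated on synthetic activations. The sketch below is not PRONG itself: it only applies ZCA whitening to one set of correlated activations and compares the condition number of their covariance before and after. For a linear layer this covariance is the input-side factor of the layer's Fisher block. The data, the `eps` regularizer, and the function name are illustrative assumptions.

```python
import numpy as np

def zca_whitening_matrix(acts, eps=1e-5):
    """Symmetric whitening matrix W with W @ cov @ W close to the identity."""
    centered = acts - acts.mean(axis=0)
    cov = centered.T @ centered / len(acts)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T

rng = np.random.default_rng(6)
dim = 8
scales = np.linspace(0.5, 5.0, dim)                 # per-direction standard deviations
Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))    # random rotation to correlate features
acts = (rng.normal(size=(5000, dim)) * scales) @ Q.T

W = zca_whitening_matrix(acts)
centered = acts - acts.mean(axis=0)
cov_before = centered.T @ centered / len(acts)
whitened = centered @ W
cov_after = whitened.T @ whitened / len(acts)

cond_before = np.linalg.cond(cov_before)
cond_after = np.linalg.cond(cov_after)
print(cond_before, cond_after)
```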
Detailed per-task results and ablations confirm FOPNG’s ability to scale, preserve memory, and maintain strong generalization across a spectrum of data regimes.
7. Connections, Limitations, and Theoretical Context
FOPNG unifies the viewpoints of orthogonal projection (OGD) and information geometry, generalizing classical OGD from Euclidean to Fisher space. Unlike projection methods in parameter space, Fisher-orthogonality aligns with the sensitivity of a model’s output distributions, not just raw parameters, providing principled guarantees for both robustness and representational invariance (Garg et al., 19 Jan 2026, Yadav et al., 24 Aug 2025).
While theoretically sound, some instantiations (e.g., ONG on simple MNIST permutations) reveal a trade-off in practice: overly aggressive Fisher preconditioning or orthogonality enforcement can impair adaptation rate in highly variant or out-of-distribution settings (Yadav et al., 24 Aug 2025). In large-scale scenarios, the main computational bottleneck derives from Fisher estimation and matrix operations, which are alleviated through structured and amortized approaches.
Overall, Fisher-Orthogonal Projected Natural Gradient Descent provides an adaptable optimization meta-scheme for scenarios where both geometry and past knowledge constraints must be balanced, and has been substantiated by recent results across continual, large-batch, and deep network training domains (Garg et al., 19 Jan 2026, Yadav et al., 24 Aug 2025, Lu et al., 19 Aug 2025, Desjardins et al., 2015).