HesScaleGN: Scalable Diagonal GGN Approximation
- HesScaleGN is a scalable, backpropagation-based approach that approximates the diagonal of the Generalized Gauss–Newton matrix to capture essential curvature for efficient second-order optimization.
- It employs a layerwise diagonal recursion inspired by early curvature propagation methods, achieving low approximation errors with only about 1.25× the cost of standard gradient backpropagation.
- Empirical evaluations demonstrate that HesScaleGN enhances adaptive step-size control and stability across diverse deep learning and reinforcement learning applications.
HesScaleGN is a scalable, backpropagation-based approach for approximating the diagonal of the Generalized Gauss–Newton (GGN) matrix—an established positive-semidefinite proxy for the Hessian—in neural network training and analysis. It is principally designed to combine computational efficiency (matching standard backpropagation) with curvature sensitivity suitable for second-order optimization, adaptive step-size control, and Hessian-based diagnostics at large scale. HesScaleGN is motivated by the need for scalable, per-parameter curvature approximations with minimal bias, and it overcomes the prohibitive costs of full Hessian formation or stochastic trace estimation.
1. Theoretical Foundations and Derivation
The HesScaleGN methodology builds upon classic curvature propagation schemes, specifically the layerwise diagonal recursion inspired by Becker & LeCun (1989). For a multilayer feedforward network, the full Hessian is generally dense and expensive to form or invert. HesScaleGN restricts attention to the diagonal of the GGN matrix, which, for typical losses and architectures, captures essential positive curvature while avoiding off-diagonal structure.
Given network activations $h_l = \sigma(a_l)$ with pre-activations $a_l = W_l h_{l-1}$, the GGN diagonal at the top (output) layer with softmax + cross-entropy is exactly

$$G_L = q \odot (1 - q),$$

where $q = \mathrm{softmax}(a_L)$ are the predicted probabilities. For lower layers, the recursion

$$G_l = \sigma'(a_l)^{\odot 2} \odot \left( (W_{l+1}^{\odot 2})^{\top} G_{l+1} \right)$$

propagates the diagonal curvature backward, where $\odot$ and $(\cdot)^{\odot 2}$ denote the elementwise product and elementwise square. The diagonal of the curvature for the weight matrix $W_l$ at layer $l$ is then given by the outer product

$$\operatorname{diag}(H_{W_l}) = G_l \, (h_{l-1}^{\odot 2})^{\top}.$$

This omits the second-derivative-of-activation ($\sigma''(a_l)$) terms present in the full Hessian recursion. As a result, HesScaleGN can be interpreted as an exact backpropagation of GGN diagonals, guaranteeing positive semi-definite curvature estimates.
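The exactness of the top-layer diagonal is easy to verify numerically: for softmax + cross-entropy, the pre-activation Hessian is $\operatorname{diag}(q) - qq^{\top}$, so its diagonal equals $q \odot (1 - q)$. A minimal NumPy check (variable names are illustrative):

```python
import numpy as np

# For softmax + cross-entropy, the Hessian of the loss w.r.t. the logits
# is diag(q) - q q^T, so its diagonal is exactly q * (1 - q).
rng = np.random.default_rng(0)
logits = rng.normal(size=5)

q = np.exp(logits - logits.max())
q /= q.sum()                       # softmax probabilities

H = np.diag(q) - np.outer(q, q)    # full pre-activation Hessian
G_L = q * (1 - q)                  # HesScaleGN top-layer diagonal

assert np.allclose(np.diag(H), G_L)
```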
2. Implementation Workflow and Computational Complexity
HesScaleGN runs a single backward pass tightly integrated with ordinary gradient backpropagation. Key steps per batch include:
- Forward propagate, storing all required activations $h_l$ and pre-activations $a_l$.
- Compute top-layer quantities (probabilities $q$, gradients, GGN diagonal $G_L$).
- Backpropagate both gradients and GGN diagonals via the layerwise recurrence.
- For each layer $l$, form the weight diagonal by computing the outer product $G_l \, (h_{l-1}^{\odot 2})^{\top}$.
Pseudocode is structured as follows:

```
q = softmax(a_L)                      # output probabilities
∂_L = -(y - q)                        # output-layer gradient
G_L = q * (1 - q)                     # top-layer GGN diagonal
diag_W_L = outer(G_L, h_{L-1}^2)
for l in range(L-1, 0, -1):
    ∂_l = (W_{l+1}^T ∂_{l+1}) * σ'(a_l)     # gradient backprop
    temp = (W_{l+1}^2)^T @ G_{l+1}          # squared weights, transposed
    G_l = (σ'(a_l)^2) * temp                # layer-l GGN diagonal
    diag_W_l = outer(G_l, h_{l-1}^2)
```
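As a concrete sketch, the full recursion can be implemented in a few lines of NumPy for a tanh MLP with a softmax + cross-entropy head. Function and variable names here are our own, not from the reference implementations:

```python
import numpy as np

# Illustrative NumPy sketch of the HesScaleGN backward pass for a small MLP
# with tanh hidden layers and a softmax + cross-entropy output layer.

def hesscale_gn(Ws, x, y):
    """Return (gradients, GGN-diagonal estimates) for each weight matrix."""
    # Forward pass, storing pre-activations a_l and activations h_l.
    hs, pre = [x], []
    for W in Ws[:-1]:
        a = W @ hs[-1]
        pre.append(a)
        hs.append(np.tanh(a))
    a_L = Ws[-1] @ hs[-1]
    q = np.exp(a_L - a_L.max()); q /= q.sum()        # softmax

    # Top layer: gradient and GGN diagonal.
    g = q - y                                        # dL/da_L
    G = q * (1 - q)                                  # GGN diagonal at output
    grads = [np.outer(g, hs[-1])]
    diags = [np.outer(G, hs[-1] ** 2)]

    # Backward recursion through hidden layers.
    for l in range(len(Ws) - 2, -1, -1):
        sp = 1.0 - np.tanh(pre[l]) ** 2              # sigma'(a_l) for tanh
        g = (Ws[l + 1].T @ g) * sp                   # gradient backprop
        G = (sp ** 2) * ((Ws[l + 1] ** 2).T @ G)     # diagonal curvature backprop
        grads.insert(0, np.outer(g, hs[l]))
        diags.insert(0, np.outer(G, hs[l] ** 2))
    return grads, diags

rng = np.random.default_rng(0)
Ws = [rng.normal(scale=0.5, size=(8, 4)), rng.normal(scale=0.5, size=(3, 8))]
x = rng.normal(size=4)
y = np.eye(3)[1]                                     # one-hot target

grads, diags = hesscale_gn(Ws, x, y)
assert all(d.shape == W.shape for d, W in zip(diags, Ws))
assert all((d >= 0).all() for d in diags)            # PSD diagonal estimates
```

Note that the curvature recursion reuses the activation buffers of the gradient pass, which is why the memory overhead is limited to one extra curvature vector per layer.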
All vector, matrix, and elementwise operations are $O(n)$, where $n$ is the total parameter count. The total computational cost is approximately 1.25× that of a single standard gradient pass (measured wall-clock) (Castro-Macías et al., 2024, Elsayed et al., 2022). Required storage is $O(n)$, dominated by the need to hold both gradients and curvature vectors.
3. Approximation Quality, Error Analysis, and Empirical Justification
Empirical assessments demonstrate that HesScaleGN achieves consistently low $L^1$-distance between estimated and true Hessian diagonals. In controlled experiments, normalized errors for HesScaleGN closely track the exact GGN diagonal and outperform alternative scalable methods, such as diagonalized block-KFAC, $g^2$ (squared gradients), MC-based AdaHessian, and full Hessian recursions that drop off-diagonal terms (Elsayed et al., 2022, Elsayed et al., 2024).
A central justification is the high "diagonality" of actual pre-activation Hessians encountered in trained networks, with empirical diagonality measures reaching roughly $0.9$, validating the omission of off-diagonals in typical models (Elsayed et al., 2024).
4. Practical Applications in Optimization and Reinforcement Learning
HesScaleGN is particularly effective as a drop-in replacement for the squared-gradient moving average in adaptive optimizers, e.g., Adam. The AdaHesScaleGN algorithm maintains Adam-style moment estimates in which the second moment tracks the squared HesScaleGN diagonal $\hat{h}_t$ rather than the squared gradient:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, \hat{h}_t^{\,2},$$

yielding preconditioned, curvature-aware updates at $O(n)$ cost.
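A sketch of one such update step, following the Adam template with the second moment driven by the curvature diagonal (hyperparameter names mirror Adam; this is illustrative, not the authors' reference code):

```python
import numpy as np

# Sketch of one AdaHesScaleGN-style update: Adam's second-moment buffer
# tracks the squared HesScaleGN diagonal h_hat instead of the squared
# gradient. Names and defaults follow the Adam convention.

def ada_hesscale_gn_step(theta, grad, h_hat, m, v, t,
                         alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * h_hat ** 2     # curvature, not grad**2
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.zeros(4)
m, v = np.zeros(4), np.zeros(4)
grad = np.array([0.1, -0.2, 0.3, 0.0])
h_hat = np.array([0.5, 0.5, 0.5, 0.5])           # HesScaleGN diagonal estimate
theta, m, v = ada_hesscale_gn_step(theta, grad, h_hat, m, v, t=1)
```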
In deep reinforcement learning, HesScaleGN enables both improved optimizer convergence and robust step-size scaling via trust-region-like global step-size adaptation driven by the HesScaleGN curvature diagonal. This approach confers near-complete insensitivity to the initial step size in actor-critic frameworks and stabilizes training in real-robot settings by preventing catastrophic collapse (Elsayed et al., 2024).
5. Empirical Performance and Scalability
Extensive results on supervised (CIFAR-10/100, DeepOBS benchmarks) and reinforcement learning platforms (MuJoCo, real-world UR-Reacher) show that AdaHesScaleGN consistently achieves faster convergence, greater stability, and often higher final returns than first-order and less-accurate second-order alternatives (e.g., AdaHessian, Hutchinson, BL89 variants). These empirical results hold across a range of architectures (deep MLPs, convolutional networks) and model sizes (Elsayed et al., 2024, Elsayed et al., 2022).
Per-iteration computation is modest: for deep MLPs up to 128 layers and 512 units, update time remains within a small constant factor of Adam's. For small and medium networks, the overhead is negligible in practical training runs.
6. Foundations for Large-Scale Spectral Analysis
At foundation-model scale, HesScaleGN (and more generally, GGN-diagonal-based methods) are fundamental tools for Hessian spectral analysis, trace estimation, and curvature diagnostics under sharded training. Finite-difference Hessian-vector products and stochastic Lanczos quadrature, compatible with Fully Sharded Data Parallelism, have enabled the first spectral density measurements on models up to 100B parameters. Critical findings include that widely used block-diagonal Hessian approximations (e.g., by K-FAC) can incur order-one relative errors and spurious alignment at LLM scale, underscoring the necessity of faithful diagonal or full-operator probes (Granziol et al., 31 Jan 2026).
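To illustrate the matrix-free primitives mentioned above, the following sketch combines Hutchinson's trace estimator with finite-difference Hessian-vector products on a synthetic quadratic loss whose true trace is known; all names are illustrative, and real foundation-model probes replace the analytic gradient with a sharded backward pass:

```python
import numpy as np

# Hutchinson trace estimation via finite-difference Hessian-vector products.
# We use a quadratic loss L(x) = 0.5 x^T A x so the true trace is known and
# the gradient A x is analytic.
rng = np.random.default_rng(0)
n = 50
B = rng.normal(size=(n, n))
A = B @ B.T / n                      # SPD Hessian; trace(A) is the target

grad = lambda x: A @ x               # analytic gradient of the quadratic

def hvp(x, v, eps=1e-5):
    # Finite-difference HVP: (grad(x + eps*v) - grad(x)) / eps ≈ H v
    return (grad(x + eps * v) - grad(x)) / eps

x = rng.normal(size=n)
samples = []
for _ in range(200):
    v = rng.choice([-1.0, 1.0], size=n)      # Rademacher probe
    samples.append(v @ hvp(x, v))            # E[v^T H v] = trace(H)
est = np.mean(samples)
print(f"true trace {np.trace(A):.3f}, Hutchinson estimate {est:.3f}")
```

Stochastic Lanczos quadrature builds on the same HVP primitive, using it inside a Lanczos iteration to recover spectral densities rather than just the trace.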
7. Limitations, Implications, and Relationship to Other Methods
HesScaleGN provides a high-quality, positive semi-definite diagonal approximation at minimal cost and is exact for piecewise-linear activations. When activation second derivatives matter (e.g., for smooth activations), a small bias remains relative to the true Hessian; as established in layerwise analyses and empirical tests, this is typically negligible for optimization purposes. Unlike full-Hessian or MC-based diagonal estimators, HesScaleGN avoids $O(n^2)$ time and memory and does not require stochastic averaging.
No direct connection exists between HesScaleGN and methods focused on scalable graph neural networks (e.g., ScaleGNN or hypothetical "HesScaleGN" GNN variants). The nomenclature "HesScaleGN" is specific to scalable diagonal Gauss–Newton curvature for parameter-space optimization and is not used in GNN system design (Li et al., 22 Apr 2025).
References:
- "Revisiting Scalable Hessian Diagonal Approximations for Applications in Reinforcement Learning" (Elsayed et al., 2024)
- "HesScale: Scalable Computation of Hessian Diagonals" (Elsayed et al., 2022)
- "Hessian Spectral Analysis at Foundation Model Scale" (Granziol et al., 31 Jan 2026)