
HesScaleGN: Scalable Diagonal GGN Approximation

Updated 8 February 2026
  • HesScaleGN is a scalable, backpropagation-based approach that approximates the diagonal of the Generalized Gauss–Newton matrix to capture essential curvature for efficient second-order optimization.
  • It employs a layerwise diagonal recursion inspired by early curvature propagation methods, achieving low approximation errors with only about 1.25× the cost of standard gradient backpropagation.
  • Empirical evaluations demonstrate that HesScaleGN enhances adaptive step-size control and stability across diverse deep learning and reinforcement learning applications.

HesScaleGN is a scalable, backpropagation-based approach for approximating the diagonal of the Generalized Gauss–Newton (GGN) matrix—an established positive-semidefinite proxy for the Hessian—in neural network training and analysis. It is principally designed to combine computational efficiency (matching standard backpropagation) with curvature sensitivity suitable for second-order optimization, adaptive step-size control, and Hessian-based diagnostics at large scale. HesScaleGN is motivated by the need for scalable, per-parameter curvature approximations with minimal bias, and it overcomes the prohibitive costs of full Hessian formation or stochastic trace estimation.

1. Theoretical Foundations and Derivation

The HesScaleGN methodology builds upon classic curvature propagation schemes, specifically the layerwise diagonal recursion inspired by Becker & LeCun (1989). For a multilayer feedforward network, the full Hessian H(θ) is generally dense and expensive to form or invert. HesScaleGN restricts attention to the diagonal of the GGN matrix, which, for typical losses and architectures, captures essential positive curvature while avoiding off-diagonal structure.

Given network activations a_ℓ and weights W_ℓ, the GGN diagonal at the top (output) layer with softmax + cross-entropy is exactly

G_L[i] = q_i (1 − q_i)

where q = softmax(a_L) are the predicted probabilities. For lower layers, the recursion

G_ℓ[i] = σ′(a_ℓ[i])² ∑_{k=1}^{n_{ℓ+1}} W_{ℓ+1}[k, i]² G_{ℓ+1}[k]

propagates the diagonal curvature backward. The diagonal of the loss Hessian with respect to the weights at layer ℓ then follows from the outer product

diag(∂²ℒ/∂W_ℓ²)_{ij} = G_ℓ[i] · h_{ℓ−1}[j]².

This omits the second-derivative-of-activation (σ″(·)) terms present in the full Hessian. As a result, HesScaleGN can be interpreted as an exact backpropagation of GGN diagonals, guaranteeing positive-semidefinite curvature estimates.
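The top-layer identity can be verified numerically: for softmax with cross-entropy, the Hessian of the loss with respect to the logits is diag(q) − qqᵀ, whose diagonal is exactly q_i(1 − q_i). A minimal NumPy check:

```python
import numpy as np

rng = np.random.default_rng(0)
a_L = rng.normal(size=5)              # logits at the output layer
q = np.exp(a_L - a_L.max())
q /= q.sum()                          # q = softmax(a_L)

# Exact Hessian of cross-entropy w.r.t. the logits: diag(q) - q q^T
H = np.diag(q) - np.outer(q, q)

# HesScaleGN top-layer diagonal: G_L[i] = q_i (1 - q_i)
G_L = q * (1.0 - q)

assert np.allclose(np.diag(H), G_L)   # diagonal recovered exactly
```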

2. Implementation Workflow and Computational Complexity

HesScaleGN runs a single backward pass tightly integrated with ordinary gradient backpropagation. Key steps per batch include:

  1. Forward propagate, storing all required activations a_ℓ and h_ℓ.
  2. Compute top-layer quantities (probabilities, gradients, GGN diagonal).
  3. Backpropagate both gradients and GGN diagonals via the recurrence for each layer.
  4. For each W_ℓ, form the per-weight diagonal as the outer product G_ℓ ⊗ h_{ℓ−1}².

Pseudocode is structured as follows:

q = softmax(a_L)
delta_L = -(y - q)
G_L = q * (1 - q)
diag_W_L = outer(G_L, h_{L-1}^2)
for l in range(L-1, 0, -1):
    delta_l = (W_{l+1}^T delta_{l+1}) * σ'(a_l)
    temp = (W_{l+1}^2) @ G_{l+1}
    G_l = (σ'(a_l)^2) * temp
    diag_W_l = outer(G_l, h_{l-1}^2)
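The pseudocode above can be made concrete. The following is an illustrative NumPy sketch (not the authors' reference implementation), assuming a bias-free MLP with sigmoid hidden layers and a softmax + cross-entropy output; function and variable names are hypothetical:

```python
import numpy as np

def hesscale_gn(Ws, h0, y):
    """Joint backward pass for gradients and HesScaleGN diagonals.

    Illustrative sketch: Ws is a list of weight matrices W_1..W_L for a
    bias-free MLP with sigmoid hidden layers and softmax + cross-entropy.
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    # Forward pass, storing activations h_l (last entry holds the logits).
    hs = [h0]
    for l, W in enumerate(Ws):
        a = W @ hs[-1]
        hs.append(sigmoid(a) if l < len(Ws) - 1 else a)
    logits = hs[-1]
    q = np.exp(logits - logits.max())
    q /= q.sum()

    # Top layer: cross-entropy gradient and GGN diagonal.
    delta = q - y                  # dL/da_L = -(y - q)
    G = q * (1.0 - q)              # G_L[i] = q_i (1 - q_i)
    grads, diags = [], []
    for l in range(len(Ws) - 1, -1, -1):
        grads.append(np.outer(delta, hs[l]))
        diags.append(np.outer(G, hs[l] ** 2))
        if l > 0:
            sp = hs[l] * (1.0 - hs[l])            # sigmoid'(a_l) from h_l
            delta = (Ws[l].T @ delta) * sp
            G = (sp ** 2) * ((Ws[l] ** 2).T @ G)  # diagonal recursion
    return grads[::-1], diags[::-1]

rng = np.random.default_rng(0)
Ws = [0.5 * rng.normal(size=(4, 3)), 0.5 * rng.normal(size=(2, 4))]
grads, diags = hesscale_gn(Ws, rng.normal(size=3), np.array([1.0, 0.0]))
assert all((d >= 0).all() for d in diags)  # PSD curvature estimates
```

Note that both recursions reuse the same stored activations, which is why the pass costs only a small constant factor over gradient backpropagation.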

All vector, matrix, and elementwise operations are O(n), where n is the total parameter count. The total computational cost is approximately 1.25× that of a single standard gradient pass (measured wall-clock) (Castro-Macías et al., 2024, Elsayed et al., 2022). Required storage is O(n), dominated by the need to hold both gradient and curvature vectors.

3. Approximation Quality, Error Analysis, and Empirical Justification

Empirical assessments demonstrate that HesScaleGN achieves consistently low ℓ₁ distance between estimated and true Hessian diagonals. In controlled experiments, normalized errors for HesScaleGN are within 10% of the exact GGN diagonal, outperforming alternative scalable methods such as diagonalized block-KFAC, squared gradients (g²), MC-based AdaHessian, and full-Hessian recursions that drop off-diagonal terms (Elsayed et al., 2022, Elsayed et al., 2024).

A central justification is the high “diagonality” p(A) = ‖diag(A)‖_F² / ‖A‖_F² of the pre-activation Hessians actually encountered in trained networks, with empirical p ≈ 0.8–0.9, validating the omission of off-diagonals in typical models (Elsayed et al., 2024).
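The diagonality metric itself is straightforward to compute; a minimal sketch:

```python
import numpy as np

def diagonality(A):
    """p(A) = ||diag(A)||_F^2 / ||A||_F^2: the fraction of the squared
    Frobenius norm carried by the diagonal (1.0 for a diagonal matrix)."""
    return float(np.sum(np.diag(A) ** 2) / np.sum(A ** 2))

assert diagonality(np.eye(3)) == 1.0   # fully diagonal matrix
```

Values near 1 indicate that a diagonal approximation discards little of the matrix's Frobenius mass.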

4. Practical Applications in Optimization and Reinforcement Learning

HesScaleGN is particularly effective as a drop-in replacement for the squared-gradient moving average in adaptive optimizers, e.g., Adam. The AdaHesScaleGN algorithm maintains its second-moment estimate from Hessian-diagonal statistics, v_t = β₂ v_{t−1} + (1 − β₂) ĥ_t², yielding preconditioned, curvature-aware updates at O(n) cost.
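A minimal sketch of such an update, assuming an Adam-style optimizer whose second moment tracks the squared HesScaleGN diagonal ĥ_t (function and parameter names are hypothetical, not the authors' implementation):

```python
import numpy as np

def adahesscale_gn_step(theta, g, h_diag, m, v, t,
                        lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam-style step with the squared HesScaleGN diagonal h_diag
    replacing the squared gradient in the second moment (sketch)."""
    m = b1 * m + (1 - b1) * g              # first moment, as in Adam
    v = b2 * v + (1 - b2) * h_diag ** 2    # v_t = b2 v_{t-1} + (1-b2) h_t^2
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

On a diagonal quadratic, where the GGN diagonal equals the exact curvature, this preconditions each coordinate by its own curvature rather than by a noisy squared-gradient statistic.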

In deep reinforcement learning, HesScaleGN enables both improved optimizer convergence and robust step-size scaling via trust-region-like global step-size adaptation, α_t ← min(α, Δ / √(u_tᵀ Ĥ u_t)), where Ĥ is the HesScaleGN diagonal. This approach confers near-complete insensitivity to the initial step-size in actor-critic frameworks and stabilizes training in real-robot settings by preventing catastrophic collapse (Elsayed et al., 2024).
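The step-size rule can be sketched as follows, assuming a diagonal curvature estimate so that u_tᵀĤu_t reduces to a curvature-weighted sum of squares (names are illustrative):

```python
import numpy as np

def trust_region_alpha(u, h_diag, alpha, delta, eps=1e-12):
    """alpha_t = min(alpha, delta / sqrt(u^T H_hat u)) with a diagonal
    curvature estimate H_hat (illustrative sketch of the rule)."""
    quad = float(u @ (h_diag * u))    # u^T H_hat u for diagonal H_hat
    return min(alpha, delta / (np.sqrt(max(quad, 0.0)) + eps))
```

The cap Δ bounds the quadratic-model magnitude of each update u, so a poorly chosen base step size α is clipped rather than destabilizing training.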

5. Empirical Performance and Scalability

Extensive results on supervised (CIFAR-10/100, DeepOBS benchmarks) and reinforcement learning platforms (MuJoCo, real-world UR-Reacher) show that AdaHesScaleGN consistently achieves faster convergence, greater stability, and often higher final returns than first-order and less-accurate second-order alternatives (e.g., AdaHessian, Hutchinson, BL89 variants). These empirical results hold across a range of architectures (deep MLPs, convolutional networks) and model sizes (Elsayed et al., 2024, Elsayed et al., 2022).

Per-iteration computation is modest: for deep MLPs up to 128 layers and 512 units, update time remains within 1.3× that of Adam. For small and medium networks, the overhead is negligible in practical training runs.

6. Foundations for Large-Scale Spectral Analysis

At foundation-model scale, HesScaleGN (and more generally, GGN-diagonal-based methods) are fundamental tools for Hessian spectral analysis, trace estimation, and curvature diagnostics under sharded training. Finite-difference Hessian-vector products and stochastic Lanczos quadrature, compatible with Fully Sharded Data Parallelism, have enabled the first spectral density measurements on models up to 100B parameters. Critical findings include that widely used block-diagonal Hessian approximations (e.g., by K-FAC) can incur order-one relative errors and spurious alignment at LLM scale, underscoring the necessity of faithful diagonal or full-operator probes (Granziol et al., 31 Jan 2026).
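The stochastic trace estimation referenced above is typically Hutchinson's estimator, tr(H) ≈ E[vᵀHv] over Rademacher vectors v, evaluated through Hessian-vector products rather than an explicit matrix. A sketch (the explicit test matrix below stands in for an HVP; names are illustrative):

```python
import numpy as np

def hutchinson_trace(matvec, dim, num_samples, rng):
    """Hutchinson estimator: tr(H) ~ mean of v^T (H v) over Rademacher v.
    In practice `matvec` would be a Hessian-vector product (sketch)."""
    total = 0.0
    for _ in range(num_samples):
        v = rng.choice([-1.0, 1.0], size=dim)
        total += v @ matvec(v)
    return total / num_samples
```

Because each probe needs only one matrix-vector product, the same pattern scales to implicit Hessians under sharded training.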

7. Limitations, Implications, and Relationship to Other Methods

HesScaleGN provides a high-quality, positive-semidefinite diagonal approximation at minimal cost and is exact for piecewise-linear activations. When activation second derivatives matter (e.g., for smooth activations), a small bias remains relative to the true Hessian; as established in layerwise analyses and empirical tests, this bias is typically negligible for optimization in practice. Unlike full-Hessian or MC-based diagonal estimators, HesScaleGN avoids O(n²) time and memory and does not require stochastic averaging.

No direct connection exists between HesScaleGN and methods focused on scalable graph neural networks (e.g., ScaleGNN or hypothetical "HesScaleGN" GNN variants). The nomenclature "HesScaleGN" is specific to scalable diagonal Gauss–Newton curvature for parameter-space optimization and is not used in GNN system design (Li et al., 22 Apr 2025).


