
Hessian Diagonal Approximation

Updated 3 February 2026
  • Hessian Diagonal Approximation is a technique that estimates curvature along each coordinate by approximating the diagonal of the Hessian, providing scalable O(n) alternatives for large-scale optimization.
  • Various methods—including deterministic backpropagation, stochastic estimation, structured splitting, derivative-free, and sketching approaches—offer trade-offs between bias, variance, and computational cost.
  • These approximations are crucial for adaptive learning rates, decentralized optimization, quantization in neural networks, and efficient preconditioning in inverse problems and PDE-constrained systems.

A Hessian diagonal approximation is a family of techniques for estimating the diagonal elements of the Hessian matrix $\nabla^2 f(x)$ associated with a scalar objective $f:\mathbb{R}^n\to\mathbb{R}$. The diagonal captures curvature magnitudes along each coordinate direction and is instrumental for constructing effective preconditioners, adaptive step-size schemes, or quantization objectives within second-order or quasi-Newton optimization frameworks. Exact Hessian or Hessian-diagonal computation is intractable for large-scale problems due to $O(n^2)$ cost; diagonal approximations offer scalable $O(n)$ alternatives applicable across classical numerical optimization, deep learning, inverse problems, distributed and federated systems, and post-training quantization. Numerous computational schemes exist, including deterministic, stochastic, matrix-free, and sketching-based approaches. Their theoretical foundations, algorithmic structures, versatility, and empirical efficacy are summarized below.

1. Theoretical Foundations and Motivations

The second-order derivative (Hessian) provides local curvature information for $f(x)$. For optimization and inverse problems, diagonal entries $[\nabla^2 f(x)]_{ii}$ facilitate effective scaling and preconditioning along individual coordinates, mitigating ill-conditioning and aligning update directions with the local objective geometry. In neural networks, diagonal Hessian approximations are crucial for adaptive second-order optimizers (e.g., Adam, Apollo, AdaHesScale) and scalable quantization criteria. For distributed and federated optimization, diagonal structures support decentralized preconditioning with minimal communication overhead. In large-scale scientific computing, the low-rank-plus-diagonal structure of Hessians has motivated recent sketching-based methods for high-fidelity diagonal recovery.
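
As a minimal illustration of why the diagonal matters for preconditioning, the toy sketch below (values are made up for demonstration) rescales each coordinate of a gradient step by the exact Hessian diagonal of an ill-conditioned quadratic:

```python
import numpy as np

# Toy illustration: preconditioning a gradient step with the exact Hessian
# diagonal of an ill-conditioned quadratic f(x) = 0.5 x^T A x. Dividing each
# gradient coordinate by [grad^2 f]_ii equalizes the curvature scales.
A = np.diag([1.0, 100.0])            # condition number 100
grad = lambda x: A @ x
hess_diag = np.diag(A).copy()        # exact diagonal, known here

x = np.array([1.0, 1.0])
for _ in range(50):
    x = x - grad(x) / (hess_diag + 1e-8)   # per-coordinate Newton-like step
print(np.allclose(x, 0.0))           # converges to the minimizer at 0
```

For a diagonal quadratic the per-coordinate step is an exact Newton step, so convergence is immediate; in general the diagonal only approximates the curvature and safeguards (damping, clipping) are needed.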

2. Key Methodologies for Diagonal Approximation

A broad taxonomy of Hessian-diagonal approximation techniques is outlined below.

  • Deterministic backprop: drop off-diagonal couplings in analytic layerwise recursions; plug in the exact last-layer diagonal. Becker–LeCun (1989), HesScale (Elsayed et al., 2022, Elsayed et al., 2024).
  • Stochastic estimation: Hutchinson or Curvature Propagation (CP); probe with random vectors and average $v \odot (Hv)$. (Martens et al., 2012), Becker–LeCun (1989).
  • Structured splitting: retain only single-inclusion polarization terms; secant or quasi-Newton diagonal updates. (Watson et al., 2021, Mannel et al., 2024, Ma, 2020, Awwal et al., 2020).
  • Derivative-free: finite differences on symmetric sample sets with centered/forward schemes. (Hare et al., 2023, Jarry-Bolduc, 2021).
  • Quantization-aware: three-part diagonal (element-, kernel-, and channel-wise) Hessian for neural quantization. (Guo et al., 2022).
  • Sketching with LoRD: simultaneous low-rank and diagonal fit from matrix–vector products. (Fernandez et al., 28 Sep 2025).

2.1 Deterministic Analytic Recursion Methods

Classical approaches (e.g., Becker–LeCun 1989) propagate “curvature” through the computational graph by approximating layerwise backward recursions and discarding off-diagonal couplings in the Hessian. Improved schemes such as HesScale directly plug in the analytically exact last-layer diagonal (for cross-entropy/softmax) and propagate deterministic scalar “square-and-sum” recursions for each layer, yielding higher fidelity especially in deep networks (Elsayed et al., 2022, Elsayed et al., 2024). Neural second-order optimizers, e.g., AdaHesScale, substitute these diagonal recursions for gradient squares in Adam-type schemes.
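
The exact last-layer plug-in that such recursions rely on can be checked directly: for softmax plus cross-entropy, the Hessian of the loss with respect to the logits $z$ is $\mathrm{diag}(p) - pp^\top$, so its diagonal is $p \odot (1-p)$. A small numpy sketch (logit values illustrative):

```python
import numpy as np

# Exact last-layer Hessian diagonal for softmax + cross-entropy: the Hessian
# of the loss w.r.t. the logits z is diag(p) - p p^T, so its diagonal is
# p * (1 - p). We verify it against a central finite-difference check.
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([2.0, 0.5, -1.0])
p = softmax(z)
analytic_diag = p * (1.0 - p)

y = 0                                   # true class index (illustrative)
loss = lambda t: -np.log(softmax(t)[y])
h = 1e-4
fd_diag = np.array([
    (loss(z + h * e) - 2.0 * loss(z) + loss(z - h * e)) / h**2
    for e in np.eye(3)
])
print(np.allclose(analytic_diag, fd_diag, atol=1e-4))
```

Note that this diagonal is independent of the true class index, which is what makes the deterministic plug-in possible.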

2.2 Stochastic and Curvature Propagation Methods

Stochastic approaches, typified by the Hutchinson estimator, use random Rademacher or Gaussian vectors $v$ and average $v \odot (Hv)$ over i.i.d. samples to obtain an unbiased diagonal estimate. Curvature Propagation (CP) (Martens et al., 2012) extends this to arbitrary computational graphs, propagating random probes through both forward and backward passes to realize unbiased rank-1 estimators of the Hessian (and hence its diagonal). Multiple samples improve accuracy at increased computational cost; CP achieves lower diagonal variance than direct stochastic Hessian-vector products.
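
A minimal numpy sketch of the Hutchinson estimator, using an explicit symmetric matrix purely as a stand-in for Hessian-vector products:

```python
import numpy as np

# Sketch of the Hutchinson diagonal estimator: average v * (Hv) over random
# Rademacher probes v. Only Hessian-vector products are required; an explicit
# matrix stands in for them here purely for illustration.
rng = np.random.default_rng(0)
n = 50
Q = rng.standard_normal((n, n))
H = Q + Q.T                                    # symmetric "Hessian"

def hutchinson_diag(hvp, n, num_samples):
    est = np.zeros(n)
    for _ in range(num_samples):
        v = rng.choice([-1.0, 1.0], size=n)    # Rademacher probe, E[v v^T] = I
        est += v * hvp(v)                      # v * (Hv): unbiased for diag(H)
    return est / num_samples

approx = hutchinson_diag(lambda v: H @ v, n, 2000)
err = np.abs(approx - np.diag(H)).mean()
print(err < 1.0)                               # error shrinks as samples grow
```

The per-entry variance is driven by the off-diagonal mass in the corresponding row, which is exactly the weakness that the variance-reduced schemes discussed later target.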

2.3 Structured and Physics-Informed Splitting

In PDE-based inverse problems, as in polarization-tensor expansions, the physical structure leads naturally to diagonal Hessian approximations (Watson et al., 2021). The approach discards off-diagonal contributions corresponding to multiple-scattering and interaction terms, focusing only on single-element responses and saturation, which is analytically tractable for well-separated inclusions. Secant-type updates in iterative solvers (e.g., L-BFGS with diagonal scaling (Mannel et al., 2024), Apollo (Ma, 2020), ASDH (Awwal et al., 2020)) impose diagonalization via least-squares constraints aligned to local secant or gradient change information, with problem-adapted safeguards for positive-definiteness and regularity. Distributed and federated second-order methods split the Hessian into diagonal and off-diagonal parts, invert the former exactly, and approximate the latter via local corrections (Bajovic et al., 2015).
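
The least-change diagonal update under the weak secant condition $s^\top B s = s^\top y$ admits a closed form; the sketch below is hedged, written in the spirit of Apollo-type updates rather than reproducing the exact published algorithm, and verifies the condition numerically:

```python
import numpy as np

# Hedged sketch (Apollo-spirited, not the published algorithm): the minimal
# change to a diagonal approximation B that enforces the weak secant
# condition s^T B s = s^T y has the closed form below.
def weak_secant_diag_update(B_diag, s, y):
    s2 = s * s
    coeff = (s @ y - s2 @ B_diag) / (s2 @ s2)  # least-change multiplier
    return B_diag + coeff * s2

rng = np.random.default_rng(1)
s, y = rng.standard_normal(10), rng.standard_normal(10)
B_new = weak_secant_diag_update(np.ones(10), s, y)
print(np.isclose(s @ (B_new * s), s @ y))      # weak secant satisfied
```

Practical methods then rectify the result (e.g., absolute values with a lower clip) so the diagonal stays positive definite, matching the safeguards mentioned above.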

2.4 Derivative-Free and Sampling-Based Methods

For black-box or derivative-free optimization, finite-difference centered/forward designs recover the diagonal by evaluating $f$ at specially structured points around $x^0$ (Hare et al., 2023). Generalized centered simplex gradient (GCSG) and centered simplex Hessian diagonal (CSHD) schemes fit quadratic models to symmetric sample sets; with a "lonely" matrix (basis), second-order accuracy $\mathcal{O}(h^2)$ is achieved for the diagonal entries (Jarry-Bolduc, 2021). The cost is $2n$ function evaluations, unavoidably $O(n)$.
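
The centered scheme can be sketched directly; the helper below is an illustrative finite-difference recovery, not the GCSG/CSHD construction itself:

```python
import numpy as np

# Illustrative centered finite-difference recovery of the Hessian diagonal:
# [grad^2 f]_ii ~ (f(x + h e_i) - 2 f(x) + f(x - h e_i)) / h^2,
# costing 2n extra evaluations with O(h^2) accuracy for smooth f.
def fd_hessian_diag(f, x, h=1e-4):
    fx = f(x)
    diag = np.empty(x.size)
    for i in range(x.size):
        e = np.zeros(x.size)
        e[i] = h
        diag[i] = (f(x + e) - 2.0 * fx + f(x - e)) / h**2
    return diag

f = lambda x: x[0]**2 + 3*x[0]*x[1] + 5*x[1]**2   # exact diagonal: [2, 10]
d = fd_hessian_diag(f, np.array([0.3, -0.7]))
print(np.allclose(d, [2.0, 10.0], atol=1e-3))
```

Because the same symmetric evaluations also yield a centered simplex gradient, the diagonal comes at no extra point cost in trust-region settings.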

2.5 Quantization-Aware and Data-Free Objectives

For data-free quantization, the Hessian of trained networks is approximated by a sum of three positive semidefinite diagonal components—element-wise, kernel-wise, and channel-wise—enabling the formulation of quadratic objectives solvable without backpropagation or calibration data (Guo et al., 2022). Hardware-amenable schemes with staged “flipping” algorithms propagate per-layer and per-channel constraints ensuring stable quantization with sub-second inference-only deployment.
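
A hedged toy sketch of the separable, element-wise part of such a diagonal objective (weights and curvature values are made up; the kernel- and channel-wise coupling terms of the full method are omitted):

```python
import numpy as np

# Toy sketch of the diagonal-Hessian quantization objective: approximating
# the Hessian by a diagonal h makes the loss sum_i h_i (w_i - q_i)^2
# separable, so nearest-grid rounding is optimal per element. The kernel- and
# channel-wise terms of the full method couple elements and motivate the
# staged "flipping" refinements (not shown here).
step = 0.1
w = np.array([0.12, -0.31, 0.07])              # illustrative weights
h = np.array([4.0, 1.0, 9.0])                  # illustrative curvature weights

q_nearest = np.round(w / step) * step          # minimizes each separable term
obj = lambda q: np.sum(h * (w - q) ** 2)

# Brute-force check: moving any element to an adjacent grid point is worse.
worse = all(
    obj(q_nearest + step * d * e) >= obj(q_nearest)
    for e in np.eye(3) for d in (-1, 1)
)
print(worse)
```

Once coupling terms enter the objective, per-element rounding is no longer optimal, which is precisely where the flipping algorithms operate.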

2.6 Low-Rank Plus Diagonal Sketching

In high-dimensional settings with expensive linear operator access (deep learning or scientific simulation), the Hessian is often well-approximated by a low-rank plus diagonal (“LoRD”) structure. SKETCHLORD (Fernandez et al., 28 Sep 2025) constructs both low-rank and diagonal components by solving a nuclear-norm-regularized least-squares fit over randomly sketched Hessian–vector products. The diagonal entries are then extracted via a closed-form formula from the residual-matrix sketch, yielding significant accuracy improvement over pure Hutchinson estimators at nearly optimal cost for the LoRD model.
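
The following toy illustration (not the actual SKETCHLORD solver, which fits both components jointly via a nuclear-norm-regularized least-squares problem) shows why separating the low-rank part helps: once the factor is captured, probing the residual removes the off-diagonal mass that drives Hutchinson variance:

```python
import numpy as np

# Toy LoRD illustration: with H = U U^T + D, plain Hutchinson probes suffer
# variance from the off-diagonal mass of U U^T. If the low-rank factor has
# been recovered (here we simply assume U is known), probing the residual
# H - U U^T isolates the diagonal almost exactly.
rng = np.random.default_rng(0)
n, k = 200, 5
U = rng.standard_normal((n, k))
d_true = rng.uniform(1.0, 2.0, size=n)
H = U @ U.T + np.diag(d_true)                  # LoRD ground truth

V = rng.choice([-1.0, 1.0], size=(n, 100))     # Rademacher probes

naive = (V * (H @ V)).mean(axis=1)             # plain Hutchinson on H
R = H - U @ U.T                                # residual, low-rank removed
refined = np.diag(U @ U.T) + (V * (R @ V)).mean(axis=1)

err_naive = np.abs(naive - np.diag(H)).max()
err_refined = np.abs(refined - np.diag(H)).max()
print(err_refined < err_naive)
```

The joint fit in SKETCHLORD avoids this two-stage idealization and recovers both components from the same sketched Hessian-vector products.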

3. Algorithmic Frameworks and Practical Integration

The practical realization of diagonal Hessian approximations depends on context:

  • Neural network optimization: Diagonal recursions (HesScale/AdaHesScale) are integrated as adaptive learning-rate scaling in Adam-like update rules (Elsayed et al., 2022, Elsayed et al., 2024). Quasi-Newton methods such as Apollo update per-coordinate baselines based on weak secant constraints and rectified diagonalization (Ma, 2020). Safeguarding (rectification/clipping) ensures positive-definite scaling.
  • Inverse and imaging problems: Structured L-BFGS or ASDH schemes replace the standard scalar seed matrix with an iteratively updated diagonal, tuned via least-squares-projected secant conditions, and admit global convergence under mild regularity (Mannel et al., 2024, Awwal et al., 2020).
  • Distributed and federated learning: DQN-style algorithms harness locally computed diagonal Hessians with optional off-diagonal corrections, maintaining scalability and global–local convergence in networked architectures (Bajovic et al., 2015).
  • Error-compensation for gradient compression: Compressed SGD is combined with diagonal Hessian-aided correction to eliminate error floors and accelerate convergence, delivering near–full Hessian EC performance at a fraction of the cost (Khirirat et al., 2019).
  • Derivative-free/trust-region optimization: The diagonal approximation is seamlessly available from (centered) difference evaluations at no extra point cost if symmetric sampling is used (Hare et al., 2023, Jarry-Bolduc, 2021).
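
As a concrete sketch of the neural-network case above, an Adam-style update can substitute a Hessian-diagonal estimate for the squared gradient in its second-moment term; the names and hyperparameters below are illustrative, not the published AdaHesScale algorithm:

```python
import numpy as np

# Hedged sketch: an Adam-style step whose second-moment term tracks a
# Hessian-diagonal estimate instead of the squared gradient, in the spirit
# of AdaHesScale. Hyperparameters and names are illustrative.
def diag_hessian_adam_step(x, g, h_diag, state, lr=0.05,
                           b1=0.9, b2=0.999, eps=1e-8):
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * g           # first moment
    state["v"] = b2 * state["v"] + (1 - b2) * h_diag**2   # curvature, not g**2
    m_hat = state["m"] / (1 - b1 ** state["t"])           # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return x - lr * m_hat / (np.sqrt(v_hat) + eps)

# Ill-conditioned quadratic with known gradient and Hessian diagonal.
A = np.diag([1.0, 50.0])
x = np.array([1.0, 1.0])
state = {"m": np.zeros(2), "v": np.zeros(2), "t": 0}
for _ in range(400):
    x = diag_hessian_adam_step(x, A @ x, np.diag(A), state)
print(np.linalg.norm(x) < 0.1)
```

Because the denominator scales each coordinate by its curvature magnitude, both coordinates of the ill-conditioned quadratic contract at the same rate, which is the intended effect of the diagonal substitution.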

4. Theoretical Guarantees and Empirical Evidence

Strong convergence properties are established in diverse regimes:

  • Unbiasedness: CP (Martens et al., 2012), Hutchinson, and BL89 estimators provide unbiased diagonal approximations, with estimator standard deviation decreasing as $1/\sqrt{K}$ for $K$ samples. Diagonal approximations via analytic backprop (HesScale) are deterministic but systematically biased, though this bias is empirically small.
  • Convergence: Secant-based quasi-Newton diagonals (Apollo, ASDH, structured L-BFGS) ensure global convergence under mild assumptions. Error-compensated compressed SGD with diagonal Hessian achieves linear convergence under strong convexity and bounded error in diagonal estimation (Khirirat et al., 2019).
  • Empirical performance: HesScale exhibits lower $L^1$ error to ground-truth diagonals than both GGN and stochastic Monte Carlo approaches, yielding faster optimizer convergence in practical deep learning tasks (Elsayed et al., 2022, Elsayed et al., 2024). Diagonal preconditioners markedly reduce both iterations and CPU time in imaging inverse problems and PDE-constrained optimization (Mannel et al., 2024).
  • Sketching LoRD structure: SKETCHLORD achieves exact recovery under the idealized LoRD model for $p \geq k+4$ random projections, with empirical superiority to sequential diagonal or low-rank techniques in both error and computational efficiency (Fernandez et al., 28 Sep 2025).

5. Applications and Domain-Specific Instantiations

Significant applications of Hessian-diagonal approximation span:

  • Second-order neural network optimizers: Adaptive step-size schemes (AdaHesScale, Apollo) and improved convergence in supervised learning, reinforcement learning, and LSTMs (Elsayed et al., 2022, Elsayed et al., 2024, Ma, 2020).
  • PDE and inverse problems: Electrical impedance tomography and general PDE-constrained parameter estimation with polarization tensor–based diagonals (Watson et al., 2021), inverse imaging and registration via structured L-BFGS diagonal seeds (Mannel et al., 2024).
  • Federated/distributed optimization: DQN frameworks deliver Newton-like updates with minimal communication via diagonal inversion (Bajovic et al., 2015).
  • Gradient compression with error compensation: Linear convergence and elimination of error-floor via diagonal Hessian assistance (Khirirat et al., 2019).
  • Post-training quantization: Hessian diagonal decompositions underpin fast, data-free quantization algorithms with sub-second, inference-only deployment (Guo et al., 2022).
  • High-dimensional operator sketching: LoRD sketching provides accurate Hessian diagonals from Hessian–vector products in regimes inaccessible to either pure low-rank or diagonal approximations (Fernandez et al., 28 Sep 2025).

6. Comparative Analysis, Limitations, and Extensions

Diagonal Hessian approximation balances computational tractability and curvature fidelity. Deterministic analytic recursions (HesScale) outperform pure stochastic methods but introduce bias; randomized CP/Hutchinson estimators are unbiased but often require many samples for acceptable variance in settings with strong curvature coupling (Martens et al., 2012, Elsayed et al., 2022). Secant-based and structured L-BFGS diagonals adapt well to nonconvex and inverse problems, while derivative-free diagonals offer $O(n)$ scaling with $O(h^2)$ accuracy under proper sampling (Hare et al., 2023, Jarry-Bolduc, 2021). LoRD sketching unlocks the regime where Hessians are neither diagonally dominant nor low-rank alone (Fernandez et al., 28 Sep 2025).

Potential limitations include sensitivity to parameterization (standard basis, wavelets, adaptive bases), loss of curvature fidelity if off-diagonal coupling is strong, and the need for careful safeguarding in nonconvex/stochastic settings. Future work targets structured diagonal approximations in alternative bases, hybrid (block-diagonal plus diagonal) schemes, and data-driven or learning-based diagonal estimators.


References

  • "A polarization tensor approximation for the Hessian in iterative solvers for non-linear inverse problems" (Watson et al., 2021)
  • "HesScale: Scalable Computation of Hessian Diagonals" (Elsayed et al., 2022)
  • "Revisiting Scalable Hessian Diagonal Approximations for Applications in Reinforcement Learning" (Elsayed et al., 2024)
  • "Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization" (Ma, 2020)
  • "A structured L-BFGS method with diagonal scaling and its application to image registration" (Mannel et al., 2024)
  • "Iterative algorithm with structured diagonal Hessian approximation for solving nonlinear least squares problems" (Awwal et al., 2020)
  • "Estimating the Hessian by Back-propagating Curvature" (Martens et al., 2012)
  • "A matrix algebra approach to approximate Hessians" (Hare et al., 2023)
  • "Approximating the diagonal of a Hessian: which sample set of points should be used" (Jarry-Bolduc, 2021)
  • "Newton-like method with diagonal correction for distributed optimization" (Bajovic et al., 2015)
  • "Compressed Gradient Methods with Hessian-Aided Error Compensation" (Khirirat et al., 2019)
  • "SQuant: On-the-Fly Data-Free Quantization via Diagonal Hessian Approximation" (Guo et al., 2022)
  • "Sketching Low-Rank Plus Diagonal Matrices" (Fernandez et al., 28 Sep 2025)
