- The paper establishes a theoretical framework linking binary parameter masks and the top Hessian eigenspace via representation on the Stiefel manifold and Grassmannian metrics.
- It identifies an interpretable ‘overlap’ metric that quantifies cosine similarity between the subspaces, significantly exceeding random chance.
- It introduces SEIGH, a scalable, matrix-free algorithm using sketched SVD that approximates top eigenpairs with minimal Hessian-vector products, validated on small to large-scale networks.
This paper investigates the connection between two phenomena observed early in the training of deep neural networks: the emergence and stabilization of effective parameter magnitude masks (used for pruning) and the crystallization and stabilization of the top eigenspace of the loss Hessian (which captures most of the loss curvature).
The core contributions are:
- Connecting Masks and Eigenspaces: The paper establishes a theoretical framework to compare binary parameter masks (selecting the top-k parameters by magnitude) and the top-k Hessian eigenspace. It shows that both can be represented as rank-k orthogonal matrices residing on the same Stiefel manifold. This allows their spans (subspaces) to be compared using Grassmannian metrics.
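A minimal illustrative sketch (not the paper's code) of this representation: a top-k magnitude mask can be written as a D x k matrix whose columns are standard basis vectors of the selected coordinates, which has orthonormal columns and therefore lies on the Stiefel manifold. The function name `mask_to_stiefel` is hypothetical.

```python
import numpy as np

def mask_to_stiefel(params, k):
    """Top-k magnitude mask as a D x k matrix with orthonormal
    columns (standard basis vectors of the selected coordinates)."""
    D = params.size
    idx = np.argsort(np.abs(params))[::-1][:k]  # top-k indices by magnitude
    M = np.zeros((D, k))
    M[idx, np.arange(k)] = 1.0
    return M

params = np.array([0.1, -2.0, 0.5, 3.0, -0.2])
M = mask_to_stiefel(params, 2)
# Columns are orthonormal: M^T M = I_k, so span(M) is a point
# on the same Grassmannian as a top-k Hessian eigenbasis.
print(np.allclose(M.T @ M, np.eye(2)))  # True
```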
- Identifying the Overlap Metric: After reviewing several Grassmannian metrics (geodesic distance, chordal norm, projection norm, Fubini-Study), the paper identifies the overlap metric as the most suitable for this comparison.
$\mathrm{overlap}(M_1, M_2) = \frac{1}{k} \lVert M_1^\top M_2 \rVert_F^2 \in [0, 1]$
The overlap metric is interpretable (it measures the cosine similarity between the spans), stable across dimensions, related to other common metrics such as Intersection over Union (IoU) and Hamming distance, and has a simple, theoretically grounded random baseline expectation of $k/D$ (where k is the subspace dimension and D is the total number of parameters).
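A short NumPy sketch of the overlap metric and its random baseline (illustrative only; `overlap` here is a direct transcription of the formula above, and the random-subspace comparison assumes Gaussian matrices orthonormalized via QR):

```python
import numpy as np

def overlap(M1, M2):
    """Overlap between the spans of two D x k matrices with
    orthonormal columns: (1/k) * ||M1^T M2||_F^2, in [0, 1]."""
    k = M1.shape[1]
    return np.linalg.norm(M1.T @ M2, "fro") ** 2 / k

# Two independent random k-dim subspaces of R^D have expected
# overlap approximately k/D.
D, k = 1000, 20
rng = np.random.default_rng(0)
Q1, _ = np.linalg.qr(rng.standard_normal((D, k)))
Q2, _ = np.linalg.qr(rng.standard_normal((D, k)))
print(overlap(Q1, Q1))        # identical subspaces -> 1.0
print(overlap(Q1, Q2), k / D) # both close to 0.02
```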
- Scalable Hessian Eigendecomposition (SEIGH): Computing the overlap requires the top-k eigenbasis of the Hessian, which is computationally infeasible for large networks using standard methods. To address this, the paper develops SEIGH (Sketched EIGHendecomposition), a matrix-free algorithm based on sketched Singular Value Decomposition (SVD) techniques. SEIGH leverages the Hessian's symmetry and uses randomized measurements (Hessian-vector products) to approximate the top-k eigenpairs efficiently.
- Algorithm: SEIGH adapts the single-pass sketched SVD algorithm of Tropp et al. (2019) for Hermitian matrices. It requires only O(k) Hessian-vector products (HVPs) and O(Dk) memory, making it scalable. It uses uncorrelated inner and outer random measurement matrices and oversampling for numerical stability and accuracy.
- Implementation: The algorithm involves performing n_outer HVPs with a random matrix Ω_O, performing n_inner HVPs with another random matrix Ω_I, orthogonalizing the outer measurements via QR decomposition to obtain a basis Q, and then solving a smaller n_outer × n_outer eigenproblem involving Q and the inner products.
```python
# Simplified SEIGH sketch. The full algorithm replaces the extra HVPs
# in the core step with a least-squares solve (via pseudo-inverses)
# against the inner sketch; see Tropp et al. (2019).
import numpy as np

def seigh(hessian_hvp, D, k, n_outer, n_inner, seed=0):
    rng = np.random.default_rng(seed)
    # Uncorrelated random measurement matrices
    Omega_O = rng.standard_normal((D, n_outer))
    Omega_I = rng.standard_normal((D, n_inner))
    # Outer measurements: n_outer HVPs
    Y_O = hessian_hvp(Omega_O)            # D x n_outer
    # Inner measurements: n_inner HVPs (consumed by the full
    # algorithm's core least-squares solve; unused in this sketch)
    Y_I = hessian_hvp(Omega_I)            # D x n_inner
    # Orthogonalize the outer measurements
    Q, _ = np.linalg.qr(Y_O)              # Q is D x n_outer
    # Core eigenproblem (simplified view: extra HVPs with Q)
    core = Q.T @ hessian_hvp(Q)           # n_outer x n_outer
    core = (core + core.T) / 2            # enforce symmetry
    evals, evecs = np.linalg.eigh(core)
    top = np.argsort(evals)[::-1][:k]     # eigh returns ascending order
    # Recover the top-k eigenpairs of H
    return evals[top], Q @ evecs[:, top]  # (k,), D x k
```
- Experimental Validation:
- Small Scale (MNIST): On a small network (≈ 7k parameters) trained on downsampled MNIST, the paper first verifies the early crystallization phenomena for both magnitude masks and Hessian eigenspaces. It then computes the exact overlap and shows that the SEIGH approximation closely tracks it. The measured overlap is consistently higher than the random-chance baseline ($k/D$).
- Large Scale (CIFAR-10, CIFAR-100, ImageNet): Using SEIGH, the study computes the approximate overlap for larger networks (up to ResNet-18 with >11M parameters) and larger k (up to 1500). The results confirm that the overlap between magnitude masks and top Hessian eigenspaces is substantially higher than random chance across different models, datasets, training stages, and values of k. The effect is more pronounced for larger networks, with the overlap up to 1000 times higher than the baseline for ResNet-18 on ImageNet.
Key Finding: There is a significant, non-random alignment between the parameters with the largest magnitudes and the directions of highest curvature (top Hessian eigenvectors) in deep neural networks, especially early in training and for larger models.
Practical Implications:
- Understanding Network Structure: Provides insight into the relationship between parameter importance (magnitude) and loss landscape geometry (curvature).
- Algorithm Development: The connection suggests potential for using cheap parameter magnitude information to approximate expensive Hessian properties for tasks like:
- Optimization: Designing adaptive optimizers.
- Pruning: Developing more principled pruning methods beyond simple magnitude.
- Uncertainty Quantification: Improving methods like Laplace approximation.
- Analysis Tool: SEIGH enables Hessian spectral analysis at unprecedented scales (k > 1000 eigenpairs for models with >10^7 parameters).
Limitations: While SEIGH scales much better than previous methods, computing thousands of HVPs and storing k dense vectors still presents a computational challenge for extremely large models. The paper focuses on analysis rather than providing a ready-to-use downstream application incorporating this finding.