
Connecting Parameter Magnitudes and Hessian Eigenspaces at Scale using Sketched Methods

Published 20 Apr 2025 in cs.LG and stat.ML | arXiv:2504.14701v1

Abstract: Recently, it has been observed that when training a deep neural net with SGD, the majority of the loss landscape's curvature quickly concentrates in a tiny top eigenspace of the loss Hessian, which remains largely stable thereafter. Independently, it has been shown that successful magnitude pruning masks for deep neural nets emerge early in training and remain stable thereafter. In this work, we study these two phenomena jointly and show that they are connected: We develop a methodology to measure the similarity between arbitrary parameter masks and Hessian eigenspaces via Grassmannian metrics. We identify overlap as the most useful such metric due to its interpretability and stability. To compute overlap, we develop a matrix-free algorithm based on sketched SVDs that allows us to compute over 1000 Hessian eigenpairs for nets with over 10M parameters --an unprecedented scale by several orders of magnitude. Our experiments reveal an overlap between magnitude parameter masks and top Hessian eigenspaces consistently higher than chance-level, and that this effect gets accentuated for larger network sizes. This result indicates that top Hessian eigenvectors tend to be concentrated around larger parameters, or equivalently, that larger parameters tend to align with directions of larger loss curvature. Our work provides a methodology to approximate and analyze deep learning Hessians at scale, as well as a novel insight on the structure of their eigenspace.

Summary

  • The paper establishes a theoretical framework linking binary parameter masks and the top Hessian eigenspace via representation on the Stiefel manifold and Grassmannian metrics.
  • It identifies an interpretable ‘overlap’ metric that quantifies the cosine similarity between the two subspaces, and finds that it significantly exceeds its random-chance baseline.
  • It introduces SEIGH, a scalable, matrix-free algorithm using sketched SVD that approximates the top eigenpairs from O(k) Hessian-vector products, validated on small- to large-scale networks.

This paper investigates the connection between two phenomena observed early in the training of deep neural networks: the emergence and stabilization of effective parameter magnitude masks (used for pruning) and the crystallization and stabilization of the top eigenspace of the loss Hessian (which captures most of the loss curvature).

The core contributions are:

  1. Connecting Masks and Eigenspaces: The paper establishes a theoretical framework to compare binary parameter masks (selecting the top-k parameters by magnitude) and the top-k Hessian eigenspace. It shows that both can be represented as D x k matrices with orthonormal columns residing on the same Stiefel manifold. This allows their spans (k-dimensional subspaces) to be compared using Grassmannian metrics.
  2. Identifying the Overlap Metric: After reviewing several Grassmannian metrics (like geodesic distance, chordal norm, projection norm, Fubini-Study), the paper identifies the overlap metric as the most suitable for this comparison.

    $\mathrm{overlap}(U_1, U_2) = \frac{1}{k} \lVert U_1^T U_2 \rVert_F^2 \in [0, 1]$

    The overlap metric is interpretable (it measures the average cosine similarity between the spans), stable across dimensions, related to other common metrics like Intersection over Union (IoU) and Hamming distance, and has a simple, theoretically grounded random-baseline expectation of k/D (where k is the subspace dimension and D is the total number of parameters).

  3. Scalable Hessian Eigendecomposition (SEIGH): Computing the overlap requires the top-k eigenbasis of the Hessian, which is computationally infeasible for large networks using standard methods. To address this, the paper develops SEIGH (a sketched eigendecomposition), a matrix-free algorithm based on sketched Singular Value Decomposition (SVD) techniques. SEIGH leverages the Hessian's symmetry and uses randomized measurements (Hessian-vector products) to approximate the top-k eigenpairs efficiently.
    • Algorithm: SEIGH adapts the single-pass sketched SVD algorithm of Tropp et al. (2019) to Hermitian matrices. It requires only O(k) Hessian-vector products (HVPs) and O(Dk) memory, making it scalable. It uses uncorrelated inner and outer random measurement matrices and oversampling for numerical stability and accuracy.
    • Implementation: The algorithm performs n_outer HVPs with a random matrix Ω_O and n_inner HVPs with another random matrix Ω_I, orthogonalizes the outer measurements via a QR decomposition to obtain a basis Q, and then solves a smaller n_outer x n_outer eigenproblem involving Q and the inner measurements.
      # Simplified SEIGH sketch (matrix-free; hessian_hvp(V) returns H @ V
      # via Hessian-vector products, one per column of V)
      import numpy as np

      def seigh(hessian_hvp, D, k, n_outer, n_inner, seed=0):
          rng = np.random.default_rng(seed)
          # Uncorrelated outer and inner random measurement matrices;
          # oversampling (n_inner > n_outer > k) aids stability and accuracy
          Omega_O = rng.standard_normal((D, n_outer))
          Omega_I = rng.standard_normal((D, n_inner))

          # Measurements: n_outer + n_inner HVPs, a single pass over H
          Y_O = hessian_hvp(Omega_O)  # D x n_outer
          Y_I = hessian_hvp(Omega_I)  # D x n_inner

          # Orthonormal basis for the range of the outer measurements
          Q, _ = np.linalg.qr(Y_O)    # Q is D x n_outer

          # Core least-squares solve with a pseudo-inverse: seek the small
          # matrix C with H ~= Q C Q^T. By symmetry of H, Omega_I^T H = Y_I^T,
          # so (Omega_I^T Q) C ~= Y_I^T Q.
          C = np.linalg.pinv(Omega_I.T @ Q) @ (Y_I.T @ Q)
          C = 0.5 * (C + C.T)         # symmetrize the n_outer x n_outer core

          # Small eigenproblem; keep the top-k pairs (largest eigenvalues)
          eigvals, eigvecs = np.linalg.eigh(C)
          order = np.argsort(eigvals)[::-1][:k]
          return eigvals[order], Q @ eigvecs[:, order]
  4. Experimental Validation:
    • Small Scale (MNIST): On a small network (≈ 7k parameters) trained on downsampled MNIST, the paper first verifies the early crystallization phenomena for both magnitude masks and Hessian eigenspaces. It then computes the exact overlap and shows that the SEIGH approximation closely tracks it. The measured overlap is consistently higher than the random-chance baseline (k/D).
    • Large Scale (CIFAR-10, CIFAR-100, ImageNet): Using SEIGH, the study computes the approximate overlap for larger networks (up to ResNet-18 with > 11M parameters) and larger k (up to 1500). The results confirm that the overlap between magnitude masks and top Hessian eigenspaces is substantially higher than random chance across different models, datasets, training stages, and values of k. This effect is more pronounced for larger networks, with the overlap being up to 1000 times higher than the baseline for ResNet-18 on ImageNet.
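
To make the comparison in points 1–2 above concrete, here is a minimal NumPy sketch (the helper names `mask_basis` and `overlap` are illustrative, not from the paper): the top-k magnitude mask is embedded as a D x k matrix of standard-basis columns, and the overlap with another orthonormal basis is (1/k)·||U1^T U2||_F^2, which lands near the k/D baseline for a random k-dimensional subspace.

```python
import numpy as np

def mask_basis(theta, k):
    """Top-k magnitude mask as a D x k matrix with orthonormal columns:
    one standard-basis column per selected parameter index."""
    D = theta.size
    idx = np.argsort(-np.abs(theta))[:k]
    M = np.zeros((D, k))
    M[idx, np.arange(k)] = 1.0
    return M

def overlap(U1, U2):
    """Grassmannian overlap between span(U1) and span(U2), in [0, 1]."""
    k = U1.shape[1]
    return np.linalg.norm(U1.T @ U2, "fro") ** 2 / k

rng = np.random.default_rng(0)
D, k = 1000, 20
theta = rng.standard_normal(D)
M = mask_basis(theta, k)
U, _ = np.linalg.qr(rng.standard_normal((D, k)))  # random k-dim subspace
print(overlap(M, M))  # identical subspaces give overlap 1
print(overlap(M, U))  # random subspace: close to the k/D = 0.02 baseline
```

In the paper's experiments, U would instead hold the top-k Hessian eigenvectors approximated by SEIGH.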

Key Finding: There is a significant, non-random alignment between the parameters with the largest magnitudes and the directions of highest curvature (top Hessian eigenvectors) in deep neural networks, especially early in training and for larger models.

Practical Implications:

  • Understanding Network Structure: Provides insight into the relationship between parameter importance (magnitude) and loss landscape geometry (curvature).
  • Algorithm Development: The connection suggests potential for using cheap parameter magnitude information to approximate expensive Hessian properties for tasks like:
    • Optimization: Designing adaptive optimizers.
    • Pruning: Developing more principled pruning methods beyond simple magnitude.
    • Uncertainty Quantification: Improving methods like Laplace approximation.
  • Analysis Tool: SEIGH enables Hessian spectral analysis at unprecedented scales (k > 1000 eigenpairs for models with > 10^7 parameters).

Limitations: While SEIGH scales much better than previous methods, computing thousands of HVPs and storing k dense D-dimensional vectors still presents a computational challenge for extremely large models. The paper focuses on analysis rather than providing a ready-to-use downstream application incorporating this finding.
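
The HVP primitive that SEIGH relies on never forms the D x D Hessian explicitly; in autodiff frameworks it is a forward-over-reverse composition. A minimal JAX illustration (a generic sketch of the standard technique, not the paper's code):

```python
import jax
import jax.numpy as jnp

def hvp(loss_fn, params, v):
    # Hessian-vector product H(params) @ v via forward-over-reverse autodiff:
    # differentiate the gradient map along direction v, never forming H.
    return jax.jvp(jax.grad(loss_fn), (params,), (v,))[1]

# Sanity check on a quadratic loss whose Hessian is a known matrix A
A = jnp.array([[2.0, 1.0],
               [1.0, 3.0]])
loss = lambda x: 0.5 * x @ A @ x   # gradient A x, Hessian A
v = jnp.array([1.0, 0.0])
print(hvp(loss, jnp.zeros(2), v))  # equals A @ v
```

Applying this to each column of Ω_O and Ω_I yields the D x n_outer and D x n_inner measurement matrices that SEIGH consumes.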
