- The paper establishes a theoretical framework linking binary parameter masks and the top Hessian eigenspace via representation on the Stiefel manifold and Grassmannian metrics.
- It identifies an interpretable ‘overlap’ metric that quantifies cosine similarity between the subspaces, significantly exceeding random chance.
- It introduces SEIGH, a scalable, matrix-free algorithm using sketched SVD that approximates top eigenpairs with minimal Hessian-vector products, validated on small to large-scale networks.
This paper investigates the connection between two phenomena observed early in the training of deep neural networks: the emergence and stabilization of effective parameter magnitude masks (used for pruning) and the crystallization and stabilization of the top eigenspace of the loss Hessian (which captures most of the loss curvature).
The core contributions are:
- Connecting Masks and Eigenspaces: The paper establishes a theoretical framework to compare binary parameter masks (selecting the top-k parameters by magnitude) and the top-k Hessian eigenspace. It shows that both can be represented as rank-k orthogonal matrices residing on the same Stiefel manifold. This allows their spans (subspaces) to be compared using Grassmannian metrics.
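A minimal illustrative sketch (not the paper's code) of this representation: a top-k magnitude mask can be written as a D x k matrix whose columns are standard basis vectors of the selected coordinates, which has orthonormal columns and therefore lies on the Stiefel manifold. The function name `mask_to_stiefel` is hypothetical.

```python
import numpy as np

def mask_to_stiefel(params, k):
    """Top-k magnitude mask as a D x k matrix with orthonormal
    columns (standard basis vectors of the selected coordinates)."""
    D = params.size
    idx = np.argsort(np.abs(params))[::-1][:k]  # top-k indices by magnitude
    M = np.zeros((D, k))
    M[idx, np.arange(k)] = 1.0
    return M

params = np.array([0.1, -2.0, 0.5, 3.0, -0.2])
M = mask_to_stiefel(params, 2)
# Columns are orthonormal: M^T M = I_k, so span(M) is a point
# on the same Grassmannian as a top-k Hessian eigenbasis.
print(np.allclose(M.T @ M, np.eye(2)))  # True
```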
- Identifying the Overlap Metric: After reviewing several Grassmannian metrics (geodesic distance, chordal norm, projection norm, Fubini-Study), the paper identifies the overlap metric as the most suitable for this comparison.
$\mathrm{overlap}(M_1, M_2) = \frac{1}{k} \lVert M_1^\top M_2 \rVert_F^2 \in [0, 1]$
The overlap metric is interpretable (it measures the cosine similarity between the spans), stable across dimensions, related to other common metrics such as Intersection over Union (IoU) and Hamming distance, and has a simple, theoretically grounded random baseline expectation of $k/D$ (where k is the subspace dimension and D is the total number of parameters).
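A short NumPy sketch of the overlap metric and its random baseline (illustrative only; `overlap` here is a direct transcription of the formula above, and the random-subspace comparison assumes Gaussian matrices orthonormalized via QR):

```python
import numpy as np

def overlap(M1, M2):
    """Overlap between the spans of two D x k matrices with
    orthonormal columns: (1/k) * ||M1^T M2||_F^2, in [0, 1]."""
    k = M1.shape[1]
    return np.linalg.norm(M1.T @ M2, "fro") ** 2 / k

# Two independent random k-dim subspaces of R^D have expected
# overlap approximately k/D.
D, k = 1000, 20
rng = np.random.default_rng(0)
Q1, _ = np.linalg.qr(rng.standard_normal((D, k)))
Q2, _ = np.linalg.qr(rng.standard_normal((D, k)))
print(overlap(Q1, Q1))        # identical subspaces -> 1.0
print(overlap(Q1, Q2), k / D) # both close to 0.02
```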
- Scalable Hessian Eigendecomposition (SEIGH): Computing the overlap requires the top-k eigenbasis of the Hessian, which is computationally infeasible for large networks using standard methods. To address this, the paper develops SEIGH (Sketched EIGHendecomposition), a matrix-free algorithm based on sketched Singular Value Decomposition (SVD) techniques. SEIGH leverages the Hessian's symmetry and uses randomized measurements (Hessian-vector products) to approximate the top-k eigenpairs efficiently.
- Algorithm: SEIGH adapts the single-pass sketched SVD algorithm of Tropp et al. (2019) for Hermitian matrices. It requires only O(k) Hessian-vector products (HVPs) and O(Dk) memory, making it scalable. It uses uncorrelated inner and outer random measurement matrices and oversampling for numerical stability and accuracy.
- Implementation: The algorithm involves performing n_outer HVPs with a random matrix Ω_O, performing n_inner HVPs with another random matrix Ω_I, orthogonalizing the outer measurements via QR decomposition to obtain a basis Q, and then solving a smaller n_outer × n_outer eigenproblem involving Q and the inner products.
```python
# Simplified SEIGH sketch. The full algorithm replaces the extra HVPs
# in the core step with a least-squares solve (via pseudo-inverses)
# against the inner sketch; see Tropp et al. (2019).
import numpy as np

def seigh(hessian_hvp, D, k, n_outer, n_inner, seed=0):
    rng = np.random.default_rng(seed)
    # Uncorrelated random measurement matrices
    Omega_O = rng.standard_normal((D, n_outer))
    Omega_I = rng.standard_normal((D, n_inner))
    # Outer measurements: n_outer HVPs
    Y_O = hessian_hvp(Omega_O)            # D x n_outer
    # Inner measurements: n_inner HVPs (consumed by the full
    # algorithm's core least-squares solve; unused in this sketch)
    Y_I = hessian_hvp(Omega_I)            # D x n_inner
    # Orthogonalize the outer measurements
    Q, _ = np.linalg.qr(Y_O)              # Q is D x n_outer
    # Core eigenproblem (simplified view: extra HVPs with Q)
    core = Q.T @ hessian_hvp(Q)           # n_outer x n_outer
    core = (core + core.T) / 2            # enforce symmetry
    evals, evecs = np.linalg.eigh(core)
    top = np.argsort(evals)[::-1][:k]     # eigh returns ascending order
    # Recover the top-k eigenpairs of H
    return evals[top], Q @ evecs[:, top]  # (k,), D x k
```
- Experimental Validation:
- Small Scale (MNIST): On a small network (≈ 7k parameters) trained on downsampled MNIST, the paper first verifies the early crystallization phenomena for both magnitude masks and Hessian eigenspaces. It then computes the exact overlap and shows that the SEIGH approximation closely tracks it. The measured overlap is consistently higher than the random-chance baseline ($k/D$).
- Large Scale (CIFAR-10, CIFAR-100, ImageNet): Using SEIGH, the study computes the approximate overlap for larger networks (up to ResNet-18 with >11M parameters) and larger k (up to 1500). The results confirm that the overlap between magnitude masks and top Hessian eigenspaces is substantially higher than random chance across different models, datasets, training stages, and values of k. The effect is more pronounced for larger networks, with the overlap up to 1000 times higher than the baseline for ResNet-18 on ImageNet.
Key Finding: There is a significant, non-random alignment between the parameters with the largest magnitudes and the directions of highest curvature (top Hessian eigenvectors) in deep neural networks, especially early in training and for larger models.
Practical Implications:
- Understanding Network Structure: Provides insight into the relationship between parameter importance (magnitude) and loss landscape geometry (curvature).
- Algorithm Development: The connection suggests potential for using cheap parameter magnitude information to approximate expensive Hessian properties for tasks like:
- Optimization: Designing adaptive optimizers.
- Pruning: Developing more principled pruning methods beyond simple magnitude.
- Uncertainty Quantification: Improving methods like Laplace approximation.
- Analysis Tool: SEIGH enables Hessian spectral analysis at unprecedented scales (k > 1000 eigenpairs for models with >10^7 parameters).
Limitations: While SEIGH scales much better than previous methods, computing thousands of HVPs and storing k dense vectors still presents a computational challenge for extremely large models. The paper focuses on analysis rather than providing a ready-to-use downstream application incorporating this finding.