
Average gradient outer product as a mechanism for deep neural collapse

Published 21 Feb 2024 in cs.LG and stat.ML | (2402.13728v6)

Abstract: Deep Neural Collapse (DNC) refers to the surprisingly rigid structure of the data representations in the final layers of Deep Neural Networks (DNNs). Though the phenomenon has been measured in a variety of settings, its emergence is typically explained via data-agnostic approaches, such as the unconstrained features model. In this work, we introduce a data-dependent setting where DNC forms due to feature learning through the average gradient outer product (AGOP). The AGOP is defined with respect to a learned predictor and is equal to the uncentered covariance matrix of its input-output gradients averaged over the training dataset. The Deep Recursive Feature Machine (Deep RFM) is a method that constructs a neural network by iteratively mapping the data with the AGOP and applying an untrained random feature map. We demonstrate empirically that DNC occurs in Deep RFM across standard settings as a consequence of the projection with the AGOP matrix computed at each layer. Further, we theoretically explain DNC in Deep RFM in an asymptotic setting and as a result of kernel learning. We then provide evidence that this mechanism holds for neural networks more generally. In particular, we show that the right singular vectors and values of the weights can be responsible for the majority of within-class variability collapse for DNNs trained in the feature learning regime. As observed in recent work, this singular structure is highly correlated with that of the AGOP.


Summary

  • The paper identifies AGOP as the critical mechanism behind deep neural collapse by linking feature learning to weight dynamics.
  • It employs SVD, Deep Recursive Feature Machines, and kernel analysis to quantitatively show that weight projections significantly reduce within-class variance.
  • The study implies that modulating AGOP-induced structures could enhance transferability and out-of-distribution performance in deep networks.

Average Gradient Outer Product as a Mechanism for Deep Neural Collapse

Introduction

The paper "Average gradient outer product as a mechanism for deep neural collapse" (2402.13728) addresses the formation of Deep Neural Collapse (DNC) in deep neural networks (DNNs) by linking it to the process of feature learning governed by the Average Gradient Outer Product (AGOP). DNC is characterized by the convergence of within-class feature variability to zero and the emergence of equiangular tight frame (ETF) geometry among class means, commonly observed in the final and often intermediate layers of overparameterized DNNs trained for classification. Theoretical explanations have previously been dominated by feature-agnostic models, notably the (Deep) Unconstrained Features Model ((D)UFM), which treat neural representations as directly optimized variables, ignoring the dynamics and impact of data-induced feature learning. This work rigorously argues and demonstrates that the progression and optimality of DNC fundamentally arise from the AGOP-induced structure within the learned weights of DNNs, rather than merely from architectural non-linearities or feature-agnostic assumptions.

Mechanistic Analysis of DNC Formation

The authors first provide a formal and empirical decomposition of the mechanisms that yield within-class variability collapse (DNC1) in deep networks. Leveraging singular value decomposition (SVD), they distinguish the effects of the right singular space and singular values of each weight matrix in the network from the impact of nonlinear activations.

They demonstrate that in fully connected networks, applying the right singular space and singular values of each weight matrix (i.e., projecting onto the "denoising" directions of that matrix) is responsible for the overwhelming majority of the decrease in within-class variance, as opposed to the nonlinear activation. This is quantified using the standard tr(Σ_W)/tr(Σ_B) metric, which decreases primarily upon projection by S_l V_l^T in the SVD W_l = U_l S_l V_l^T, indicating that this operation performs the essential denoising and feature selection for class discrimination.

Figure 1: Feature variability collapse from different SVD components on CIFAR-10, revealing a pronounced drop in within-class variability primarily after projection onto the right singular space of the weight matrix.

Subsequent layers and nonlinearities contribute negligibly to further collapse, indicating that the weight structure—determined during training—is the core operator effecting DNC1.
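The ablation above can be sketched on synthetic data (the two-class setup and the diagonal "denoising" weight matrix here are illustrative assumptions, not the paper's experiments): the tr(Σ_W)/tr(Σ_B) metric drops once features are projected by S_l V_l^T.

```python
import numpy as np

def nc1_metric(H, labels):
    """Within- over between-class variability: tr(Sigma_W) / tr(Sigma_B)."""
    mu = H.mean(axis=0)
    tr_W = tr_B = 0.0
    for c in np.unique(labels):
        Hc = H[labels == c]
        mu_c = Hc.mean(axis=0)
        tr_W += ((Hc - mu_c) ** 2).sum()             # within-class scatter
        tr_B += len(Hc) * ((mu_c - mu) ** 2).sum()   # between-class scatter
    return tr_W / tr_B

# Synthetic two-class features: class signal on axis 0, isotropic noise.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
means = np.array([[3.0, 0.0], [-3.0, 0.0]])
H = means[labels] + rng.normal(size=(100, 2))

# Hypothetical weight matrix whose small singular value suppresses the noise axis.
W = np.array([[1.0, 0.0], [0.0, 0.1]])
U, s, Vt = np.linalg.svd(W)
H_proj = (H @ Vt.T) * s          # apply S_l V_l^T only, as in the SVD ablation
```

On this toy data, `nc1_metric(H_proj, labels)` comes out below `nc1_metric(H, labels)`: the projection shrinks within-class scatter while leaving the class-mean separation along the signal direction intact.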

The Role of AGOP and Neural Feature Ansatz

The Gram matrix of the weight matrix, W_l^T W_l, is shown to be highly correlated with the AGOP associated with the input representations to that layer. This has been formalized as the Neural Feature Ansatz (NFA): in deep linear and nonlinear networks, the Gram matrix converges to a form proportional to the AGOP with respect to layer inputs.

The AGOP, defined as the empirical mean of the outer products of gradients of the model's outputs with respect to the current-layer inputs, encapsulates the task-specific, data-driven feature information. Through prior work and extensive experimental evidence, it has been shown that AGOP-based structure tracks the evolution of learned representations throughout depth, acting as both the signature and the driver of feature learning and selectivity in DNNs.
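The definition above translates directly into code. A minimal sketch (the `grad_fn` callback and the linear-model sanity check are illustrative, not the paper's implementation):

```python
import numpy as np

def agop(grad_fn, X):
    """AGOP: uncentered covariance of input-output gradients averaged over
    the data, M = (1/n) * sum_i g(x_i) g(x_i)^T."""
    d = X.shape[1]
    M = np.zeros((d, d))
    for x in X:
        g = grad_fn(x)               # gradient of the predictor's output at x
        M += np.outer(g, g)
    return M / len(X)

# Sanity check: a linear predictor f(x) = w . x has gradient w everywhere,
# so its AGOP is exactly the rank-one matrix w w^T.
w = np.array([1.0, -2.0, 0.5])
X = np.random.default_rng(1).normal(size=(20, 3))
M = agop(lambda x: w, X)
```

For nonlinear predictors the gradients vary with the input, and the AGOP picks up the directions in input space along which the predictor is, on average, most sensitive.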

Deep Recursive Feature Machines and DNC

To isolate the mechanistic role of AGOP in DNC independent of full differentiation-based training, the authors utilize the Deep Recursive Feature Machine (Deep RFM). Deep RFM recursively applies AGOP-based projections followed by random feature maps to the data, constructing a deep stack that models the principal component of DNN feature learning.
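One Deep RFM layer can be sketched roughly as follows. This is a simplified, hedged version: it substitutes a linear ridge predictor for the kernel machine used in the paper (so the AGOP is exactly w w^T) and uses an untrained ReLU random-feature map; all names and defaults are illustrative.

```python
import numpy as np

def deep_rfm_layer(X, y, ridge=1e-3, width=512, rng=None):
    """One simplified Deep RFM layer: fit a predictor, form its AGOP M,
    project the data by M^{1/2}, then apply untrained random features."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, d = X.shape
    # Linear ridge predictor as a stand-in for the paper's kernel machine.
    w = np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ y)
    M = np.outer(w, w)                       # AGOP of a linear model is w w^T
    # Matrix square root via eigendecomposition (M is symmetric PSD).
    vals, vecs = np.linalg.eigh(M)
    M_half = vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T
    X_proj = X @ M_half                      # AGOP projection: where collapse occurs
    G = rng.normal(size=(d, width)) / np.sqrt(d)
    return np.maximum(X_proj @ G, 0.0)       # untrained ReLU random-feature map
```

Stacking this function — feeding each layer's output back in as the next layer's input — yields the deep construction; in the paper's experiments it is the `X @ M_half` step, not the random features, that drives the collapse metrics.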

It is demonstrated that Deep RFM networks exhibit clear and strong DNC for both within-class variance (NC1) and ETF alignment of class means (NC2). Empirically, almost the entire reduction in NC1 and approach to NC2 optimality occurs immediately after projection by M_l^{1/2} (where M_l is the AGOP), with subsequent random feature maps either barely affecting or slightly degrading the metrics.


Figure 2: Progression of DNC1 and DNC2 in Deep RFM applied to CIFAR-10, showing sharp collapse in variability and class-mean ETF-ness after AGOP projection but not after the random features.

Visualization of inner products between all feature vectors in the top layers (after centering and normalization) exhibits the perfect class clustering and between-class ETF geometry characteristic of neural collapse.

Figure 3: Inner product matrix of feature vectors from multiple layers of Deep RFM on CIFAR-10; final layers show within-class vectors perfectly aligned and between-class vectors obeying ETF structure.

Experiments are extended to other datasets including SVHN and MNIST, substantiating that the emergence of collapse is robust to data domain and architecture.
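The NC2 target geometry referenced above can be constructed and verified directly. A standard simplex-ETF construction (illustrative; not taken from the paper's code):

```python
import numpy as np

def simplex_etf(K):
    """K vectors forming a simplex equiangular tight frame: equal norms,
    and every pairwise inner product equal to -1/(K-1)."""
    return np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)

M = simplex_etf(10)      # columns play the role of 10 collapsed class means
G = M.T @ M              # Gram matrix: 1 on the diagonal, -1/9 off it
```

Measuring how close the empirical Gram matrix of (centered, normalized) class means is to `G` is exactly the NC2 check applied to Deep RFM's layers.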

Theoretical Underpinning

The authors offer both asymptotic and non-asymptotic theoretical analyses:

  • Asymptotic Setting: As data dimensionality and sample size increase, non-linear kernel maps become near-linear due to established random matrix results, and the recursive application of AGOP projections in Deep RFM yields convergence to idealized collapse. The reduction in NC1 is shown to be exponential in the number of layers, up to an error dependent on kernel non-linearity parameters.
  • Finite-Sample (Non-asymptotic) Setting: For analytically tractable kernel families (e.g., parametrized RBFs), the kernel ridge regression solution with AGOP-learned metrics essentially yields the Gram structure associated with DNC. The optimal kernel matrix in this framework is one whose block structure corresponds to class clustering, establishing that collapse is not merely heuristic but emerges as an optimal solution under natural regularization schemes.
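The metric-learning view in the second bullet can be sketched as kernel ridge regression under a Mahalanobis-type RBF kernel, where the metric M would be supplied by the AGOP. The parametrization and names below are assumptions for illustration, not the paper's exact kernel family:

```python
import numpy as np

def rbf_kernel_M(X, Z, M, gamma=1.0):
    """RBF kernel with a learned symmetric PSD metric M:
    k(x, z) = exp(-gamma * (x - z)^T M (x - z))."""
    XM, ZM = X @ M, Z @ M
    d2 = (XM * X).sum(1)[:, None] + (ZM * Z).sum(1)[None, :] - 2.0 * XM @ Z.T
    return np.exp(-gamma * np.clip(d2, 0.0, None))

def krr_fit(X, y, M, lam=1e-3, gamma=1.0):
    """Kernel ridge regression coefficients: alpha = (K + lam*I)^{-1} y."""
    K = rbf_kernel_M(X, X, M, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

# With M = I this reduces to ordinary RBF kernel ridge regression.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = rng.normal(size=40)
alpha = krr_fit(X, y, np.eye(3))
```

A low-rank, AGOP-learned M stretches distances along predictive directions and collapses them along irrelevant ones, which is precisely what pushes the kernel matrix toward the class-blocked structure associated with DNC.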

AGOP vs. Random Features

A key technical result is the proof that random features and non-linearities in isolation cannot substantially reduce within-class variance. Any contribution of the random feature map to DNC1 is minor; in fact, the map tends to increase effective distances. Analytical expressions and simulations confirm that only AGOP-driven projections introduce the sharpness of collapse characteristic of DNNs at convergence.


Figure 4: Relationship between the input distance ‖x − y‖ and the expected random-feature distance, showing only modest, mostly monotonic expansion of vector distances after ReLU random features.
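A small Monte Carlo sketch of this point (synthetic unit vectors; the dimension, width, and perturbation scheme are arbitrary choices): a ReLU random-feature map changes pairwise distances only modestly and monotonically, so it cannot by itself produce the sharp collapse attributed to the AGOP projection.

```python
import numpy as np

rng = np.random.default_rng(0)
d, width = 50, 20000
G = rng.normal(size=(width, d))
phi = lambda v: np.maximum(G @ v, 0.0) / np.sqrt(width)  # ReLU random features

x = rng.normal(size=d); x /= np.linalg.norm(x)
u = rng.normal(size=d)          # one fixed perturbation direction

def pair_dists(eps):
    """Distance between x and a perturbed unit vector, before and after phi."""
    y = x + eps * u
    y /= np.linalg.norm(y)
    return np.linalg.norm(x - y), np.linalg.norm(phi(x) - phi(y))

dists = [pair_dists(e) for e in (0.05, 0.2, 0.8)]
```

Pairs that start closer stay closer after the feature map: the ordering of distances is preserved, consistent with the monotone relationship in Figure 4.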

Implications and Future Directions

This study substantiates that DNC in DNNs is a consequence of feature learning orchestrated by weight evolution toward AGOP-induced structure—a process inextricably linked to the training data and the task. Unlike the UFM perspective, this work ties collapse behavior to the explicit, data-driven dynamics of end-to-end training.

Practically, these results imply that interventions designed to modulate or regularize AGOP structure, or architectures which embed projections aligned to AGOP, should systematically influence both the emergence and properties of DNC, as well as downstream phenomena such as transferability and out-of-distribution generalization. The theoretical bridges drawn to kernel learning and metric parameterization suggest new algorithmic avenues for efficiently inducing neural collapse-like representations in non-gradient-based or hybrid frameworks.

Theoretically, establishing AGOP as the mechanistic driver of DNC reframes the problem from one of abstract optimization geometry to a question of data-induced metric learning, sharpening possible research into alignment, simplicity bias, and the interaction of kernel and weight evolution in deep networks.

Conclusion

By combining rigorous SVD-based ablation, comprehensive empirical analysis with Deep RFMs, and theoretical treatment of recursive AGOP feature learning, the paper demonstrates that DNC in DNNs is predominantly a consequence of AGOP-driven feature selection. The findings offer a concrete data- and feature-centric explanation for neural collapse, advancing understanding beyond feature-agnostic models and opening new avenues for control and utilization of neural representation geometry in deep learning.
