
Average gradient outer product as a mechanism for deep neural collapse

Published 21 Feb 2024 in cs.LG and stat.ML | (2402.13728v6)

Abstract: Deep Neural Collapse (DNC) refers to the surprisingly rigid structure of the data representations in the final layers of Deep Neural Networks (DNNs). Though the phenomenon has been measured in a variety of settings, its emergence is typically explained via data-agnostic approaches, such as the unconstrained features model. In this work, we introduce a data-dependent setting where DNC forms due to feature learning through the average gradient outer product (AGOP). The AGOP is defined with respect to a learned predictor and is equal to the uncentered covariance matrix of its input-output gradients averaged over the training dataset. The Deep Recursive Feature Machine (Deep RFM) is a method that constructs a neural network by iteratively mapping the data with the AGOP and applying an untrained random feature map. We demonstrate empirically that DNC occurs in Deep RFM across standard settings as a consequence of the projection with the AGOP matrix computed at each layer. Further, we theoretically explain DNC in Deep RFM in an asymptotic setting and as a result of kernel learning. We then provide evidence that this mechanism holds for neural networks more generally. In particular, we show that the right singular vectors and values of the weights can be responsible for the majority of within-class variability collapse for DNNs trained in the feature learning regime. As observed in recent work, this singular structure is highly correlated with that of the AGOP.


Summary

  • The paper identifies AGOP as the critical mechanism behind deep neural collapse by linking feature learning to weight dynamics.
  • It employs SVD, Deep Recursive Feature Machines, and kernel analysis to quantitatively show that weight projections significantly reduce within-class variance.
  • The study implies that modulating AGOP-induced structures could enhance transferability and out-of-distribution performance in deep networks.

Average Gradient Outer Product as a Mechanism for Deep Neural Collapse

Introduction

The paper "Average gradient outer product as a mechanism for deep neural collapse" (2402.13728) addresses the formation of Deep Neural Collapse (DNC) in deep neural networks (DNNs) by linking it to the process of feature learning governed by the Average Gradient Outer Product (AGOP). DNC is characterized by the convergence of within-class feature variability to zero and the emergence of equiangular tight frame (ETF) geometry among class means, commonly observed in the final and often intermediate layers of overparameterized DNNs trained for classification. Theoretical explanations have previously been dominated by feature-agnostic models, notably the (Deep) Unconstrained Features Model ((D)UFM), which treat neural representations as directly optimized variables, ignoring the dynamics and impact of data-induced feature learning. This work rigorously argues and demonstrates that the progression and optimality of DNC fundamentally arise from the AGOP-induced structure within the learned weights of DNNs, rather than merely from architectural non-linearities or feature-agnostic assumptions.

Mechanistic Analysis of DNC Formation

The authors first provide a formal and empirical decomposition of the mechanisms that yield within-class variability collapse (DNC1) in deep networks. Leveraging singular value decomposition (SVD), they distinguish the effects of the right singular space and singular values of each weight matrix in the network from the impact of nonlinear activations.

They demonstrate that in fully connected networks, applying the right singular space and singular values of each weight matrix (i.e., projecting onto the "denoising" directions of that matrix) is responsible for the overwhelming majority of the decrease in within-class variance, as opposed to the nonlinear activation. This is quantified using the standard tr(Σ_W)/tr(Σ_B) metric, which decreases primarily upon projection by S_l V_l^T in the SVD W_l = U_l S_l V_l^T, indicating that this operation performs the essential denoising and feature selection for class discrimination.

Figure 1: Feature variability collapse from different SVD components on CIFAR-10, revealing a pronounced drop in within-class variability primarily after projection onto the right singular space of the weight matrix.

Subsequent layers and nonlinearities contribute negligibly to further collapse, indicating that the weight structure—determined during training—is the core operator effecting DNC1.
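The ablation above can be sketched on synthetic data (the two-class setup and the diagonal "denoising" weight matrix here are illustrative assumptions, not the paper's experiments): the tr(Σ_W)/tr(Σ_B) metric drops once features are projected by S_l V_l^T.

```python
import numpy as np

def nc1_metric(H, labels):
    """Within- over between-class variability: tr(Sigma_W) / tr(Sigma_B)."""
    mu = H.mean(axis=0)
    tr_W = tr_B = 0.0
    for c in np.unique(labels):
        Hc = H[labels == c]
        mu_c = Hc.mean(axis=0)
        tr_W += ((Hc - mu_c) ** 2).sum()             # within-class scatter
        tr_B += len(Hc) * ((mu_c - mu) ** 2).sum()   # between-class scatter
    return tr_W / tr_B

# Synthetic two-class features: class signal on axis 0, isotropic noise.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
means = np.array([[3.0, 0.0], [-3.0, 0.0]])
H = means[labels] + rng.normal(size=(100, 2))

# Hypothetical weight matrix whose small singular value suppresses the noise axis.
W = np.array([[1.0, 0.0], [0.0, 0.1]])
U, s, Vt = np.linalg.svd(W)
H_proj = (H @ Vt.T) * s          # apply S_l V_l^T only, as in the SVD ablation
```

On this toy data, `nc1_metric(H_proj, labels)` comes out below `nc1_metric(H, labels)`: the projection shrinks within-class scatter while leaving the class-mean separation along the signal direction intact.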

The Role of AGOP and Neural Feature Ansatz

The Gram matrix of the weight matrix, W_l^T W_l, is shown to be highly correlated with the AGOP associated with the input representations to that layer. This has been formalized as the Neural Feature Ansatz (NFA): in deep linear and nonlinear networks, the Gram matrix converges to a form proportional to the AGOP with respect to layer inputs.

The AGOP, defined as the empirical mean of the outer products of gradients of the model's outputs with respect to the current-layer inputs, encapsulates the task-specific, data-driven feature information. Through prior work and extensive experimental evidence, it has been shown that AGOP-based structure tracks the evolution of learned representations throughout depth, acting as both the signature and the driver of feature learning and selectivity in DNNs.
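The definition above translates directly into code. A minimal sketch (the `grad_fn` callback and the linear-model sanity check are illustrative, not the paper's implementation):

```python
import numpy as np

def agop(grad_fn, X):
    """AGOP: uncentered covariance of input-output gradients averaged over
    the data, M = (1/n) * sum_i g(x_i) g(x_i)^T."""
    d = X.shape[1]
    M = np.zeros((d, d))
    for x in X:
        g = grad_fn(x)               # gradient of the predictor's output at x
        M += np.outer(g, g)
    return M / len(X)

# Sanity check: a linear predictor f(x) = w . x has gradient w everywhere,
# so its AGOP is exactly the rank-one matrix w w^T.
w = np.array([1.0, -2.0, 0.5])
X = np.random.default_rng(1).normal(size=(20, 3))
M = agop(lambda x: w, X)
```

For nonlinear predictors the gradients vary with the input, and the AGOP picks up the directions in input space along which the predictor is, on average, most sensitive.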

Deep Recursive Feature Machines and DNC

To isolate the mechanistic role of AGOP in DNC independent of full differentiation-based training, the authors utilize the Deep Recursive Feature Machine (Deep RFM). Deep RFM recursively applies AGOP-based projections followed by random feature maps to the data, constructing a deep stack that models the principal component of DNN feature learning.
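One Deep RFM layer can be sketched roughly as follows. This is a simplified, hedged version: it substitutes a linear ridge predictor for the kernel machine used in the paper (so the AGOP is exactly w w^T) and uses an untrained ReLU random-feature map; all names and defaults are illustrative.

```python
import numpy as np

def deep_rfm_layer(X, y, ridge=1e-3, width=512, rng=None):
    """One simplified Deep RFM layer: fit a predictor, form its AGOP M,
    project the data by M^{1/2}, then apply untrained random features."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, d = X.shape
    # Linear ridge predictor as a stand-in for the paper's kernel machine.
    w = np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ y)
    M = np.outer(w, w)                       # AGOP of a linear model is w w^T
    # Matrix square root via eigendecomposition (M is symmetric PSD).
    vals, vecs = np.linalg.eigh(M)
    M_half = vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T
    X_proj = X @ M_half                      # AGOP projection: where collapse occurs
    G = rng.normal(size=(d, width)) / np.sqrt(d)
    return np.maximum(X_proj @ G, 0.0)       # untrained ReLU random-feature map
```

Stacking this function — feeding each layer's output back in as the next layer's input — yields the deep construction; in the paper's experiments it is the `X @ M_half` step, not the random features, that drives the collapse metrics.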

It is demonstrated that Deep RFM networks exhibit clear and strong DNC for both within-class variance (NC1) and ETF alignment of class means (NC2). Empirically, almost the entire reduction in NC1 and approach to NC2 optimality occurs immediately after projection by M_l^{1/2} (where M_l is the AGOP), with subsequent random feature maps either barely affecting or slightly degrading the metrics.


Figure 2: Progression of DNC1 and DNC2 in Deep RFM applied to CIFAR-10, showing sharp collapse in variability and class-mean ETF-ness after AGOP projection but not after the random features.

Visualization of inner products between all feature vectors in the top layers (after centering and normalization) exhibits the perfect class clustering and between-class ETF geometry characteristic of neural collapse.

Figure 3: Inner product matrix of feature vectors from multiple layers of Deep RFM on CIFAR-10; final layers show within-class vectors perfectly aligned and between-class vectors obeying ETF structure.

Experiments are extended to other datasets including SVHN and MNIST, substantiating that the emergence of collapse is robust to data domain and architecture.
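The NC2 target geometry referenced above can be constructed and verified directly. A standard simplex-ETF construction (illustrative; not taken from the paper's code):

```python
import numpy as np

def simplex_etf(K):
    """K vectors forming a simplex equiangular tight frame: equal norms,
    and every pairwise inner product equal to -1/(K-1)."""
    return np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)

M = simplex_etf(10)      # columns play the role of 10 collapsed class means
G = M.T @ M              # Gram matrix: 1 on the diagonal, -1/9 off it
```

Measuring how close the empirical Gram matrix of (centered, normalized) class means is to `G` is exactly the NC2 check applied to Deep RFM's layers.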

Theoretical Underpinning

The authors offer both asymptotic and non-asymptotic theoretical analyses:

  • Asymptotic Setting: As data dimensionality and sample size increase, non-linear kernel maps become near-linear due to established random matrix results, and the recursive application of AGOP projections in Deep RFM yields convergence to idealized collapse. The reduction in NC1 is shown to be exponential in the number of layers, up to an error dependent on kernel non-linearity parameters.
  • Finite-Sample (Non-asymptotic) Setting: For analytically tractable kernel families (e.g., parametrized RBFs), the kernel ridge regression solution with AGOP-learned metrics essentially yields the Gram structure associated with DNC. The optimal kernel matrix in this framework is one whose block structure corresponds to class clustering, establishing that collapse is not merely heuristic but emerges as an optimal solution under natural regularization schemes.
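The metric-learning view in the second bullet can be sketched as kernel ridge regression under a Mahalanobis-type RBF kernel, where the metric M would be supplied by the AGOP. The parametrization and names below are assumptions for illustration, not the paper's exact kernel family:

```python
import numpy as np

def rbf_kernel_M(X, Z, M, gamma=1.0):
    """RBF kernel with a learned symmetric PSD metric M:
    k(x, z) = exp(-gamma * (x - z)^T M (x - z))."""
    XM, ZM = X @ M, Z @ M
    d2 = (XM * X).sum(1)[:, None] + (ZM * Z).sum(1)[None, :] - 2.0 * XM @ Z.T
    return np.exp(-gamma * np.clip(d2, 0.0, None))

def krr_fit(X, y, M, lam=1e-3, gamma=1.0):
    """Kernel ridge regression coefficients: alpha = (K + lam*I)^{-1} y."""
    K = rbf_kernel_M(X, X, M, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

# With M = I this reduces to ordinary RBF kernel ridge regression.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = rng.normal(size=40)
alpha = krr_fit(X, y, np.eye(3))
```

A low-rank, AGOP-learned M stretches distances along predictive directions and collapses them along irrelevant ones, which is precisely what pushes the kernel matrix toward the class-blocked structure associated with DNC.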

AGOP vs. Random Features

A key technical result is the proof that random features and non-linearities in isolation cannot substantially reduce within-class variance. Any contribution of the random feature map to DNC1 is minor; in fact, the map tends to increase effective distances. Analytical expressions and simulations confirm that only AGOP-driven projections introduce the sharpness of collapse characteristic of DNNs at convergence.


Figure 4: Relationship between the input distance ‖x − y‖ and the expected random-feature distance, showing only modest, mostly monotonic expansion of vector distances after ReLU random features.
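A small Monte Carlo sketch of this point (synthetic unit vectors; the dimension, width, and perturbation scheme are arbitrary choices): a ReLU random-feature map changes pairwise distances only modestly and monotonically, so it cannot by itself produce the sharp collapse attributed to the AGOP projection.

```python
import numpy as np

rng = np.random.default_rng(0)
d, width = 50, 20000
G = rng.normal(size=(width, d))
phi = lambda v: np.maximum(G @ v, 0.0) / np.sqrt(width)  # ReLU random features

x = rng.normal(size=d); x /= np.linalg.norm(x)
u = rng.normal(size=d)          # one fixed perturbation direction

def pair_dists(eps):
    """Distance between x and a perturbed unit vector, before and after phi."""
    y = x + eps * u
    y /= np.linalg.norm(y)
    return np.linalg.norm(x - y), np.linalg.norm(phi(x) - phi(y))

dists = [pair_dists(e) for e in (0.05, 0.2, 0.8)]
```

Pairs that start closer stay closer after the feature map: the ordering of distances is preserved, consistent with the monotone relationship in Figure 4.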

Implications and Future Directions

This study substantiates that DNC in DNNs is a consequence of feature learning orchestrated by weight evolution toward AGOP-induced structure—a process inextricably linked to the training data and the task. Unlike the UFM perspective, this work ties collapse behavior to the explicit, data-driven dynamics of end-to-end training.

Practically, these results imply that interventions designed to modulate or regularize AGOP structure, or architectures which embed projections aligned to AGOP, should systematically influence both the emergence and properties of DNC, as well as downstream phenomena such as transferability and out-of-distribution generalization. The theoretical bridges drawn to kernel learning and metric parameterization suggest new algorithmic avenues for efficiently inducing neural collapse-like representations in non-gradient-based or hybrid frameworks.

Theoretically, establishing AGOP as the mechanistic driver of DNC reframes the problem from one of abstract optimization geometry to a question of data-induced metric learning, sharpening possible research into alignment, simplicity bias, and the interaction of kernel and weight evolution in deep networks.

Conclusion

By combining rigorous SVD-based ablation, comprehensive empirical analysis with Deep RFMs, and theoretical treatment of recursive AGOP feature learning, the paper demonstrates that DNC in DNNs is predominantly a consequence of AGOP-driven feature selection. The findings offer a concrete data- and feature-centric explanation for neural collapse, advancing understanding beyond feature-agnostic models and opening new avenues for control and utilization of neural representation geometry in deep learning.
