Neural Collapse Dynamics: Depth, Activation, Regularisation, and Feature Norm Threshold

Published 31 Mar 2026 in cs.LG | (2604.00230v1)

Abstract: Neural collapse (NC) -- the convergence of penultimate-layer features to a simplex equiangular tight frame -- is well understood at equilibrium, but the dynamics governing its onset remain poorly characterised. We identify a simple and predictive regularity: NC occurs when the mean feature norm reaches a model-dataset-specific critical value, fn*, that is largely invariant to training conditions. This value concentrates tightly within each (model, dataset) pair (CV < 8%); training dynamics primarily affect the rate at which fn approaches fn*, rather than the value itself. In standard training trajectories, the crossing of fn below fn* consistently precedes NC onset, providing a practical predictor with a mean lead time of 62 epochs (MAE 24 epochs). A direct intervention experiment confirms fn* is a stable attractor of the gradient flow -- perturbations to feature scale are self-corrected during training, with convergence to the same value regardless of direction (p>0.2). Completing the (architecture)x(dataset) grid reveals the paper's strongest result: ResNet-20 on MNIST gives fn* = 5.867 -- a +458% architecture effect versus only +68% on CIFAR-10. The grid is strongly non-additive; fn* cannot be decomposed into independent architecture and dataset contributions. Four structural regularities emerge: (1) depth has a non-monotonic effect on collapse speed; (2) activation jointly determines both collapse speed and fn*; (3) weight decay defines a three-regime phase diagram -- too little slows, an optimal range is fastest, and too much prevents collapse; (4) width monotonically accelerates collapse while shifting fn* by at most 13%. These results establish feature-norm dynamics as an actionable diagnostic for predicting NC timing, suggesting that norm-threshold behaviour is a general mechanism underlying delayed representational reorganisation in deep networks.

Abstract PDF Upgrade to Chat

Authors (1)

Anamika Paul Rupa

Summary

The paper shows that neural collapse occurs when the mean feature norm reaches a specific threshold, serving as a reliable indicator of representational transition.
It systematically examines how depth, activation, weight decay, and width affect collapse speed, revealing non-monotonic and architecture–dataset interactions.
Empirical results from MNIST and CIFAR-10 demonstrate that while collapse timing is adjustable via hyperparameters, the threshold invariance remains robust.

Neural Collapse Dynamics: Structural Regularities and Feature Norm Thresholds

Summary and Motivation

"Neural Collapse Dynamics: Depth, Activation, Regularisation, and Feature Norm Threshold" (2604.00230) systematically investigates the temporal dynamics underlying neural collapse (NC)—the convergence of penultimate-layer features to a simplex equiangular tight frame—across varying architectures, activations, regularisation regimes, and widths. While prior research characterizes the equilibrium geometry and optimality of NC in deep networks, this work addresses the open question of what governs the timing and rate of collapse onset in practical training scenarios. A central empirical finding is that NC consistently occurs as the mean feature norm ( $\|\mathbf h\|$ ) descends to a sharply defined, model–dataset-specific threshold, regardless of training protocol or hyperparameter settings. This threshold proves robust and predictive, offering an actionable marker for impending representational reorganization.

Empirical Framework and Key Metrics

The analysis utilizes MNIST and CIFAR-10 as representative datasets, with architectures spanning multi-layer perceptrons (MLP) of variable depth and width, and ResNet-20. All models employ two-phase training: (1) cross-entropy to reach high accuracy, followed by (2) mean squared error (MSE) for driving collapse. The study focuses on four NC metrics: NC1 (within- to between-class scatter ratio), NC2 (class mean angle regularity), NC3 (classifier–class mean alignment), and the mean feature norm ( $\bar{\mathbf{h}}$ ), which serves as the critical progress variable.

Collapse Dynamics and Feature Norm Compression

The onset of NC is tightly linked to feature norm dynamics. For the MNIST MLP-5 baseline, feature norm compresses $25\times$ across two-phase training, with collapse marked by a stable threshold $\bar{\mathbf{h}} \approx 1.06$ .

Figure 1: MNIST baseline dynamics; NC1 collapses at epoch 310, accompanied by $25\times$ feature norm compression.

ResNet-20 on CIFAR-10 collapses at a higher threshold ( $\bar{\mathbf{h}} = 1.52$ ), demonstrating architecture–dataset specificity in equilibrium feature geometry.

Figure 2: CIFAR-10 ResNet-20 dynamics; collapse at epoch 660 with thresholded feature norm substantially above MNIST MLP values.

Depth, Activation, Width, and Regularisation Effects

The paper rigorously decomposes effects along several axes:

Depth: Collapse speed is non-monotonic; intermediate depth (MLP-3) is fastest, deeper models are slower, but all attain the same threshold.
Figure 3: Depth effect; intermediate depth accelerates collapse but does not alter threshold.
Activation: Both collapse speed and equilibrium threshold are determined by activation. ReLU yields lower thresholds, GELU is higher and more variable, Tanh is intermediate.
Figure 4: Activation functions jointly determine collapse speed and feature norm threshold.
Weight Decay: There is a Goldilocks regime; too little slows collapse, optimal values accelerate, excessive weight decay prevents collapse entirely. Yet the threshold is invariant over a $10\times$ range of $\lambda$ among successful regimes.
Figure 5: Weight decay sweep; $T_\text{NC}$ shifts by 90 epochs while threshold clustering remains tight.
Width: Width monotonically accelerates collapse but shifts the threshold by at most 13%, marking a decoupling of representational and convergence dynamics.
Figure 6: Width sweep; collapse speed improves but threshold norm remains nearly invariant.

A summary figure consolidates these findings, showing that collapse timing responds to architectural and hyperparameter manipulations, but threshold value is highly concentrated within each (model, dataset) pair.

Figure 7: Comprehensive summary of NC dynamics across depth, activation, weight decay, and width.

Architecture × Dataset Interaction

Crucially, completing the grid of ResNet-20 and MLP-5 across MNIST and CIFAR-10 reveals strongly non-additive interactions: the architecture effect (+458% on MNIST, only +68% on CIFAR-10) and dataset effect (up to -74%) are not separable and do not combine additively or multiplicatively.

Figure 8: ResNet-20 on MNIST exhibits an extreme threshold (5.867), demonstrating strong architecture–dataset interaction.

Predictive Value and Gradient Flow Attractor

Monitoring feature norm crossing below the threshold provides a mean predictive lead of 62 epochs to collapse, with a mean absolute error of 24 epochs. Intervention experiments (rescaling feature norm by $0.3\times$ or $\bar{\mathbf{h}}$ 0) show rapid convergence back to the same threshold, confirming it as a stable attractor under gradient flow, not a causal trigger.

Figure 9: Protocol validation shows CE-only training cannot induce collapse; feature norm dynamics underpin robust detection.

Theoretical Implications and Unified Dynamics

These empirical signatures—the norm threshold invariance, collapse prediction, and non-monotonic parameter effects—unify neural collapse with grokking (delayed generalization phase in algorithmic tasks), where norm thresholds similarly predict representational transitions. The findings indicate that delayed phase changes in deep networks—be it NC or grokking—are governed by norm dynamics specific to the architecture–dataset pair, rather than optimization per se.

Limitations

Dataset diversity is limited and small-sample caveats apply (notably for MLP-5/CIFAR-10), but effect sizes are large and consistent, making qualitative reversal unlikely. The width scaling regression is underpowered and should be interpreted as an upper bound.

Practical and Theoretical Implications

Monitoring feature norm offers a reliable early-warning diagnostic for NC onset. The results highlight principled control levers (weight decay, width, depth) for collapse timing. The paradigm reframes NC theory from equilibrium geometry toward dynamics and thresholds, motivating further investigation into the mechanism—positive homogeneity, accumulated implicit regularization, BatchNorm-induced norm pinning, and gradient flow structure.

Conclusion

The work establishes that mean feature norm at collapse is a reproducible, highly concentrated structural property, specific to (model, dataset) pairs and predictive of NC onset. The collapse threshold is invariant under extensive hyperparameter variation and reveals a strong architecture–dataset interaction. Collapse timing can be controlled, but not the threshold value, except by jointly changing architecture or dataset. These structural regularities align closely with grokking dynamics, suggesting norm-threshold-driven delayed phase transitions are a general organizing principle in deep learning representation theory.

Markdown Report Issue