Terminal Phase of Training
- Terminal Phase of Training (TPT) is the regime where a deep neural network perfectly fits the training data yet continues optimizing the loss, leading to significant structural evolution.
- During TPT, networks exhibit Neural Collapse phenomena with symmetric feature representations and classifier weights, which underpin improved generalization and adversarial robustness.
- Continued training in TPT drives margin maximization similar to SVM solutions, enhancing test accuracy and offering insights on mitigating class imbalance.
The Terminal Phase of Training (TPT) in deep neural network classification designates the regime in which the empirical training error has first reached zero (i.e., the model perfectly interpolates all training samples) while optimization continues, further driving down the training loss (most commonly cross-entropy). TPT is characterized by dramatic structural evolution in network representations and classifier weights, leading to highly symmetric geometric and margin properties in feature space. In this regime, the network commonly exhibits Neural Collapse phenomena, which have direct implications for generalization, robustness, and inductive bias (Papyan et al., 2020, Gao et al., 2023, Kini et al., 2021).
1. Definition and Onset of Terminal Phase of Training
TPT is initiated at the earliest epoch $T_0$ at which the training error vanishes: $T_0 = \min\{t : \mathrm{err}_{\mathrm{train}}(t) = 0\}$. Importantly, despite perfect label interpolation, the loss function (e.g., cross-entropy) continues decreasing as optimization progresses. For overparameterized models, such as modern deep nets, TPT occupies a substantial fraction of total training and is distinct from the earlier fitting phase, as representations and classifier directions evolve markedly after $T_0$ (Papyan et al., 2020, Kini et al., 2021).
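The onset condition can be seen even in a toy model. The sketch below (a minimal logistic regression on synthetic separable data, not an experiment from the cited papers) shows the defining TPT signature: the 0-1 training error reaches zero at some epoch $T_0$, yet the cross-entropy loss keeps decreasing afterwards.

```python
import numpy as np

# Toy illustration of TPT onset: zero training error is reached early,
# but the cross-entropy loss continues to decrease afterwards.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (50, 2)), rng.normal(2.0, 0.5, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

w, b = np.zeros(2), 0.0
onset, losses = None, []
for epoch in range(500):
    z = X @ w + b
    # numerically stable binary cross-entropy: log(1+e^{-z}) + (1-y) z
    losses.append(np.mean(np.logaddexp(0.0, -z) + (1.0 - y) * z))
    err = np.mean((z > 0) != y.astype(bool))
    if err == 0.0 and onset is None:
        onset = epoch          # first epoch with zero training error: T_0
    p = 1.0 / (1.0 + np.exp(-z))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

print(onset, losses[onset], losses[-1])
```

Because the data are separable, the loss never plateaus at interpolation; the optimizer keeps shrinking it by growing the weight norm, which is exactly the regime the section describes.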
2. Neural Collapse Phenomena in TPT
During TPT, four interdependent phenomena, termed Neural Collapse (NC), emerge at the last classification layer:
- NC1: Variability Collapse. Within each class $c$, feature activations $h_{c,i}$ concentrate at their class mean $\mu_c$; equivalently, the within-class covariance $\Sigma_W \to 0$.
- NC2: Simplex Equiangular Tight Frame (ETF) Geometry. The globally centered class means $\tilde{\mu}_c = \mu_c - \mu_G$ converge to equal norms and maximally separated pairwise angles, $\cos\angle(\tilde{\mu}_c, \tilde{\mu}_{c'}) \to -\frac{1}{C-1}$ for $c \neq c'$; equivalently, up to scaling, the matrix of centered means converges to a simplex ETF $M = \sqrt{\tfrac{C}{C-1}}\, U \left(I_C - \tfrac{1}{C}\mathbf{1}\mathbf{1}^\top\right)$, with $U$ having orthonormal columns.
- NC3: Self-Dual Alignment. Each classifier vector $w_c$ aligns with its corresponding centered class mean: $\frac{w_c}{\|w_c\|} \to \frac{\tilde{\mu}_c}{\|\tilde{\mu}_c\|}$, so the classifier rows form the same simplex ETF as the class means.
- NC4: Nearest Class Center Rule. Classification converges to assignment by the nearest class mean: $\arg\max_c \langle w_c, h \rangle \to \arg\min_c \|h - \mu_c\|$.
These phenomena have been empirically validated on balanced datasets (e.g., MNIST, CIFAR-10/100, ImageNet) and architectures (VGG, ResNet, DenseNet) (Papyan et al., 2020, Gao et al., 2023).
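The ETF geometry above can be constructed and checked numerically. The sketch below builds a simplex ETF from the standard formula, verifies its equal-norm and equiangularity properties (NC2), and confirms that for collapsed features the linear classifier agrees with the nearest-class-center rule (NC4). The dimensions and noise level are illustrative choices, not values from the cited experiments.

```python
import numpy as np

# Construct a C-vertex simplex ETF in R^d and verify NC2/NC4 properties.
C, d = 4, 10
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.normal(size=(d, C)))  # orthonormal columns
M = np.sqrt(C / (C - 1)) * U @ (np.eye(C) - np.ones((C, C)) / C)  # simplex ETF

# NC2 checks: equal norms, pairwise cosines all equal to -1/(C-1).
norms = np.linalg.norm(M, axis=0)
G = (M / norms).T @ (M / norms)
off_diag = G[~np.eye(C, dtype=bool)]

# NC4 check: a feature collapsed near class mean mu_2 is classified the
# same way by the linear scores argmax <w_c, h> and by nearest class mean.
h = M[:, 2] + 0.01 * rng.normal(size=d)
linear_pred = np.argmax(M.T @ h)
ncc_pred = np.argmin(np.linalg.norm(M - h[:, None], axis=0))
print(np.allclose(norms, 1.0), np.allclose(off_diag, -1 / (C - 1)), linear_pred, ncc_pred)
```

The centering projector $I_C - \frac{1}{C}\mathbf{1}\mathbf{1}^\top$ is what forces the pairwise cosine to $-1/(C-1)$, the most negative value achievable by $C$ equal-norm vectors.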
3. Generalization and Margin Maximization in TPT
TPT is not merely a post-interpolation plateau: continued training substantially increases the minimal margin between classes. Given zero empirical error, the features are linearly separable, and TPT optimization drives the network towards the hard-margin multi-class SVM solution
$\min_{W} \|W\|_F^2 \quad \text{subject to} \quad \langle w_{y_i} - w_c, h_i \rangle \geq 1 \;\; \text{for all } i \text{ and all } c \neq y_i.$
Gradient flow under cross-entropy loss, without any explicit hinge term, sends the weight norm to infinity on separable data while the normalized iterates converge in direction to the max-margin separator, the same inductive bias as explicit SVM optimization.
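This implicit bias is observable in a toy binary version of the setup (an assumed synthetic example, not the papers' experiment): under gradient descent on logistic loss over separable data, the normalized margin $\min_i y_i \langle w, x_i \rangle / \|w\|$ grows towards the hard-margin SVM value even though the 0-1 error is already zero.

```python
import numpy as np

# Gradient descent on logistic loss over separable data: track the
# normalized minimal margin, which grows toward the SVM margin.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(2.0, 0.4, (40, 2)), rng.normal(-2.0, 0.4, (40, 2))])
y = np.concatenate([np.ones(40), -np.ones(40)])   # labels in {+1, -1}

w = np.zeros(2)
margins = []
for step in range(3000):
    z = y * (X @ w)
    p = 1.0 / (1.0 + np.exp(z))        # per-sample weight sigmoid(-y w.x)
    w += 0.1 * (X * (y * p)[:, None]).mean(axis=0)
    if step % 500 == 0:
        margins.append(np.min(y * (X @ w)) / np.linalg.norm(w))

print(margins[0], margins[-1])
```

Early in training the direction of $w$ is dominated by the class-mean difference; as the loss saturates on easy points, the remaining gradient comes from near-margin samples, pulling the direction toward the max-margin separator.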
Multi-class margin-based generalization bounds become tighter as the margin grows:
$\mathbb{P}_{(x,y)} \left[ \arg\max_c (M f(x))_c \neq y \right] \lesssim \sum_{y \neq y'} \frac{\mathfrak{R}(\mathcal{F})}{\gamma_{y, y'}} + \sum_{y \neq y'} \sqrt{\frac{\log\left(\log_2(4K/\gamma_{y, y'})\right)}{N/C}}$
where $\gamma_{y,y'}$ denotes the pairwise margin and $\mathfrak{R}(\mathcal{F})$ the Rademacher complexity of the feature class; both terms vanish as the margins grow, guaranteeing improved test accuracy as TPT proceeds (Gao et al., 2023).
4. Effects and Variability of ETF Alignment ("Non-Conservative Generalization")
Despite the equivalence of the training objective across rotations and permutations of an ETF configuration, not all configurations yield identical test performance. Real data exhibit heterogeneities in intra- and inter-class geometry, so permutations and rotations of the ETF–class assignment affect the effective pairwise margins $\gamma_{y,y'}$.
Empirical experiments show substantial variance across ETF permutations and rotations even under identical NC optimization (boxplots in Figures 1–6 of Gao et al., 2023), a phenomenon termed "non-conservative generalization": not all simplex-ETF minima generalize equally.
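A synthetic sketch makes the assignment effect concrete (this is an assumed toy setup, not the experiment from Gao et al.): fix a simplex-ETF classifier, fix heterogeneous class means, and enumerate all class-to-vertex assignments; the minimal margin varies across assignments.

```python
import numpy as np
from itertools import permutations

# Fix an ETF classifier and heterogeneous class means; different
# class-to-vertex assignments yield different minimal margins.
C, d = 4, 4
rng = np.random.default_rng(3)
U, _ = np.linalg.qr(rng.normal(size=(d, C)))
W = np.sqrt(C / (C - 1)) * U @ (np.eye(C) - np.ones((C, C)) / C)  # ETF columns

# Heterogeneous class means: per-class scales break the symmetry.
mu = rng.normal(size=(d, C)) * np.array([0.5, 1.0, 2.0, 4.0])

min_margins = []
for perm in permutations(range(C)):
    m = np.inf
    for c in range(C):   # margin of a class-c mean under this assignment
        logits = np.array([W[:, perm[k]] @ mu[:, c] for k in range(C)])
        m = min(m, logits[c] - np.max(np.delete(logits, c)))
    min_margins.append(m)

spread = max(min_margins) - min(min_margins)
print(len(min_margins), round(spread, 3))
```

With perfectly symmetric class geometry the spread would be zero; the heterogeneous scales are what make some assignments strictly better than others, mirroring the non-conservative behavior described above.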
5. TPT Under Imbalance: Limitations and Remedies
Classical remedies for class imbalance (weighted cross-entropy, additive logit shifts) cannot influence class margins during TPT. Gradient descent in TPT converges, up to rescaling, to the SVM boundary irrespective of additive weights. Only multiplicative logit scaling, as in cost-sensitive SVM adjustments, can shape inter-class margin ratios.
The Vector-Scaling (VS) loss (Kini et al., 2021) unifies multiplicative and additive adjustments,
$\ell_{\mathrm{VS}}(y, f(x)) = -\,\omega_y \log \frac{e^{\Delta_y f_y(x) + \iota_y}}{\sum_{c=1}^{C} e^{\Delta_c f_c(x) + \iota_c}},$
where the multiplicative factors $\Delta_c$ shape asymptotic TPT margins and the additive terms $\iota_c$ tune early convergence. Under overparameterization, class-dependent margin ratios can be dialed to optimize balanced error or fairness metrics; sharp asymptotic theory on Gaussian mixtures quantifies these trade-offs (Kini et al., 2021).
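A direct implementation of the VS loss is short; the sketch below follows the form in Kini et al. (2021), with the specific $\Delta$ and $\iota$ values chosen purely for illustration.

```python
import numpy as np

# Vector-Scaling (VS) loss: multiplicative logit factors Delta_c,
# additive shifts iota_c, and per-class weights omega_c. With
# Delta = 1, iota = 0, omega = 1 it reduces to plain cross-entropy.
def vs_loss(logits, labels, delta, iota, omega):
    z = logits * delta + iota                      # (N, C) adjusted logits
    z = z - z.max(axis=1, keepdims=True)           # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]
    return float(np.mean(omega[labels] * nll))

rng = np.random.default_rng(5)
logits = rng.normal(size=(6, 3))
labels = np.array([0, 1, 2, 0, 1, 2])

plain = vs_loss(logits, labels, np.ones(3), np.zeros(3), np.ones(3))
# Hypothetical imbalance adjustment favoring class 2's asymptotic margin:
tuned = vs_loss(logits, labels, np.array([1.0, 1.0, 1.5]),
                np.array([0.0, 0.0, -0.5]), np.ones(3))
print(round(plain, 4), round(tuned, 4))
```

Note the division of labor the section describes: in the interpolating limit the additive $\iota_c$ and weights $\omega_c$ wash out of the implicit-bias margins, while the multiplicative $\Delta_c$ persist and set the margin ratios.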
6. Empirical Findings and Practical Implications
Across architectures and datasets, TPT delivers:
- Generalization Improvement: On all benchmark pairs (dataset × architecture), final test accuracy exceeds the accuracy at TPT onset (Papyan et al., 2020).
- Adversarial Robustness: Perturbation norms required for adversarial misclassification increase steadily during TPT (DeepFool metric).
- Interpretability: Final classifier decisions reduce to nearest class-mean assignment, with highly symmetric feature geometry.
- Margin Growth: Empirical measures confirm monotonic margin expansion in TPT (Gao et al., 2023).
- Non-Conservative Tuning: Exploring classifier initialization and simple permutations/rotations can yield better test accuracy via more favorable margin structure.
7. Broader Significance and Recommendations
Continuing training into the terminal phase is instrumental for maximizing test accuracy and robustness. The structural regularity induced by Neural Collapse simplifies the feature space and classifier geometry, underpinning improved generalization. For practitioners, early stopping at zero training error forfeits these benefits. Moreover, classifier initialization and label–feature assignments interact with data geometry to influence non-conservative generalization properties. Explicit regularization or multiple ETF alignments may be harnessed to identify optimally generalizing solutions (Papyan et al., 2020, Gao et al., 2023, Kini et al., 2021).