Deep Double Descent: Where Bigger Models and More Data Hurt

Published 4 Dec 2019 in cs.LG, cs.CV, cs.NE, and stat.ML (arXiv:1912.02292v1)

Abstract: We show that a variety of modern deep learning tasks exhibit a "double-descent" phenomenon where, as we increase model size, performance first gets worse and then gets better. Moreover, we show that double descent occurs not just as a function of model size, but also as a function of the number of training epochs. We unify the above phenomena by defining a new complexity measure we call the effective model complexity and conjecture a generalized double descent with respect to this measure. Furthermore, our notion of model complexity allows us to identify certain regimes where increasing (even quadrupling) the number of train samples actually hurts test performance.

Citations (856)

Summary

  • The paper shows that increasing model size, training time, or even dataset size can induce a double-descent phenomenon, in which test performance first improves, then degrades near the interpolation threshold, and then improves again.
  • It empirically maps out the double-descent curve across architectures and datasets, unifying model-wise and epoch-wise descent under a proposed "effective model complexity" measure.
  • The findings identify regimes where adding training samples actually hurts test performance, prompting a re-evaluation of how model complexity and data volume are balanced.

All Auto-figures

The paper's auto-extracted figures chart an empirical exploration of the behavior of various optimization algorithms applied to deep learning models trained on the CIFAR-10 and CIFAR-100 datasets. Across a comprehensive set of experiments, the authors evaluate how different optimizers behave when paired with identical neural network architectures.

Overview

The primary models considered in this paper are a multi-channel convolutional neural network (MCNN) and ResNet-18, each evaluated under a variety of training conditions. The experiments are designed to assess the performance differences arising from the choice of optimizer, Adam or SGD, under several augmentation and perturbation settings, identified in the figures by labels such as "dyn" (dynamic adjustments), "MDD" (multi-domain decorrelation), and "ocean".

Experiment Details

The experimental subjects include the following configurations:

  • CIFAR-10 with MCNN and ResNet-18 architectures tested under non-augmentation (noaug) and varied data perturbation settings using both Adam and SGD optimizers.
  • CIFAR-100 with ResNet-18 architecture similarly evaluated under the same conditions for deeper insights.

Each configuration was run with three random seeds (0, 1, and 2) to gauge run-to-run variability, which is essential for drawing robust statistical inferences about the optimization dynamics.
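The configuration grid above can be sketched as a run manifest. This is a hypothetical illustration, not the authors' code; the dataset, architecture, optimizer, and augmentation labels simply mirror the names used in the figures.

```python
# Hypothetical run manifest for the experiment grid described in the text:
# CIFAR-10 with MCNN and ResNet-18, CIFAR-100 with ResNet-18, each crossed
# with two optimizers, two augmentation settings, and seeds 0-2.
datasets_archs = [
    ("cifar10", "mcnn"),
    ("cifar10", "resnet18"),
    ("cifar100", "resnet18"),
]
optimizers = ["adam", "sgd"]
aug_settings = ["noaug", "aug"]
seeds = [0, 1, 2]

runs = [
    {"dataset": ds, "arch": arch, "opt": opt, "aug": aug, "seed": seed}
    for ds, arch in datasets_archs
    for opt in optimizers
    for aug in aug_settings
    for seed in seeds
]
# 3 (dataset, arch) pairs x 2 optimizers x 2 aug settings x 3 seeds = 36 runs
```

Enumerating the grid up front like this makes it easy to verify that every optimizer/augmentation cell is covered by the same number of seeds.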

Key Results

Several noteworthy performance trends emerge from the experiments:

  1. Optimizer Comparisons: Adam and SGD are contrasted across all configurations. Adam generally converges faster in the early stages of training but sometimes underperforms on validation sets, suggesting overfitting.
  2. Data Augmentation: The augmentation strategies, notably the "dyn" and "ocean" settings, have a significant impact on model generalization. The "dyn" augmentations tend to offer better robustness, likely because they emphasize perturbation diversity.
  3. Multi-Domain Decorrelation: Applying MDD yields performance gains that vary with the base model and optimizer, indicating that the benefit of decorrelation techniques is highly model- and optimizer-dependent.
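The two update rules contrasted in point 1 can be written out side by side. This is a minimal NumPy sketch on a toy quadratic, not the paper's training code; it shows the mechanical difference (Adam's per-coordinate normalization with bias-corrected moment estimates versus SGD's plain gradient step), not a claim about which wins on CIFAR.

```python
import numpy as np

# Toy quadratic loss f(w) = 0.5 * sum(h * w**2); h sets per-coordinate curvature.
h = np.array([1.0, 2.0])
loss = lambda w: 0.5 * float(np.sum(h * w * w))
grad = lambda w: h * w

def run_sgd(w0, lr=0.1, steps=100):
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(steps):
        w -= lr * grad(w)          # plain gradient step
    return w

def run_adam(w0, lr=0.1, steps=100, b1=0.9, b2=0.999, eps=1e-8):
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g          # first-moment (momentum) estimate
        v = b2 * v + (1 - b2) * g * g      # second-moment estimate
        m_hat = m / (1 - b1 ** t)          # bias correction
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)  # normalized step
    return w

w0 = np.array([1.0, 1.0])
w_sgd = run_sgd(w0)
w_adam = run_adam(w0)
```

Because Adam's step magnitude is roughly the learning rate regardless of gradient scale, its early progress is less sensitive to curvature differences between coordinates, which is one common explanation for its fast initial convergence.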

Implications and Future Directions

The study emphasizes the nuanced interplay between optimization algorithms and data augmentation techniques. The insights into Adam's rapid convergence versus SGD's longer-term generalizability could inform the strategic use of these optimizers in pipeline designs for large-scale model training.

Future research could extend these findings by exploring other architectural innovations and more diverse datasets. Furthermore, a deeper theoretical understanding of why certain augmentations and optimization techniques synergize effectively would be valuable. This might involve more granular loss landscape analyses or the integration of explainability frameworks.

In summary, this paper provides empirical evidence that highlights the differential impacts of optimization strategies and data augmentations. These findings could influence both practical applications and theoretical research, promoting the development of more robust and efficient training paradigms in the field of deep learning.
