
Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks

Published 24 Jan 2019 in cs.LG, cs.NE, and stat.ML | (1901.08584v2)

Abstract: Recent works have cast some light on the mystery of why deep nets fit any data and generalize despite being very overparametrized. This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works: (i) Using a tighter characterization of training speed than recent papers, an explanation for why training a neural net with random labels leads to slower training, as originally observed in [Zhang et al. ICLR'17]. (ii) Generalization bound independent of network size, using a data-dependent complexity measure. Our measure distinguishes clearly between random labels and true labels on MNIST and CIFAR, as shown by experiments. Moreover, recent papers require sample complexity to increase (slowly) with the size, while our sample complexity is completely independent of the network size. (iii) Learnability of a broad class of smooth functions by 2-layer ReLU nets trained via gradient descent. The key idea is to track dynamics of training and generalization via properties of a related kernel.

Citations (928)

Summary

  • The paper demonstrates that training dynamics in overparameterized two-layer ReLU networks follow a power method-like iteration, yielding distinct convergence speeds for true versus random labels.
  • It introduces a generalization bound independent of network width, using a data-dependent complexity measure to effectively distinguish label quality on datasets like MNIST and CIFAR.
  • The study confirms that gradient descent can learn broad classes of smooth functions, offering practical insights for optimizing neural architecture and parameter design.


The paper "Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks" by Sanjeev Arora et al. provides an in-depth analysis of the training and generalization properties of overparameterized two-layer ReLU neural networks. The focus is on understanding why these networks can fit any data and generalize well despite a high number of parameters.

Key Contributions

The paper offers several key contributions:

  1. Training Speed Characterization: It provides a tighter characterization of training speed than previous works, explaining why training neural networks with random labels leads to slower convergence compared to networks trained with true labels.
  2. Generalization Bound Independent of Network Size: The work introduces a generalization bound based on a data-dependent complexity measure that does not depend on network size. This measure can clearly distinguish true labels from random labels in datasets such as MNIST and CIFAR.
  3. Learnability of Smooth Functions: The paper demonstrates the learnability of a broad class of smooth functions using two-layer ReLU networks trained via gradient descent.

Analytical Framework

Optimization Analysis

The authors analyze the optimization trajectory of the neural network by tracking the dynamics of training. For the setting of a two-layer ReLU network, the updates effectively reduce to a power method iteration involving a specific Gram matrix derived from the data. Formally, the core dynamics are represented as:

$$\tilde{\bm{u}}(k+1) = \tilde{\bm{u}}(k) - \eta \mathbf{H}^\infty \left(\tilde{\bm{u}}(k) - \bm{y}\right)$$

Here, $\mathbf{H}^\infty$ is the Gram matrix of the kernel induced by an infinitely wide, randomly initialized ReLU layer; for unit-norm inputs its entries are $\mathbf{H}^\infty_{ij} = \frac{\bm{x}_i^\top \bm{x}_j \left(\pi - \arccos(\bm{x}_i^\top \bm{x}_j)\right)}{2\pi}$. The paper proves that the training loss at iteration $k$ is closely approximated by:

$$\Phi(\mathbf{W}(k)) \approx \frac{1}{2} \left\|(\mathbf{I} - \eta \mathbf{H}^\infty)^k \bm{y}\right\|_2^2$$

This rigorous formulation allows the authors to explain why true labels result in faster convergence: true labels are better aligned with the dominant eigenvectors of $\mathbf{H}^\infty$, and the components of the residual along large-eigenvalue directions contract fastest under gradient descent.
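Because the dynamics are linear in the residual, the loss decay can be read off the eigendecomposition of the Gram matrix: starting from $\tilde{\bm{u}}(0) = \bm{0}$, the loss at step $k$ is $\frac{1}{2}\sum_i (1 - \eta\lambda_i)^{2k} (\bm{v}_i^\top \bm{y})^2$. A minimal NumPy sketch of this effect (illustrative only; a random PSD matrix stands in for $\mathbf{H}^\infty$):

```python
import numpy as np

# Sketch of the linearized dynamics u(k+1) = u(k) - eta * H (u(k) - y):
# with u(0) = 0 the residual is (I - eta*H)^k y, so the loss decomposes
# over eigenpairs (lam_i, v_i) of H as 0.5 * sum_i (1-eta*lam_i)^(2k) (v_i.y)^2.
rng = np.random.default_rng(1)
n = 50
A = rng.standard_normal((n, n))
H = A @ A.T / n                      # random PSD stand-in for the Gram matrix
lam, V = np.linalg.eigh(H)           # eigenvalues in ascending order
eta = 0.9 / lam[-1]                  # step size below 1/lambda_max

y_aligned = V[:, -1]                 # labels on the top eigenvector
y_random = rng.standard_normal(n)
y_random /= np.linalg.norm(y_random) # random labels, same norm

def loss_after(y, k):
    # 0.5 * ||(I - eta*H)^k y||^2 computed in the eigenbasis
    proj = V.T @ y
    return 0.5 * np.sum(((1 - eta * lam) ** k * proj) ** 2)

# Labels aligned with the dominant eigenvector train far faster
assert loss_after(y_aligned, 50) < loss_after(y_random, 50)
```

Random labels spread their energy across all eigendirections, including ones with tiny eigenvalues whose factors $(1 - \eta\lambda_i)^{2k}$ stay near 1, which is exactly the slow-training phenomenon the paper analyzes.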

Generalization Bound

The paper constructs a generalization bound of the form:

$$L_{\mathcal{D}}(f_{\mathbf{W}(k), \mathbf{a}}) \le \sqrt{\frac{2\, \bm{y}^\top (\mathbf{H}^\infty)^{-1} \bm{y}}{n}} + O\!\left( \sqrt{\frac{\log(n/\delta)}{n}} \right)$$

This result hinges on controlling the parameter movements during training, utilizing a specific matrix perturbation analysis. Importantly, the bound is derived without dependence on the network width $m$, which marks a significant departure from common sample complexity analyses that usually scale with model size.
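The data-dependent term $\bm{y}^\top (\mathbf{H}^\infty)^{-1} \bm{y}$ can be computed directly from the training set. A minimal sketch (function names are my own; it uses the paper's closed-form entry $\mathbf{H}^\infty_{ij} = \bm{x}_i^\top \bm{x}_j \left(\pi - \arccos(\bm{x}_i^\top \bm{x}_j)\right)/(2\pi)$ for unit-norm inputs):

```python
import numpy as np

def relu_gram_matrix(X):
    """Infinite-width ReLU Gram matrix H^inf for unit-norm rows of X:
    H_ij = x_i.x_j * (pi - arccos(x_i.x_j)) / (2*pi)."""
    G = np.clip(X @ X.T, -1.0, 1.0)   # guard against rounding outside [-1, 1]
    return G * (np.pi - np.arccos(G)) / (2 * np.pi)

def complexity_measure(H, y):
    """sqrt(2 * y^T H^{-1} y / n), the data-dependent term in the bound;
    solves H c = y instead of forming the inverse explicitly."""
    return np.sqrt(2.0 * float(y @ np.linalg.solve(H, y)) / len(y))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # project inputs to unit norm
H = relu_gram_matrix(X)

# "True-label-like" targets aligned with the top eigenvector of H^inf,
# versus random labels of the same norm.
lam, V = np.linalg.eigh(H)
y_true_like = 10 * V[:, -1]
y_random = rng.standard_normal(100)
y_random *= 10 / np.linalg.norm(y_random)

assert complexity_measure(H, y_true_like) < complexity_measure(H, y_random)
```

Labels concentrated on the top eigenvectors of $\mathbf{H}^\infty$ yield a small measure, while random labels load the small-eigenvalue directions and inflate it, mirroring the paper's separation between true and random labels.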

Implications and Experimental Validation

The authors validate their complexity measure by demonstrating its effectiveness in distinguishing true labels from random labels on the MNIST and CIFAR datasets. They show that this measure can predict the generalization capability of the network before training begins, offering a practical tool for guiding architecture and parameter choices.

Future Directions

The paper lays the foundation for a more nuanced understanding of overparameterized neural networks, with future developments potentially exploring:

  • Extending the framework to deeper or more complex architectures.
  • Investigating the effects of different initialization schemes and learning rates on training dynamics and generalization.
  • Exploring other forms of data-dependent complexity measures and their empirical validation across diverse datasets and tasks.

Conclusion

This work provides a critical step towards demystifying the enigmatic generalization behaviors of overparameterized networks. The fine-grained analyses of both optimization dynamics and generalization bounds offer substantial advancements over prior theoretical models, potentially influencing future research in neural network training and validation methodologies.
