Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss

Published 11 Feb 2020 in math.OC, cs.LG, and stat.ML (arXiv:2002.04486v4)

Abstract: Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks. Towards understanding this phenomenon, we analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations. We show that the limits of the gradient flow on exponentially tailed losses can be fully characterized as a max-margin classifier in a certain non-Hilbertian space of functions. In the presence of hidden low-dimensional structures, the resulting margin is independent of the ambient dimension, which leads to strong generalization bounds. In contrast, training only the output layer implicitly solves a kernel support vector machine, which a priori does not enjoy such an adaptivity. Our analysis of training is non-quantitative in terms of running time, but we prove computational guarantees in simplified settings by showing equivalences with online mirror descent. Finally, numerical experiments suggest that our analysis describes well the practical behavior of two-layer neural networks with ReLU activation and confirm the statistical benefits of this implicit bias.

Citations (315)

Summary

  • The paper demonstrates that gradient descent in wide two-layer networks converges to max-margin classifiers over non-Hilbertian spaces.
  • It employs a Wasserstein gradient flow framework to analyze training dynamics and derives dimension-independent generalization bounds for low-dimensional data.
  • Experimental comparisons reveal that training both layers yields behavior distinct from output-only training, surpassing traditional kernel methods.

Implicit Bias of Wide Two-layer Neural Networks

The paper "Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss" explores the training dynamics and generalization behavior of wide two-layer neural networks with homogeneous activations, such as ReLU, trained by gradient descent on the logistic loss or, more generally, on losses with exponential tails. The authors aim to explain why these networks perform well in over-parameterized settings where standard learning theory would predict overfitting.

Key Contributions and Findings

  1. Implicit Bias and Max-Margin Classifiers: The study characterizes the limit behavior of gradient descent as reaching a max-margin solution over non-Hilbertian functional spaces, particularly the variation norm space for infinitely wide two-layer neural networks. The results highlight that in settings with low-dimensional structures within the data, the resulting margin does not depend on the ambient dimension, leading to potentially strong generalization bounds.
  2. Gradient Flow Characterization: It is shown that the gradient flow of an over-parameterized two-layer neural network can be viewed as a Wasserstein gradient flow. This provides a framework to understand and analyze the dynamics in the infinite width limit.
  3. Comparison with Output Layer Training: When only the output layer is trained, the network implicitly solves a kernel support vector machine (SVM) whose kernel is induced by the random hidden features at initialization. The contrast in implicit bias between training only the output layer and training both layers is highlighted, with the former aligning with traditional kernel methods.
  4. Numerical and Statistical Observations: Numerical experiments conducted with two-layer ReLU networks validate the statistical efficiency of the implicit bias towards max-margin classifiers in high-dimensional spaces. The experiments suggest significant performance benefits and efficiency due to the implicit bias introduced by full neural network training.
  5. Generalization Bounds: Generalization bounds are derived, showing dimension-independent bounds in scenarios where data have hidden low-dimensional structures. These bounds argue for favorable generalization behavior when training two-layer neural networks in high dimensions, relative to standard kernel methods.
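Schematically, the limiting object of point 1 can be written as a max-margin problem over the variation-norm ball (stated here in one common formulation; the paper's precise normalization and homogeneity conventions differ):

```latex
\[
  \max_{\|f\|_{\mathcal{F}_1} \le 1}\ \min_{1 \le i \le n} y_i f(x_i),
  \qquad
  \|f\|_{\mathcal{F}_1}
  \;=\;
  \inf\Bigl\{\, |\mu|(\mathbb{S}^{d-1}) \;:\;
     f(\cdot) = \int_{\mathbb{S}^{d-1}} \sigma(\theta^{\top}\cdot)\,\mathrm{d}\mu(\theta) \,\Bigr\}.
\]
```

Replacing the $\mathcal{F}_1$ (variation) norm with the RKHS norm of the corresponding random-feature kernel yields the kernel SVM that output-layer-only training implicitly solves, which is exactly the contrast drawn in point 3.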

Theoretical and Practical Implications

This research offers a nuanced understanding of why wide two-layer neural networks generalize well despite being over-parameterized. Practically, it suggests that such networks naturally exploit low-dimensional structures in data, thereby achieving effective learning without overfitting. Theoretically, it connects neural network training dynamics to broader optimization concepts, such as gradient descent achieving max-margin classifiers.
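The kernel-versus-feature-learning contrast above can be probed with a small experiment. The NumPy sketch below (all sizes, step sizes, and initializations are illustrative choices, not the paper's experimental settings) trains a two-layer ReLU network with the logistic loss on data whose labels depend on a single coordinate of a 20-dimensional input, either training both layers or only the output layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with hidden low-dimensional structure: the label depends only on
# the first coordinate of a 20-dimensional input.
d, n, m = 20, 200, 512                       # ambient dim, samples, width
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0])

def net(W, a, X):
    """Two-layer ReLU network f(x) = a . relu(W x)."""
    return np.maximum(X @ W.T, 0.0) @ a

def logistic_loss(f, y):
    return np.mean(np.logaddexp(0.0, -y * f))

def train(both_layers, steps=3000, lr=0.5):
    """Full-batch gradient descent on the logistic loss.

    both_layers=False trains only the output weights `a` (a random-features
    model, i.e. the kernel regime); both_layers=True also trains the hidden
    weights `W` (the feature-learning regime the paper analyzes).
    """
    W = rng.standard_normal((m, d)) / np.sqrt(d)
    a = rng.standard_normal(m) / m
    for _ in range(steps):
        H = np.maximum(X @ W.T, 0.0)                    # hidden activations
        f = H @ a
        g = -y / (1.0 + np.exp(np.clip(y * f, -30, 30))) / n  # dLoss/df
        if both_layers:
            dH = np.outer(g, a) * (H > 0)               # back-prop through ReLU
            W -= lr * dH.T @ X
        a -= lr * H.T @ g
    return W, a

for both in (False, True):
    W, a = train(both)
    f = net(W, a, X)
    print(f"both_layers={both}: loss={logistic_loss(f, y):.4f}, "
          f"train margin={np.min(y * f):.4f}")
```

Tracking the (normalized) training margin over much longer horizons, and on data where the relevant subspace is genuinely low-dimensional, is the kind of comparison the paper's experiments make between the two regimes.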

Open Problems and Future Directions

  1. Runtime and Convergence Rates: The paper establishes asymptotic properties of the gradient flow but leaves open questions regarding convergence rates and the exact runtime required to achieve the implicit bias in practice. Future work is suggested to make these results quantitative with respect to the number of neurons and iterations.
  2. Beyond Simplified Settings: Extending the present findings to more complex architectures, including deeper networks and different loss functions, remains an open challenge. Additional work could focus on exploring convex relaxation techniques and their potential adaptations in non-convex neural network settings.
  3. Empirical Validation: While theoretical foundations are laid, further empirical validation across a diverse range of datasets and neural network architectures could enhance practical understanding and adoption of these insights.
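On point 1, the quantitative guarantees the abstract mentions rest on equivalences with online mirror descent in simplified settings. For background, here is a textbook mirror-descent update with the entropic mirror map (exponentiated gradient) on the probability simplex; this is a generic sketch of the update family, not the paper's specific construction:

```python
import numpy as np

def exponentiated_gradient(grad_fn, dim, steps=500, lr=0.1):
    """Mirror descent with entropic mirror map over the probability simplex."""
    p = np.full(dim, 1.0 / dim)            # uniform starting point
    for _ in range(steps):
        p = p * np.exp(-lr * grad_fn(p))   # multiplicative (entropic) update
        p /= p.sum()                       # renormalize onto the simplex
    return p

# Example: minimize the linear objective <c, p> over the simplex;
# the iterates concentrate on the coordinate with the smallest cost.
c = np.array([3.0, 1.0, 2.0])
p = exponentiated_gradient(lambda p: c, dim=3)
```

The multiplicative form of the update is what makes such dynamics tractable to analyze quantitatively, in contrast to the non-quantitative gradient-flow limit used in the main results.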

In summary, the paper advances the understanding of implicit biases in neural network training, bridging foundational optimization insights with practical generalization performance, and setting a path for systematic exploration of these phenomena in more complex settings within machine learning and neural network theory.

Authors (2)
