- The paper demonstrates that dense neural networks contain sparse, trainable subnetworks ('winning tickets') that achieve performance comparable to the full model.
- It introduces an iterative pruning and weight-resetting algorithm that identifies efficient subnetworks with only 10-20% of the original parameters.
- Empirical results on MNIST and CIFAR-10 highlight that beneficial initializations enable faster training and improved accuracy in these winning tickets.
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
Introduction
This paper introduces the Lottery Ticket Hypothesis, which posits that dense, randomly-initialized neural networks contain smaller subnetworks (referred to as "winning tickets") that can be trained in isolation to reach test accuracy comparable to that of the original network. The key insight is that these winning tickets owe their trainability to their initialization: their initial weights happen to be particularly amenable to learning. The researchers develop an algorithm that identifies winning tickets by pruning the trained network and resetting the surviving weights to their initial values. Experiments demonstrate that these subnetworks, trained in isolation, learn faster than the dense network and match or exceed its test accuracy.
Methodology
The paper outlines an experimental procedure based on iterative magnitude pruning with weight resetting. A network is trained, a percentage of its smallest-magnitude weights is pruned, and the surviving weights are reset to their original initial values; this cycle repeats until the smallest viable winning ticket is found. In effect, the algorithm searches for subnetworks with a favorable combination of architecture and initialization, hypothesizing that such subnetworks can train effectively from the outset. This iterative strategy finds winning tickets with fewer parameters than one-shot pruning, at the cost of repeated training runs.
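The pruning-and-resetting loop described above can be sketched on a toy linear-regression model. This is a minimal, illustrative stand-in for the paper's networks; the model, hyperparameters, and function names here are assumptions for demonstration, not the authors' code:

```python
import numpy as np

def train(w, mask, X, y, lr=0.1, steps=200):
    """Gradient descent on MSE for the masked linear model X @ (w * mask)."""
    for _ in range(steps):
        grad = X.T @ (X @ (w * mask) - y) / len(y)
        w -= lr * grad * mask  # only surviving weights are updated
    return w

def find_winning_ticket(X, y, rounds=3, prune_frac=0.2, seed=0):
    """Toy sketch of iterative magnitude pruning with weight resetting."""
    rng = np.random.default_rng(seed)
    w_init = rng.normal(scale=0.1, size=X.shape[1])  # saved initialization
    mask = np.ones_like(w_init)
    for _ in range(rounds):
        # 1. Train the current subnetwork from the original initialization.
        w = train(w_init.copy(), mask, X, y)
        # 2. Prune prune_frac of the surviving weights with smallest magnitude.
        alive = np.flatnonzero(mask)
        cut = alive[np.argsort(np.abs(w[alive]))[: int(len(alive) * prune_frac)]]
        mask[cut] = 0.0
        # 3. Reset: survivors return to w_init (the step that defines a ticket).
    return w_init * mask, mask
```

Training the returned ticket (the surviving weights at their original values, under the same mask) then recovers the full model's fit. In the paper's control experiment, replacing the saved `w_init` with freshly sampled weights at this point is what destroys the effect.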
Results
Several architectures, including fully-connected networks on MNIST and convolutional networks on CIFAR-10, are empirically shown to contain winning tickets that learn faster and reach higher test accuracy than their dense counterparts. These tickets can comprise as little as 10-20% of the original parameter count while matching or exceeding the original network's performance. Importantly, randomly reinitializing a winning ticket's weights negates these benefits, underscoring that the specific initialization, not just the sparse architecture, is what matters.
Insights into Initialization and Structure
The paper presents evidence that winning tickets are distinguished by particularly beneficial initializations. Analysis suggests that weights in successful winning tickets change more during training than those in other subnetworks, hinting that they start in a region of the loss landscape favorable for optimization. The initial weights thus appear to be a critical component of a ticket's success, suggesting that the initialization interacts with the optimizer to make training effective.
Implications
The Lottery Ticket Hypothesis provides valuable insights into neural network optimization and architecture design. It suggests that networks contain smaller, effective representations, challenging the necessity for overparameterization seen in contemporary models. This perspective opens avenues for redesigning training schemes to focus on identifying and capitalizing on winning tickets from the outset. Additionally, understanding these inductive biases may lead to the development of improved initialization schemes or insights into the theoretical foundations of optimization and generalization in neural networks.
Limitations and Future Directions
This study is limited to vision classification tasks on smaller datasets (MNIST and CIFAR-10). The computational cost of iterative pruning, which requires training the network many times, makes the approach prohibitive on larger datasets such as ImageNet. Future research may pursue more efficient methods for identifying winning tickets, evaluate the hypothesis in more diverse settings, and explore how winning tickets interact with dropout and other regularization techniques. Structured pruning methods better suited to modern hardware may also build on these findings.
Conclusion
The Lottery Ticket Hypothesis reshapes our understanding of neural network training by demonstrating that dense networks harbor sparse, highly trainable subnetworks. This paradigm is significant in advancing efficient network design and training methodologies, offering a promising pathway for optimizing performance with reduced computational resources. The concept invites further exploration into the mechanisms of successful network initializations and the broader implications for machine learning theory and practice.