Maxwell's Demon at Work: Efficient Pruning by Leveraging Saturation of Neurons
Abstract: When training deep neural networks, the phenomenon of $\textit{dying neurons}$ (units that become inactive or saturated and output zero during training) has traditionally been viewed as undesirable, linked with optimization challenges and contributing to plasticity loss in continual learning scenarios. In this paper, we reassess the phenomenon, focusing on sparsity and pruning. By systematically exploring the impact of various hyperparameter configurations on dying neurons, we unveil their potential to facilitate simple yet effective structured pruning algorithms. We introduce $\textit{Demon Pruning}$ (DemP), a method that controls the proliferation of dead neurons and thereby dynamically induces network sparsity. Achieved through a combination of noise injection on active units and a one-cycle regularization schedule, DemP stands out for its simplicity and broad applicability. Experiments on CIFAR-10 and ImageNet demonstrate that DemP surpasses existing structured pruning techniques, achieving superior accuracy-sparsity tradeoffs and training speedups. These findings suggest a novel perspective on dying neurons as a valuable resource for efficient model compression and optimization.
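The abstract does not spell out the mechanics of DemP, so the following is only a minimal sketch of how such a training-time loop might look, assuming a triangular one-cycle schedule for the regularization strength, Gaussian noise added to the gradients of active units only, and a zero-activation test over a batch to flag dead units. The function names (`one_cycle_coeff`, `dead_mask`, `noisy_grads`) and all constants are hypothetical illustrations, not the paper's API.

```python
import numpy as np

# Sketch of the ingredients named in the abstract; the exact noise model,
# schedule shape, and dead-neuron criterion are assumptions, not the
# authors' specification.

rng = np.random.default_rng(0)

def one_cycle_coeff(step, total_steps, peak=1e-4):
    """Triangular one-cycle schedule for the regularization strength:
    ramp up to `peak` at mid-training, then back down to zero."""
    frac = step / total_steps
    return peak * (2 * frac if frac < 0.5 else 2 * (1 - frac))

def dead_mask(activations, tol=0.0):
    """Flag a unit as dead if it never fires above `tol` across an
    evaluation batch (axis 0 is the batch dimension)."""
    return np.max(np.abs(activations), axis=0) <= tol

def noisy_grads(grads, active_mask, sigma=1e-3):
    """Perturb the gradients of *active* units only; dead units are left
    untouched so they stay saturated and hence prunable."""
    noise = sigma * rng.standard_normal(grads.shape)
    return grads + noise * active_mask

# Toy usage: one dense ReLU layer on random data.
x = rng.standard_normal((64, 32))        # batch of 64 inputs
W = rng.standard_normal((32, 16))        # weights for 16 hidden units
acts = np.maximum(0.0, x @ W)            # ReLU activations
dead = dead_mask(acts)                   # boolean mask, shape (16,)

grads = rng.standard_normal(W.shape)     # stand-in for dW from backprop
grads = noisy_grads(grads, ~dead)        # noise nudges only active units
lam = one_cycle_coeff(step=500, total_steps=1000)  # current reg. strength

W = W[:, ~dead]                          # structured pruning: drop dead units
```

Because pruning removes whole units (columns of `W`) rather than individual weights, the resulting network stays dense in shape, which is what makes this kind of structured sparsity translate into actual training and inference speedups.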