Compressing the Backward Pass of Large-Scale Neural Architectures by Structured Activation Pruning

Published 28 Nov 2023 in cs.LG and cs.PF (arXiv:2311.16883v2)

Abstract: The rise of Deep Neural Networks (DNNs) has led to ever larger and more complex models, straining the memory capacity of GPUs. Sparsity in DNNs, whether structural or ephemeral, has gained attention as a remedy. This work focuses on ephemeral sparsity, aiming to reduce memory consumption during training. It emphasizes the significance of activations, an often overlooked component of the memory footprint, and their role in memory usage. We employ structured pruning in Block Sparse Compressed Row (BSR) format together with a magnitude-based criterion to prune activations efficiently. We further introduce efficient block-sparse operators for GPUs and demonstrate their effectiveness, as well as the superior compression offered by block sparsity. We evaluate activation pruning in terms of training speed, accuracy, and memory usage of large-scale neural architectures, using ResMLP on image classification tasks as an example. We observe a memory reduction of up to 32% while maintaining accuracy. Ultimately, our approach aims to democratize large-scale model training, reduce GPU requirements, and address ecological concerns.
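The paper's GPU operators are not reproduced here, but the core idea of the abstract, i.e. ranking fixed-size activation blocks by magnitude, zeroing the weakest, and storing the survivors in BSR layout, can be sketched in NumPy. The function name, block scoring by mean absolute value, and the keep-ratio parameter below are our own illustrative choices, not the paper's exact criterion:

```python
import numpy as np

def prune_activations_bsr(act, block_size=4, keep_ratio=0.5):
    """Magnitude-based structured pruning of a 2-D activation matrix.

    Blocks of shape (block_size, block_size) are ranked by mean absolute
    magnitude; only the top `keep_ratio` fraction survives. Returns the
    pruned dense matrix and BSR-style arrays (data, indices, indptr).
    """
    rows, cols = act.shape
    assert rows % block_size == 0 and cols % block_size == 0
    br, bc = rows // block_size, cols // block_size

    # View the matrix as a grid of blocks: shape (br, bc, bs, bs).
    blocks = act.reshape(br, block_size, bc, block_size).swapaxes(1, 2)
    scores = np.abs(blocks).mean(axis=(2, 3))  # one score per block

    # Keep the n_keep highest-scoring blocks; zero out the rest.
    n_keep = max(1, int(keep_ratio * br * bc))
    threshold = np.sort(scores, axis=None)[-n_keep]
    mask = scores >= threshold

    # Assemble BSR arrays: surviving blocks stored block-row by block-row,
    # with column indices and a row-pointer array, as in CSR but per block.
    data, indices, indptr = [], [], [0]
    for i in range(br):
        for j in range(bc):
            if mask[i, j]:
                data.append(blocks[i, j])
                indices.append(j)
        indptr.append(len(indices))

    pruned = (blocks * mask[:, :, None, None]).swapaxes(1, 2).reshape(rows, cols)
    return pruned, (np.array(data), np.array(indices), np.array(indptr))
```

Storing only the surviving blocks plus two small index arrays is what yields the memory savings the abstract reports: with a 50% keep ratio, roughly half of the activation memory held for the backward pass is freed, at the cost of the approximation error introduced by discarding low-magnitude blocks.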

Authors (2)