Training Transformers with 4-bit Integers

Published 21 Jun 2023 in cs.LG and cs.NE (arXiv:2306.11987v2)

Abstract: Quantizing the activation, weight, and gradient to 4 bits is promising for accelerating neural network training. However, existing 4-bit training methods require custom numerical formats which are not supported by contemporary hardware. In this work, we propose a training method for transformers with all matrix multiplications implemented with INT4 arithmetic. Training at an ultra-low INT4 precision is challenging. To achieve this, we carefully analyze the specific structures of activations and gradients in transformers and propose dedicated quantizers for them. For forward propagation, we identify the challenge of outliers and propose a Hadamard quantizer to suppress them. For backpropagation, we leverage the structural sparsity of gradients by proposing bit splitting and leverage score sampling techniques to quantize gradients accurately. Our algorithm achieves competitive accuracy on a wide range of tasks including natural language understanding, machine translation, and image classification. Unlike previous 4-bit training methods, our algorithm can be implemented on the current generation of GPUs. Our prototypical linear operator implementation is up to 2.2 times faster than its FP16 counterpart and speeds up training by up to 35.1%.


Summary

  • The paper introduces an INT4 quantization approach that leverages Hadamard and bit splitting techniques, reducing training time by up to 35.1% without significant accuracy loss.
  • It employs specialized quantizers for forward and backward propagation, resulting in up to 2.2 times faster linear operations on modern GPUs.
  • The research sets a precedent for ultra-low precision neural network training, promising advancements in efficient AI models for resource-constrained environments.

An Analysis of "Training Transformers with 4-bit Integers"

The paper "Training Transformers with 4-bit Integers" by Haocheng Xi et al. presents a novel approach to training transformer models with 4-bit integer (INT4) arithmetic. By minimizing the numerical precision of the matrix multiplications that dominate training, the authors aim to improve efficiency without a significant loss in accuracy. Because the method uses only INT4 operations supported by contemporary GPUs, rather than custom numerical formats, it is implementable on current hardware and yields substantial computational benefits.

Methodology and Techniques

The core challenge the authors address is reducing numerical precision to 4 bits while maintaining competitive accuracy. Training traditionally relies on FP32 arithmetic, or FP16/FP32 mixed precision, both of which carry substantial compute and memory costs; lowering the precision of matrix multiplications to INT4 reduces both.
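To make the precision trade-off concrete, here is a minimal NumPy sketch of symmetric per-tensor INT4 quantization. This is an illustration only, not the paper's quantizer: function names and the per-tensor scaling rule are assumptions for clarity.

```python
import numpy as np

def quantize_int4(x):
    """Symmetric per-tensor quantization into the 16-level INT4 range [-8, 7]."""
    max_abs = np.abs(x).max()
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # INT4 values are stored in int8 lanes here; real kernels pack two per byte.
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale):
    return q.astype(np.float32) * scale

x = np.random.default_rng(0).standard_normal((4, 8)).astype(np.float32)
q, s = quantize_int4(x)
x_hat = dequantize_int4(q, s)
# round-to-nearest guarantees |x - x_hat| <= s / 2 elementwise
```

With only 16 representable levels, a single large outlier inflates the scale `s` and crushes all other values toward zero, which is precisely the failure mode the paper's dedicated quantizers are designed around.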

One of the significant contributions of the work is the design of dedicated quantizers for transformers. These quantizers address both forward and backward propagations:

  • Forward Propagation: The authors use a Hadamard quantizer to address activation outliers: entries far larger than the typical magnitude, which would otherwise consume most of the limited INT4 range and degrade training. By transforming activations with a block-diagonal Hadamard matrix, they spread each outlier's magnitude across nearby entries, reducing its impact on quantization.
  • Backward Propagation: For gradient computation, the authors exploit the structural sparsity commonly found in gradient matrices, employing bit splitting and leverage score sampling. Gradients are notoriously difficult to represent accurately at low precision; these techniques concentrate quantization fidelity on the most significant gradients while conserving computation.
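Both ideas can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's implementation: the block size, the function names, and the norm-proportional row sampler (a common, simpler stand-in for true leverage score sampling) are assumptions made for clarity.

```python
import numpy as np

def hadamard(k):
    """Orthonormal k x k Hadamard matrix via the Sylvester construction (k a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < k:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(k)

def hadamard_transform(x, k=4):
    """Block-diagonal Hadamard transform along the last axis (length divisible by k)."""
    H = hadamard(k)
    shape = x.shape
    return (x.reshape(-1, k) @ H.T).reshape(shape)

# A row of activations with one outlier: the transform spreads its magnitude
# across the block, shrinking the dynamic range before INT4 quantization.
# H is orthonormal and symmetric, so applying the transform again undoes it.
x = np.array([100.0, 1.0, -1.0, 2.0])
y = hadamard_transform(x)            # max |y| is ~51, versus 100 before

def sample_rows(G, m, rng):
    """Sample m rows of G with probability proportional to squared row norms,
    rescaled so that E[S.T @ S] = G.T @ G (an unbiased sketch of the Gram matrix)."""
    norms = (G ** 2).sum(axis=1)
    p = norms / norms.sum()
    idx = rng.choice(G.shape[0], size=m, p=p)
    return G[idx] / np.sqrt(m * p[idx])[:, None]

rng = np.random.default_rng(0)
G = rng.standard_normal((256, 8))    # stand-in for a gradient matrix
S = sample_rows(G, 128, rng)         # half the rows, importance-weighted
```

In the paper's setting, bit splitting additionally decomposes each gradient into high and low 4-bit parts so both can be fed through INT4 matmuls; the sampler above conveys only the importance-sampling idea.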

Empirical Results

The proposed methods deliver strong results across a broad range of transformer-based tasks, including natural language understanding, machine translation, and image classification. Notably, the INT4 linear operators on GPUs ran up to 2.2 times faster than their FP16 counterparts, with overall training time reduced by up to 35.1%.

Implications and Future Directions

From a practical standpoint, this research presents a significant step forward in reducing the computational overhead of deep learning models, specifically transformers. The successful implementation of transformers under INT4 precision without custom hardware indicates the potential for widespread applicability, particularly in environments where computational resources are limited.

Theoretically, this method of reduced precision training challenges the current reliance on high precision arithmetic, encouraging further exploration into the limits of low-bit computation. This approach could open new research avenues into efficient neural network designs that inherently adapt to reduced precision, potentially leading to new architectures streamlined for ultra-low precision computations.

Future directions might include extending these techniques to convolutional neural networks and other deep learning architectures widely used in fields like image processing and speech recognition. Moreover, applying these reduced-precision methods to LLMs, which pose quantization challenges even at higher precisions such as INT8, could offer insights into further optimizing such expansive networks.

Conclusion

The paper's contribution to the field of quantized neural network training is substantial, providing both theoretical insights and practical implementations. It sets a precedent for the adoption of ultra-low precision computation in the training of deep neural networks, highlighting not only potential computational benefits but also the robustness of transformer models even under constrained numerical representations. This work underscores the transformative potential of INT4 arithmetic in the landscape of efficient AI model training.
