Training Transformers with 4-bit Integers
Abstract: Quantizing activations, weights, and gradients to 4 bits is a promising way to accelerate neural network training. However, existing 4-bit training methods require custom numerical formats that are not supported by contemporary hardware. In this work, we propose a training method for transformers in which all matrix multiplications are implemented with INT4 arithmetic. Training at such ultra-low INT4 precision is challenging. To achieve it, we carefully analyze the specific structures of activations and gradients in transformers and propose dedicated quantizers for them. For forward propagation, we identify the challenge of outliers and propose a Hadamard quantizer to suppress them. For backpropagation, we exploit the structural sparsity of gradients, proposing bit splitting and leverage score sampling techniques to quantize gradients accurately. Our algorithm achieves competitive accuracy on a wide range of tasks, including natural language understanding, machine translation, and image classification. Unlike previous 4-bit training methods, ours can be implemented on the current generation of GPUs. Our prototype linear operator implementation is up to 2.2 times faster than its FP16 counterpart and speeds up training by up to 35.1%.
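The idea behind the Hadamard quantizer can be sketched as follows. The abstract does not give implementation details, so this is only an illustrative numpy sketch under common assumptions: an orthonormal Hadamard matrix built by Sylvester's recursion rotates the last dimension of the activation matrix before symmetric INT4 quantization, spreading each outlier entry across many coordinates so the quantization grid covers a smaller dynamic range. Function names (`hadamard_quantize`, `int4_quantize`) are illustrative, not the paper's API.

```python
import numpy as np

def hadamard(k):
    """Sylvester construction of an orthonormal 2^k x 2^k Hadamard matrix."""
    H = np.array([[1.0]])
    for _ in range(k):
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(2 ** k)  # orthonormal scaling

def int4_quantize(x):
    """Symmetric per-tensor INT4 quantization onto the 16 levels [-8, 7]."""
    scale = np.abs(x).max() / 7.0 + 1e-12
    q = np.clip(np.round(x / scale), -8, 7)
    return q, scale

def hadamard_quantize(X, k):
    """Quantize X after rotating its last dimension (of size 2^k) by a
    Hadamard matrix. The rotation mixes an outlier entry into many
    coordinates, shrinking the range the INT4 grid must cover."""
    H = hadamard(k)
    q, scale = int4_quantize(X @ H)
    return q, scale

def hadamard_dequantize(q, scale, k):
    H = hadamard(k)
    return (q * scale) @ H.T  # H is orthonormal, so its inverse is H^T
```

Because the rotation is orthonormal, quantization error in the transformed space maps back to a bounded error in the original space, while the per-tensor scale is set by the (much smaller) post-rotation maximum rather than by the raw outlier.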
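The two backpropagation techniques named above can likewise be sketched. Again the abstract gives no implementation details, so this is a hedged illustration of the general ideas, not the paper's method: `bit_split` decomposes an INT8 value into a signed high nibble and an unsigned low nibble so an INT8 multiply becomes two INT4 multiplies, and `sampled_rows` keeps a norm-proportional subset of gradient rows with importance weights, a simple stand-in for leverage score sampling. Both function names are illustrative.

```python
import numpy as np

def bit_split(g8):
    """Split an INT8 tensor so that g8 = 16 * hi + lo, where hi is a
    signed INT4 value in [-8, 7] and lo is an unsigned 4-bit value in
    [0, 15]. An INT8 matmul then decomposes into two INT4 matmuls."""
    hi = np.floor_divide(g8, 16)
    lo = g8 - 16 * hi
    return hi, lo

def sampled_rows(G, m, rng):
    """Keep m rows of G drawn with probability proportional to their
    norms, rescaled by importance weights so column statistics remain
    unbiased in expectation (a norm-based stand-in for leverage score
    sampling of sparse gradients)."""
    p = np.linalg.norm(G, axis=1)
    p = p / p.sum()
    idx = rng.choice(len(G), size=m, replace=True, p=p)
    return G[idx] / (m * p[idx, None])
```

The sampling step exploits the structural sparsity the abstract mentions: rows with near-zero gradient norm are rarely kept, so the quantizer's budget concentrates on the informative rows.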