FP8-LM: Training FP8 Large Language Models
Abstract: In this paper, we explore FP8 low-bit data formats for efficient training of large language models (LLMs). Our key insight is that most variables in LLM training, such as gradients and optimizer states, can use low-precision data formats without compromising model accuracy or requiring changes to hyper-parameters. Specifically, we propose a new FP8 automatic mixed-precision framework for training LLMs. The framework offers three levels of FP8 utilization to streamline mixed-precision and distributed parallel training, incrementally incorporating 8-bit gradients, optimizer states, and distributed learning. Experimental results show that, when training a GPT-175B model on the H100 GPU platform, our FP8 mixed-precision training framework not only reduced real memory usage by 39% but also ran 75% faster than the widely adopted BF16 framework (i.e., Megatron-LM), surpassing the speed of Nvidia Transformer Engine by 37%. This substantially reduces the training costs of large foundation models. Furthermore, our FP8 mixed-precision training methodology is generic: it can be seamlessly applied to other tasks such as LLM instruction tuning and reinforcement learning with human feedback, offering savings in fine-tuning expenses. Our FP8 low-precision training framework is open-sourced at aka.ms/MS.AMP (https://github.com/Azure/MS-AMP).