The Case for Co-Designing Model Architectures with Hardware

Published 25 Jan 2024 in cs.DC and cs.AI (arXiv:2401.14489v2)

Abstract: While GPUs are responsible for training the vast majority of state-of-the-art deep learning models, the implications of their architecture are often overlooked when designing new deep learning (DL) models. As a consequence, modifying a DL model to be more amenable to the target hardware can significantly improve the runtime performance of DL training and inference. In this paper, we provide a set of guidelines for users to maximize the runtime performance of their transformer models. These guidelines have been created by carefully considering the impact of various model hyperparameters controlling model shape on the efficiency of the underlying computation kernels executed on the GPU. We find the throughput of models with efficient model shapes is up to 39% higher while preserving accuracy compared to models with a similar number of parameters but with unoptimized shapes.


Summary

  • The paper demonstrates that optimizing transformer architecture with GPU-aware hyperparameters can boost throughput by up to 39%.
  • The paper shows that aligning matrix dimensions to multiples of 64 FP16 elements significantly enhances Tensor Core performance.
  • The paper advocates a co-design approach that customizes model parameters to hardware specs, reducing training time and resource waste.

The Case for Co-Designing Model Architectures with Hardware

This paper addresses an often-overlooked aspect of transformer-based deep learning (DL) models: the co-design of model architectures alongside hardware specifications. As transformer models continue to dominate fields such as natural language processing, optimizing their training and inference efficiency is paramount. This paper presents guidelines that promise to enhance the performance of transformers on GPUs by refining model hyperparameters with an understanding of the underlying GPU architecture.

Key Findings

The authors provide compelling evidence that adhering to certain architectural guidelines can lead to significant improvements in GPU throughput without compromising model accuracy. For instance, models with efficient matrix shapes achieved up to 39% higher throughput than models with a similar parameter count but unoptimized shapes. Such optimization matters because even minor inefficiencies in hardware utilization translate into substantial wasted resources at scale.

Transformer models rely heavily on operations such as general matrix multiplications (GEMMs). The research highlights that choosing matrix dimensions favorable to Tensor Core operations can considerably boost performance; specifically, matrix dimensions should be multiples of 64 FP16 elements for maximum efficiency on NVIDIA GPUs. The authors further emphasize understanding and mitigating tile and wave quantization effects as strategies to minimize wasted compute.
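The two shape rules above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's own tooling: `pad_to_multiple` rounds a model dimension up to a Tensor-Core-friendly multiple, and `wave_quantization_efficiency` estimates last-wave utilization under the simplifying assumption that each SM processes one output tile per wave (real occupancy depends on the kernel); the 108-SM default corresponds to an A100.

```python
import math

def pad_to_multiple(dim: int, multiple: int = 64) -> int:
    """Round a model dimension up to the nearest multiple (e.g. 64 FP16
    elements) so GEMM operands map cleanly onto Tensor Core tiles."""
    return ((dim + multiple - 1) // multiple) * multiple

def wave_quantization_efficiency(num_tiles: int, num_sms: int = 108) -> float:
    """Fraction of launched tile-slots doing useful work, assuming one
    output tile per SM per wave. A partially filled final wave still
    occupies a full wave of time, which is the wave quantization effect."""
    waves = math.ceil(num_tiles / num_sms)
    return num_tiles / (waves * num_sms)

# GPT-2's vocabulary size of 50257 is a poor GEMM dimension; padding it
# to the nearest multiple of 64 (50304) reportedly sped up nanoGPT ~25%.
print(pad_to_multiple(50257))           # -> 50304
# One tile past a full wave nearly halves utilization of the extra wave:
print(wave_quantization_efficiency(109))  # ~0.50 on 108 SMs
```

The second print illustrates why "just over" a wave boundary is the worst case: 109 tiles on 108 SMs need two waves, so roughly half the launched capacity is idle.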

Implications and Future Directions

The paper highlights the necessity of marrying the model and hardware design processes, arguing that model dimensions should be chosen with hardware specifications at the forefront. This approach could redefine best practices in DL model development, especially as hardware continues to evolve rapidly. Notably, it underlines the importance of considering the architecture of GPUs, particularly streaming multiprocessors (SMs) and Tensor Cores.

From a theoretical standpoint, the work suggests a shift towards a more hardware-conscious model architecture design paradigm. Practically, these guidelines facilitate more efficient usage of computational resources over the model's lifecycle, minimizing training times and inference costs.

Future research could extend this work by evaluating different hardware types, moving beyond GPUs to more exotic accelerators, or considering the implications of varied parallelism strategies. Furthermore, exploring co-design strategies across different DL architectures, such as encoder-decoder models, could offer additional insights.

Conclusion

This study provides a substantial contribution to the DL community by systematically exploring the dependency between model architecture and hardware efficiency. By designing transformers with GPU specifics in mind, researchers and practitioners can significantly enhance the efficiency of model training and deployment. This work serves as a valuable reference for future endeavors in optimizing deep learning operations for cutting-edge hardware.