The Case for Co-Designing Model Architectures with Hardware
Abstract: While GPUs are responsible for training the vast majority of state-of-the-art deep learning models, the implications of their architecture are often overlooked when designing new deep learning (DL) models. As a consequence, modifying a DL model to be more amenable to the target hardware can significantly improve the runtime performance of DL training and inference. In this paper, we provide a set of guidelines for users to maximize the runtime performance of their transformer models. These guidelines have been created by carefully considering the impact of various model hyperparameters controlling model shape on the efficiency of the underlying computation kernels executed on the GPU. We find that the throughput of models with efficient shapes is up to 39% higher than that of models with a similar number of parameters but unoptimized shapes, while accuracy is preserved.
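One guideline of this kind can be illustrated concretely: GEMM dimensions such as the hidden size or vocabulary size map onto GPU tensor-core tiles most efficiently when they are multiples of a hardware-friendly value (e.g. 64 for fp16 on recent NVIDIA GPUs). The helper below is a hypothetical sketch, not code from the paper:

```python
def pad_to_multiple(dim: int, multiple: int = 64) -> int:
    """Round a model dimension up to the nearest multiple of `multiple`.

    Padding dimensions this way avoids tile-quantization effects in the
    underlying GEMM kernels; the choice of 64 is an assumption that suits
    fp16 tensor cores on A100-class GPUs.
    """
    return ((dim + multiple - 1) // multiple) * multiple

# Example: GPT-2's vocabulary of 50257 tokens pads to 50304, a multiple of 64.
# Karpathy reported that this change alone sped up nanoGPT by roughly 25%,
# despite slightly increasing the parameter count.
padded_vocab = pad_to_multiple(50257)  # -> 50304
```

The same rounding can be applied to other shape hyperparameters (hidden size, FFN width) before training, trading a few extra parameters for better kernel efficiency.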