Fast Kronecker Matrix-Matrix Multiplication on GPUs

Published 18 Jan 2024 in cs.DC (arXiv:2401.10187v3)

Abstract: Kronecker Matrix-Matrix Multiplication (Kron-Matmul) is the multiplication of a matrix with the Kronecker Product of several smaller matrices. Kron-Matmul is a core operation for many scientific and machine learning computations. State-of-the-art Kron-Matmul implementations utilize existing tensor algebra operations, such as matrix multiplication, transpose, and tensor matrix multiplication. However, this design choice prevents several Kron-Matmul specific optimizations, thus, leaving significant performance on the table. To address this issue, we present FastKron, an efficient technique for Kron-Matmul on single and multiple GPUs. FastKron is independent of linear algebra operations enabling several new optimizations for Kron-Matmul. Thus, it performs up to 40.7x and 7.85x faster than existing implementations on 1 and 16 GPUs respectively.


Summary

  • The paper introduces a novel algorithm that bypasses standard linear algebra operations for efficient Kron-Matmul on single and multi-GPU systems, achieving up to 40.7× speedup.
  • It employs advanced CUDA optimizations including tiling, shared memory caching, and kernel fusion to eliminate costly transpose operations and reduce memory overhead.
  • Integration into GPyTorch demonstrates practical impact by accelerating Gaussian Process training with up to 6.2× reduction in training time on multi-GPU setups.


This paper introduces FastKron, a novel approach to Kronecker Matrix-Matrix Multiplication (Kron-Matmul) designed for both single and multi-GPU architectures. Unlike existing methods that rely on standard linear algebra operations, FastKron employs a specialized algorithm that enables significant performance optimizations, achieving up to 40.7× speedup on a single GPU and 7.85× on a multi-GPU system compared to state-of-the-art implementations.

Background and Motivation

Kron-Matmul is a fundamental operation in various scientific computing and machine learning applications, including Gaussian Processes (GPs). Existing algorithms, such as the shuffle algorithm and the Fused Tensor Matrix Multiply Transpose (FTMMT), depend on linear algebra operations like matrix multiplication, transpose, and tensor matrix multiplication. This reliance limits potential optimizations specific to Kron-Matmul, leading to inefficiencies such as:

  • High transpose costs, accounting for up to 80% of execution time.
  • Suboptimal performance of linear algebra kernels on small, rectangular matrices.
  • Redundant global memory accesses due to full intermediate storage at each iteration.
  • High communication volume in multi-GPU implementations due to frequent intermediate exchanges.
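To make the transpose cost concrete, here is a minimal NumPy sketch of the shuffle algorithm for square $P \times P$ factors (illustrative only; the function and variable names are ours, not the paper's code):

```python
import numpy as np

def shuffle_kron_matmul(X, factors):
    """Shuffle algorithm: Y = X @ (F1 kron F2 kron ... kron FN).

    X is M x P**N and every factor is P x P. Each iteration performs a
    GEMM on the last Kronecker mode, then a transpose of the full
    intermediate -- the step that can dominate execution time.
    """
    M = X.shape[0]
    P = factors[0].shape[0]
    N = len(factors)
    Z = X
    for F in reversed(factors):               # contract the last mode first
        Z = Z.reshape(M * P**(N - 1), P) @ F  # matrix multiplication
        Z = (Z.reshape(M, P**(N - 1), P)      # full intermediate in memory
               .transpose(0, 2, 1)            # the costly transpose
               .reshape(M, P**N))
    return Z

# Check against an explicit Kronecker product on a small case.
rng = np.random.default_rng(0)
F1, F2, F3 = (rng.standard_normal((2, 2)) for _ in range(3))
X = rng.standard_normal((3, 8))
assert np.allclose(shuffle_kron_matmul(X, [F1, F2, F3]),
                   X @ np.kron(np.kron(F1, F2), F3))
```

The transpose after every GEMM is exactly what FastKron's design removes.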

The FastKron Algorithm

FastKron addresses the limitations of existing algorithms by introducing a novel approach that bypasses linear algebra primitives, enabling specific optimizations for Kron-Matmul. The core of FastKron involves dividing rows of the input matrix into slices and multiplying each slice with all columns of the factor matrices (Figure 1). Consecutive elements in the intermediate matrix are generated by multiplying consecutive slices with the same column. This design eliminates the need for transpose or reshape operations, which are major bottlenecks in existing methods.

(Figure 1)

Figure 1: First iteration of the shuffle algorithm for Kron-Matmul of $X_{2\times 4}$ and $F^{(1)}_{2\times 2} \otimes F^{(2)}_{2\times 2}$, illustrating the reshape and transpose operations.

The computational complexity of FastKron is $\mathcal{O}\!\left(M P \sum_{i=1}^{N} Q^{N-i} P^{i}\right)$, with $\mathcal{O}\!\left(M \sum_{i=1}^{N} Q^{N-i} P^{i}\right)$ memory accesses, where $X$ is $M \times P^{N}$ and each of the $N$ factors is $P \times Q$. This yields a computation-to-memory access ratio of $P$. This favorable ratio, combined with the elimination of transpose operations, contributes to FastKron's superior performance. Figure 2 illustrates the sliced-multiply approach.
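As a quick sanity check of these complexity expressions (with illustrative sizes of our own choosing), the ratio of floating-point operations to memory accesses indeed comes out to $P$:

```python
M, P, Q, N = 16, 8, 8, 3  # illustrative sizes, not from the paper
work = sum(Q**(N - i) * P**i for i in range(1, N + 1))
flops = M * P * work   # O(M * P * sum_i Q^(N-i) * P^i) operations
mem = M * work         # O(M * sum_i Q^(N-i) * P^i) memory accesses
print(flops // mem)    # the factor dimension P, here 8
```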

(Figure 2)

Figure 2: First iteration of the FastKron Kron-Matmul algorithm of $X_{2\times 4}$ with $F^{(1)}_{2\times 2} \otimes F^{(2)}_{2\times 2}$, demonstrating the sliced-multiply operation.
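The sliced-multiply iteration can be sketched in NumPy as follows. This is a simplified model of the access pattern only; the actual implementation is a fused CUDA kernel, and the names here are ours:

```python
import numpy as np

def sliced_multiply(Z, F):
    """One FastKron iteration: multiply each length-P slice of every row
    of Z with every column of F. Output position j * num_slices + s holds
    slice s times column j, so consecutive outputs come from consecutive
    slices and no separate transpose/reshape pass is needed."""
    M, K = Z.shape
    P, Q = F.shape
    S = K // P                   # number of slices per row
    slices = Z.reshape(M, S, P)  # slices[m, s, :] is slice s of row m
    # out[m, j, s] = slices[m, s, :] @ F[:, j]
    return np.einsum('msp,pj->mjs', slices, F).reshape(M, S * Q)

def fastkron_matmul(X, factors):
    Z = X
    for F in reversed(factors):  # same factor order as the shuffle algorithm
        Z = sliced_multiply(Z, F)
    return Z

# Agreement with an explicit Kronecker product on a small case.
rng = np.random.default_rng(1)
F1, F2 = (rng.standard_normal((2, 2)) for _ in range(2))
X = rng.standard_normal((3, 4))
assert np.allclose(fastkron_matmul(X, [F1, F2]), X @ np.kron(F1, F2))
```

Note how the output indexing absorbs the transpose that the shuffle algorithm performs as a separate pass.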

CUDA Implementation Details

FastKron's GPU implementation incorporates several key optimizations:

  • Tiling: Assigns multiple slices and columns to each thread, enhancing data reuse and parallelism.
  • Shared Memory Caching: Caches inputs in shared memory while minimizing bank conflicts and ensuring coalesced global memory accesses.
  • Kernel Fusion: Fuses multiplications with multiple factors into a single GPU kernel by storing intermediates in shared memory, reducing global memory accesses.
  • Shift Caching: Minimizes shared memory bank conflicts by strategically shifting elements during caching.

The CUDA kernel's workflow involves loading slices of rows and columns into shared memory, transferring portions to registers, performing sliced multiply-accumulate operations, and writing results back to global memory. The tiling strategy, exemplified in Figure 4, divides the input and factor matrices into blocks processed by individual thread blocks.

(Figure 4)

Figure 4: Thread block 0 is assigned to the 1st row of $X$ and 2 columns of $F^{(n)}$ to produce $\frac{512}{8} \times 2 = 128$ elements of $Y$.
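The per-block output count in Figure 4 follows directly from the tile shape (the variable names below are ours):

```python
K, P = 512, 8        # row length of X and factor dimension in Figure 4
tile_cols = 2        # columns of the factor assigned to thread block 0
num_slices = K // P  # length-P slices per row of X
print(num_slices * tile_cols)  # 512/8 * 2 = 128 elements of Y
```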

The shift caching technique is particularly notable, as it reduces shared memory bank conflicts compared to direct caching methods. The autotuning mechanism automatically selects optimal tile sizes for different Kron-Matmul shapes.
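The following toy model illustrates the general idea behind shift caching: skewing where each element is stored in shared memory so that a warp's accesses spread across banks (a simplified, self-contained illustration; the paper's actual layout may differ):

```python
from collections import Counter

NUM_BANKS = 32  # shared memory banks on NVIDIA GPUs

def worst_conflict(word_addresses):
    # Worst-case number of threads in a warp hitting the same bank.
    return max(Counter(a % NUM_BANKS for a in word_addresses).values())

rows = cols = 32
col = 5  # each thread of the warp reads this logical column of its row

# Direct caching: element (r, c) stored at word r*cols + c.
direct = [r * cols + col for r in range(rows)]
# Shift caching: element (r, c) stored shifted to column (c + r) % cols.
shifted = [r * cols + (col + r) % cols for r in range(rows)]

print(worst_conflict(direct))   # 32-way conflict: every access serialized
print(worst_conflict(shifted))  # 1: conflict-free
```

With direct caching all 32 threads land in the same bank; the shifted layout spreads them across all 32 banks at the cost of slightly more index arithmetic.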

Distributed Kron-Matmul

For multi-GPU systems, FastKron minimizes communication volume by performing multiple local multiplications on each GPU before exchanging intermediates. The algorithm distributes the computation across a 2D grid of GPUs, dividing the input matrix into blocks processed by individual GPUs. Local sliced multiplications are performed, and intermediates are communicated using NVIDIA NCCL for efficient data transfer. The element distribution across GPUs is structured to optimize communication efficiency (Figure 3).

(Figure 3)

Figure 3: The element distribution of local intermediates across all 4 GPUs for Kron-Matmul of $X_{1\times 256}$ with 4 factors $F^{(i)}_{4\times 4}$ and a GPU grid $\{G_M, G_K\} = \{1, 4\}$.

Performance Evaluation

The performance of FastKron was evaluated against state-of-the-art implementations, including GPyTorch, COGENT, cuTensor, CTF, and DISTAL. Microbenchmarks demonstrated significant speedups on a single GPU, with FastKron achieving up to 87% of the GPU's maximum FLOPS. The fusion optimization contributed substantially to performance gains, particularly for smaller factor sizes. Compared to GPyTorch, FastKron achieved speedups ranging from 3.11× to 7.62×, attributed to the elimination of transpose operations and more efficient matrix multiplication. Relative to COGENT and cuTensor, FastKron achieved speedups of up to 6.40× and 5.41×, respectively, due to kernel fusion and improved shared memory access patterns. Weak scaling experiments on a 16-GPU system showed FastKron outperforming CTF and DISTAL by 7.85× and 5.33×, respectively.

Application to Gaussian Processes

FastKron was integrated into GPyTorch to accelerate the training of Gaussian Processes (GPs) using Structured Kernel Interpolation (SKI). The integration resulted in training time reductions of up to 6.20× on a 16-GPU system, demonstrating the practical impact of FastKron in a real-world application.

Conclusion

FastKron presents a highly optimized approach to Kron-Matmul on GPUs, achieving substantial performance improvements over existing methods. The key innovations include a novel algorithm that eliminates transpose operations, an efficient CUDA implementation with tiling, shared memory caching, and kernel fusion, and a distributed algorithm that minimizes communication volume on multi-GPU systems. The integration of FastKron into GPyTorch showcases its potential to accelerate machine learning applications based on Gaussian Processes. Future work could explore the application of FastKron to other domains that rely heavily on Kronecker matrix operations, and further optimization of the CUDA kernels for specific hardware architectures.
