Fast Kronecker Matrix-Matrix Multiplication on GPUs

Published 18 Jan 2024 in cs.DC (arXiv:2401.10187v3)

Abstract: Kronecker Matrix-Matrix Multiplication (Kron-Matmul) is the multiplication of a matrix with the Kronecker Product of several smaller matrices. Kron-Matmul is a core operation for many scientific and machine learning computations. State-of-the-art Kron-Matmul implementations utilize existing tensor algebra operations, such as matrix multiplication, transpose, and tensor matrix multiplication. However, this design choice prevents several Kron-Matmul specific optimizations, thus, leaving significant performance on the table. To address this issue, we present FastKron, an efficient technique for Kron-Matmul on single and multiple GPUs. FastKron is independent of linear algebra operations enabling several new optimizations for Kron-Matmul. Thus, it performs up to 40.7x and 7.85x faster than existing implementations on 1 and 16 GPUs respectively.


Summary

  • The paper introduces a novel algorithm that bypasses standard linear algebra operations for efficient Kron-Matmul on single and multi-GPU systems, achieving up to 40.7× speedup.
  • It employs advanced CUDA optimizations including tiling, shared memory caching, and kernel fusion to eliminate costly transpose operations and reduce memory overhead.
  • Integration into GPyTorch demonstrates practical impact by accelerating Gaussian Process training with up to 6.2× reduction in training time on multi-GPU setups.


This paper introduces FastKron, a novel approach to Kronecker Matrix-Matrix Multiplication (Kron-Matmul) designed for both single and multi-GPU architectures. Unlike existing methods that rely on standard linear algebra operations, FastKron employs a specialized algorithm that enables significant performance optimizations, achieving up to 40.7× speedup on a single GPU and 7.85× on a multi-GPU system compared to state-of-the-art implementations.

Background and Motivation

Kron-Matmul is a fundamental operation in various scientific computing and machine learning applications, including Gaussian Processes (GPs). Existing algorithms, such as the shuffle algorithm and the Fused Tensor Matrix Multiply Transpose (FTMMT), depend on linear algebra operations like matrix multiplication, transpose, and tensor matrix multiplication. This reliance limits potential optimizations specific to Kron-Matmul, leading to inefficiencies such as:

  • High transpose costs, accounting for up to 80% of execution time.
  • Suboptimal performance of linear algebra kernels on small, rectangular matrices.
  • Redundant global memory accesses due to full intermediate storage at each iteration.
  • High communication volume in multi-GPU implementations due to frequent intermediate exchanges.
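To make the transpose cost concrete, here is a minimal NumPy sketch of the shuffle algorithm for square $P \times P$ factors (illustrative only; the function and variable names are ours, not the paper's code):

```python
import numpy as np

def shuffle_kron_matmul(X, factors):
    """Shuffle algorithm: Y = X @ (F1 kron F2 kron ... kron FN).

    X is M x P**N and every factor is P x P. Each iteration performs a
    GEMM on the last Kronecker mode, then a transpose of the full
    intermediate -- the step that can dominate execution time.
    """
    M = X.shape[0]
    P = factors[0].shape[0]
    N = len(factors)
    Z = X
    for F in reversed(factors):               # contract the last mode first
        Z = Z.reshape(M * P**(N - 1), P) @ F  # matrix multiplication
        Z = (Z.reshape(M, P**(N - 1), P)      # full intermediate in memory
               .transpose(0, 2, 1)            # the costly transpose
               .reshape(M, P**N))
    return Z

# Check against an explicit Kronecker product on a small case.
rng = np.random.default_rng(0)
F1, F2, F3 = (rng.standard_normal((2, 2)) for _ in range(3))
X = rng.standard_normal((3, 8))
assert np.allclose(shuffle_kron_matmul(X, [F1, F2, F3]),
                   X @ np.kron(np.kron(F1, F2), F3))
```

The transpose after every GEMM is exactly what FastKron's design removes.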

The FastKron Algorithm

FastKron addresses the limitations of existing algorithms by introducing a novel approach that bypasses linear algebra primitives, enabling specific optimizations for Kron-Matmul. The core of FastKron involves dividing rows of the input matrix into slices and multiplying each slice with all columns of the factor matrices (Figure 1). Consecutive elements in the intermediate matrix are generated by multiplying consecutive slices with the same column. This design eliminates the need for transpose or reshape operations, which are major bottlenecks in existing methods.

(Figure 1)

Figure 1: First iteration of the shuffle algorithm for Kron-Matmul of $X_{2\times 4}$ and $F^{(1)}_{2\times 2} \otimes F^{(2)}_{2\times 2}$, illustrating the reshape and transpose operations.

The computational complexity of FastKron is $\mathcal{O}\!\left(M P \sum_{i=1}^{N} Q^{N-i} P^{i}\right)$, with $\mathcal{O}\!\left(M \sum_{i=1}^{N} Q^{N-i} P^{i}\right)$ memory accesses, where $X$ is $M \times P^{N}$ and each of the $N$ factors is $P \times Q$. This yields a computation-to-memory access ratio of $P$. This favorable ratio, combined with the elimination of transpose operations, contributes to FastKron's superior performance. Figure 2 illustrates the sliced-multiply approach.
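As a quick sanity check of these complexity expressions (with illustrative sizes of our own choosing), the ratio of floating-point operations to memory accesses indeed comes out to $P$:

```python
M, P, Q, N = 16, 8, 8, 3  # illustrative sizes, not from the paper
work = sum(Q**(N - i) * P**i for i in range(1, N + 1))
flops = M * P * work   # O(M * P * sum_i Q^(N-i) * P^i) operations
mem = M * work         # O(M * sum_i Q^(N-i) * P^i) memory accesses
print(flops // mem)    # the factor dimension P, here 8
```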

(Figure 2)

Figure 2: First iteration of the FastKron Kron-Matmul algorithm of $X_{2\times 4}$ with $F^{(1)}_{2\times 2} \otimes F^{(2)}_{2\times 2}$, demonstrating the sliced-multiply operation.
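The sliced-multiply iteration can be sketched in NumPy as follows. This is a simplified model of the access pattern only; the actual implementation is a fused CUDA kernel, and the names here are ours:

```python
import numpy as np

def sliced_multiply(Z, F):
    """One FastKron iteration: multiply each length-P slice of every row
    of Z with every column of F. Output position j * num_slices + s holds
    slice s times column j, so consecutive outputs come from consecutive
    slices and no separate transpose/reshape pass is needed."""
    M, K = Z.shape
    P, Q = F.shape
    S = K // P                   # number of slices per row
    slices = Z.reshape(M, S, P)  # slices[m, s, :] is slice s of row m
    # out[m, j, s] = slices[m, s, :] @ F[:, j]
    return np.einsum('msp,pj->mjs', slices, F).reshape(M, S * Q)

def fastkron_matmul(X, factors):
    Z = X
    for F in reversed(factors):  # same factor order as the shuffle algorithm
        Z = sliced_multiply(Z, F)
    return Z

# Agreement with an explicit Kronecker product on a small case.
rng = np.random.default_rng(1)
F1, F2 = (rng.standard_normal((2, 2)) for _ in range(2))
X = rng.standard_normal((3, 4))
assert np.allclose(fastkron_matmul(X, [F1, F2]), X @ np.kron(F1, F2))
```

Note how the output indexing absorbs the transpose that the shuffle algorithm performs as a separate pass.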

CUDA Implementation Details

FastKron's GPU implementation incorporates several key optimizations:

  • Tiling: Assigns multiple slices and columns to each thread, enhancing data reuse and parallelism.
  • Shared Memory Caching: Caches inputs in shared memory while minimizing bank conflicts and ensuring coalesced global memory accesses.
  • Kernel Fusion: Fuses multiplications with multiple factors into a single GPU kernel by storing intermediates in shared memory, reducing global memory accesses.
  • Shift Caching: Minimizes shared memory bank conflicts by strategically shifting elements during caching.

The CUDA kernel's workflow involves loading slices of rows and columns into shared memory, transferring portions to registers, performing sliced multiply-accumulate operations, and writing results back to global memory. The tiling strategy, exemplified in Figure 4, divides the input and factor matrices into blocks processed by individual thread blocks.

(Figure 4)

Figure 4: Thread block 0 is assigned to the 1st row of $X$ and 2 columns of $F^{(n)}$ to produce $\frac{512}{8} \times 2 = 128$ elements of $Y$.
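The per-block output count in Figure 4 follows directly from the tile shape (the variable names below are ours):

```python
K, P = 512, 8        # row length of X and factor dimension in Figure 4
tile_cols = 2        # columns of the factor assigned to thread block 0
num_slices = K // P  # length-P slices per row of X
print(num_slices * tile_cols)  # 512/8 * 2 = 128 elements of Y
```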

The shift caching technique is particularly notable, as it reduces shared memory bank conflicts compared to direct caching methods. The autotuning mechanism automatically selects optimal tile sizes for different Kron-Matmul shapes.
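The following toy model illustrates the general idea behind shift caching: skewing where each element is stored in shared memory so that a warp's accesses spread across banks (a simplified, self-contained illustration; the paper's actual layout may differ):

```python
from collections import Counter

NUM_BANKS = 32  # shared memory banks on NVIDIA GPUs

def worst_conflict(word_addresses):
    # Worst-case number of threads in a warp hitting the same bank.
    return max(Counter(a % NUM_BANKS for a in word_addresses).values())

rows = cols = 32
col = 5  # each thread of the warp reads this logical column of its row

# Direct caching: element (r, c) stored at word r*cols + c.
direct = [r * cols + col for r in range(rows)]
# Shift caching: element (r, c) stored shifted to column (c + r) % cols.
shifted = [r * cols + (col + r) % cols for r in range(rows)]

print(worst_conflict(direct))   # 32-way conflict: every access serialized
print(worst_conflict(shifted))  # 1: conflict-free
```

With direct caching all 32 threads land in the same bank; the shifted layout spreads them across all 32 banks at the cost of slightly more index arithmetic.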

Distributed Kron-Matmul

For multi-GPU systems, FastKron minimizes communication volume by performing multiple local multiplications on each GPU before exchanging intermediates. The algorithm distributes the computation across a 2D grid of GPUs, dividing the input matrix into blocks processed by individual GPUs. Local sliced multiplications are performed, and intermediates are communicated using NVIDIA NCCL for efficient data transfer. The element distribution across GPUs is structured to optimize communication efficiency (Figure 3).

(Figure 3)

Figure 3: The element distribution of local intermediates across all 4 GPUs for Kron-Matmul of $X_{1\times 256}$ with 4 factors $F^{(i)}_{4\times 4}$ and a GPU grid $\{G_M, G_K\} = \{1, 4\}$.

Performance Evaluation

The performance of FastKron was evaluated against state-of-the-art implementations, including GPyTorch, COGENT, cuTensor, CTF, and DISTAL. Microbenchmarks demonstrated significant speedups on a single GPU, with FastKron achieving up to 87% of the GPU's maximum FLOPS. The fusion optimization contributed substantially to performance gains, particularly for smaller factor sizes. Compared to GPyTorch, FastKron achieved speedups ranging from 3.11× to 7.62×, attributed to the elimination of transpose operations and more efficient matrix multiplication. Relative to COGENT and cuTensor, FastKron achieved speedups of up to 6.40× and 5.41×, respectively, due to kernel fusion and improved shared memory access patterns. Weak scaling experiments on a 16-GPU system showed FastKron outperforming CTF and DISTAL by 7.85× and 5.33×, respectively.

Application to Gaussian Processes

FastKron was integrated into GPyTorch to accelerate the training of Gaussian Processes (GPs) using Structured Kernel Interpolation (SKI). The integration resulted in training time reductions of up to 6.20× on a 16-GPU system, demonstrating the practical impact of FastKron in a real-world application.

Conclusion

FastKron presents a highly optimized approach to Kron-Matmul on GPUs, achieving substantial performance improvements over existing methods. The key innovations include a novel algorithm that eliminates transpose operations, an efficient CUDA implementation with tiling, shared memory caching, and kernel fusion, and a distributed algorithm that minimizes communication volume on multi-GPU systems. The integration of FastKron into GPyTorch showcases its potential to accelerate machine learning applications based on Gaussian Processes. Future work could explore the application of FastKron to other domains that rely heavily on Kronecker matrix operations, and further optimization of the CUDA kernels for specific hardware architectures.
