MACKO-SpMV: Efficient GPU SpMV for Pruned LLMs
- MACKO-SpMV is a GPU-oriented sparse matrix–vector multiplication format designed for pruned LLMs with moderate sparsity (30–90%), using a compressive coordinate data structure.
- It uses a warp-centric GPU kernel that synchronously decodes small fixed-width delta values and leverages aligned memory reads to optimize performance.
- Empirical results show significant memory savings and speedups—up to 5.62× over dense methods—demonstrating its practical impact on LLM inference.
MACKO-SpMV is a GPU-oriented sparse matrix–vector multiplication (SpMV) format and corresponding kernel, specifically designed to efficiently support the unstructured and moderate sparsity (30–90%) characteristic of pruned LLMs. Prior approaches, including standard CSR, bitmask, and specialized GPU SpMV libraries such as cuSPARSE, Sputnik, and DASP, suffer from insufficient memory savings and often achieve sub-dense throughput in this low and unstructured sparsity regime. MACKO-SpMV introduces a compressive coordinate data format together with a warp-centric kernel tailored to minimize storage overhead and maximize memory bandwidth utilization, leading to significant empirical speedups and reductions in memory footprint without requiring specialized hardware or precomputation (Macko et al., 17 Nov 2025).
1. Data Format and Storage Layout
MACKO’s storage scheme represents a sparse matrix of density $d$ using a “CSR-like” design, but instead of full-width column indices it encodes the difference between consecutive column indices (“deltas”) in small fixed-width bit fields of $b$ bits each (default $b = 4$). The structure comprises three primary arrays:
- values: FP16 entries (or another fixed value width), row-major, padded to maintain alignment.
- deltas: delta values packed into 8-bit words, each $b$ bits wide, encoding column-index differences.
- row_ptr: $(m+1)$ 32-bit offsets for an $m$-row matrix, analogous to CSR.
Padding is introduced as necessary so that each row’s values and deltas are perfectly aligned, enabling locked-step GPU vector loads. During SpMV, the kernel reads aligned blocks from both arrays, reconstructs absolute column indices via prefix-summing the deltas, and multiplies by the corresponding vector element. The format enables efficient vectorized loads and warp-cooperative computation.
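To make the encoding concrete, the following Python sketch illustrates delta encoding and prefix-sum decoding of a single row. It is a hypothetical illustration, not the reference implementation: the function names and the long-gap policy (inserting explicit zero "bridge" entries whenever a gap exceeds the fixed delta range) are assumptions.

```python
def encode_row(row, bits=4):
    """Encode one dense row into (values, deltas) with `bits`-bit deltas.

    When the gap between consecutive nonzeros exceeds the delta range,
    an explicit zero value is stored as a bridge so every delta fits.
    """
    max_delta = (1 << bits) - 1          # largest gap one delta can encode
    values, deltas = [], []
    prev = -1                            # column of the previous stored entry
    for col, v in enumerate(row):
        if v == 0:
            continue
        gap = col - prev
        while gap > max_delta:           # bridge long gaps with padding zeros
            prev += max_delta
            values.append(0.0)
            deltas.append(max_delta)
            gap = col - prev
        values.append(v)
        deltas.append(gap)
        prev = col
    return values, deltas

def decode_row(values, deltas, n):
    """Reconstruct the dense row by prefix-summing the deltas."""
    row, col = [0.0] * n, -1
    for v, d in zip(values, deltas):
        col += d
        row[col] = v
    return row
```

The bridge entries are one plausible way to keep every delta inside the fixed bit width; any such padding is what the format's worst-case storage bounds must account for.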
The MACKO storage size in bytes, for FP16 values and $b$-bit deltas (ignoring alignment padding), is

$$S_{\mathrm{MACKO}} \approx \mathrm{nnz}\left(2 + \tfrac{b}{8}\right) + 4(m+1).$$

The effective density $d_{\mathrm{eff}}^{\mathrm{MACKO}}$, defined as the ratio of stored bytes to hypothetical dense bytes, can be exactly and asymptotically bounded in best-case, worst-case, and i.i.d. random scenarios. For the default configuration (FP16 values, 4-bit deltas), each nonzero costs 2.5 bytes against 2 bytes per dense element, so $d_{\mathrm{eff}}^{\mathrm{MACKO}} \approx 1.25\,d < 1$ for density $d < 0.8$, i.e., sparsity above 20%—outperforming CSR16/32 and bitmask-based representations throughout the 30–90% sparsity range.
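A quick back-of-the-envelope check of the effective density, assuming FP16 values, $b$-bit deltas, and 32-bit row pointers, and ignoring alignment padding and any long-gap padding entries (a simplified sketch, not the paper's exact bound):

```python
def effective_density(m, n, density, bits=4, value_bytes=2):
    """Stored bytes divided by hypothetical dense bytes, ignoring padding."""
    nnz = density * m * n
    stored = nnz * (value_bytes + bits / 8) + 4 * (m + 1)
    return stored / (value_bytes * m * n)
```

For a 4096×4096 matrix at 50% sparsity this gives about 0.626, i.e., a clearly sub-dense footprint; the row-pointer term is negligible at LLM scale.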
2. GPU Kernel and Warp Coordination
The MACKO-SpMV GPU kernel is co-designed for this data layout and modern GPU architecture. It adapts the SplitK GEMM mapping (one warp per output row). Each warp of 32 CUDA threads is responsible for computing a single output row. In each iteration, all threads synchronously:
- Vector-load 128 bytes of packed deltas (256 four-bit deltas, four bytes per thread) for index decoding,
- Vector-load 512 bytes of values (256 FP16 values, eight per thread) for computation,
- Both streams are locked-step and 128B-aligned using Reverse-Offset Memory Alignment (ROMA).
Each thread:
- Computes a local prefix sum over its assigned deltas,
- Participates in a warp-level prefix sum (using CUDA `__shfl_sync` operations) to reconstruct absolute column indices,
- Gathers input vector values at those indices,
- Accumulates FMA results for its portion of the row,
- Participates in a warp-level reduction to finalize the row's dot product.
This design allows the kernel to fully utilize GPU memory bandwidth and minimizes thread divergence or redundant computation, as all padding and alignment are handled a priori in the format.
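The two-level index reconstruction can be emulated on the CPU. The sketch below is an illustration, not the CUDA kernel (the real warp scan uses `__shfl_sync`): each lane scans its own deltas locally, then an exclusive scan of the lane totals gives every lane its starting column.

```python
import itertools

def warp_decode_columns(deltas, lanes=32, per_lane=8, base=-1):
    """Two-level prefix sum over one warp's chunk of deltas.

    `base` is the absolute column preceding this chunk (-1 at row start).
    Returns the absolute column index of every delta in the chunk.
    """
    assert len(deltas) == lanes * per_lane
    # Step 1: each lane prefix-sums its own per_lane deltas.
    local = [list(itertools.accumulate(deltas[t * per_lane:(t + 1) * per_lane]))
             for t in range(lanes)]
    # Step 2: exclusive scan of lane totals (the warp-shuffle scan on GPU),
    # seeded with the carry-in column.
    lane_totals = [scan[-1] for scan in local]
    starts = list(itertools.accumulate([base] + lane_totals[:-1]))
    return [start + s for start, scan in zip(starts, local) for s in scan]
```

The full row computation then gathers the input vector at these columns, accumulates value-times-vector products per lane, and warp-reduces the partial sums into the row's dot product.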
3. Analytical Performance Analysis
Both dense MV and SpMV on modern GPUs are constrained predominantly by memory bandwidth rather than compute, as formalized via the compute-intensity (CI) metric. For SpMV, $CI_{\mathrm{SpMV}} \approx d / d_{\mathrm{eff}}$, where $d$ is the matrix density and $d_{\mathrm{eff}}$ the effective density; hence a lower $d_{\mathrm{eff}}$ directly yields faster runtime as long as the operation remains bandwidth-bound.
Direct comparison with other formats demonstrates that, for typical LLM sparsity, MACKO's effective density is strictly less than that of CSR16, CSR32, and bitmask formats across the 30–90% range. Under the pure bandwidth model, the speedup over another format (or over dense) is the ratio of the two effective densities, so the ideal speedup over dense is $1/d_{\mathrm{eff}}$. For example, an expected effective density of $d_{\mathrm{eff}}^{\mathrm{exp}} \approx 0.28$ implies an ideal speedup over dense of roughly $3.6\times$. Empirically, kernel-execution overheads reduce observed gains below this ideal (starting from roughly $1.2\times$ vs. dense at moderate sparsity), but the bandwidth advantage persists as the main driver.
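Under this pure bandwidth model, the dense-relative speedup bound reduces to the reciprocal of the effective density. A minimal sketch, ignoring row-pointer bytes and padding:

```python
def ideal_speedup(density, bits=4, value_bytes=2):
    """Ideal speedup over a dense FP16 matvec in the pure bandwidth model."""
    d_eff = density * (value_bytes + bits / 8) / value_bytes
    return 1.0 / d_eff
```

For instance, `ideal_speedup(0.5)` is 1.6, an upper bound at 50% sparsity; any measured speedup at that sparsity should sit below it once kernel overheads are included.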
4. Empirical Evaluation
Experiments with LLM-scale matrices in FP16 on RTX 2080 SUPER, RTX 3090, and RTX 4090 GPUs benchmark MACKO against cuBLAS (dense), cuSPARSE, Sputnik, and DASP. Runtimes are measured end-to-end, with cold caches, and averaged across 1000 iterations. Performance at various sparsities is as follows:
| Sparsity (%) | Speedup vs cuBLAS | Speedup vs cuSPARSE | Speedup vs Sputnik | Speedup vs DASP |
|---|---|---|---|---|
| 30 | 0.97× | 5.1× | 2.4× | 2.8× |
| 50 | 1.30× | 9.0× | 2.2× | 2.5× |
| 70 | 1.96× | 13.0× | 2.6× | 2.3× |
| 90 | 5.62× | >20× | 3.5× | 3.0× |
Throughput measurements with Llama2-7B matrices (FP16) at pruned sparsity levels show that MACKO is unique in exceeding dense throughput at practical sparsity (≥25%), reaching about $1.3\times$ dense throughput at 50% sparsity and about $2\times$ at 70%, consistent with the table above. Memory-footprint reductions at these points are commensurate with the corresponding effective-density ratios.
5. Application to LLM Inference
MACKO is integrated into Llama2-7B’s PyTorch inference loop, after 50% unstructured pruning via Wanda. On RTX4090 at FP16:
- Model memory drops from 13.59 GB (dense) to 8.87 GB (a $1.53\times$ reduction).
- End-to-end generation (100-token batch) throughput rises from 66.5 t/s (dense) to 98.6 t/s (a $1.48\times$ speedup).
- Per-token decode latency decreases from approximately 15 ms to 10 ms.
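The reported end-to-end ratios can be sanity-checked directly from the figures above:

```python
# Values quoted above for Llama2-7B at 50% sparsity on an RTX 4090.
dense_gb, sparse_gb = 13.59, 8.87
dense_tps, sparse_tps = 66.5, 98.6

memory_reduction = dense_gb / sparse_gb    # ~1.53x smaller model footprint
throughput_gain = sparse_tps / dense_tps   # ~1.48x faster token generation
```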
No specialized hardware or nonstandard CUDA features are required. There is no need for format-specific precomputation apart from basic row pointer initialization; compute overhead remains minimal. As a result, unstructured 50% pruning becomes immediately beneficial for model serving in existing commodity GPU environments, in contrast with prior approaches where speedups were rarely realized at such low sparsity.
6. Implications, Significance, and Prospects
MACKO-SpMV is the first established format to provide rigorous and bounded storage overhead together with bandwidth-realizing performance in the 30–90% sparsity regime relevant for structured and unstructured model pruning. Empirical results show that, especially at 50% sparsity, memory and time advantages reach practical thresholds for LLM deployment. This advances the case for routine unstructured pruning in real-world inference, directly countering the previously limited benefits ascribed to such sparsity.
The co-design of data format and GPU kernel ensures hardware-compatibility and future extensibility. While the focus is on FP16 and 4-bit delta configurations, the framework generalizes to other low-precision and wide delta settings, subject to similar alignment and packing constraints. A plausible implication is that, as LLMs continue to scale and pruning continues to be essential for resource-efficient inference, such format-kernel joint optimization will become central to software stack development in ML deployment.