Adaptive Row-grouped CSR (AR-CSR)
- The paper introduces AR-CSR, a novel sparse matrix storage format that significantly improves GPU SpMV performance by addressing load imbalance and uncoalesced memory accesses.
- It utilizes hierarchical row grouping and fixed-size chunking to distribute workload evenly across CUDA threads, leading to efficient parallel execution.
- Empirical results demonstrate up to 10× speed-up over CUSPARSE with minimal storage overhead, making AR-CSR ideal for irregular sparse matrices.
Adaptive Row-grouped CSR (AR-CSR) is an adaptive sparse matrix storage format designed for efficient sparse matrix-vector multiplication (SpMV) on GPUs. It targets the two main performance bottlenecks of thread-per-row CSR kernels, including those in the CUSPARSE library: uncoalesced memory access and load imbalance across threads, both of which worsen with high variance in row length. To address them, it introduces hierarchical row grouping and chunking strategies. After a one-time format conversion from standard CSR, AR-CSR achieves significantly higher SpMV throughput for a broad class of irregular sparse matrices (Heller et al., 2012).
1. Motivation and Comparative Foundations
Sparse matrix-vector multiplication (SpMV) is a key computational kernel in scientific computing. Standard CSR (Compressed Sparse Row) storage consists of three arrays: values, column indices, and row pointers. The usual GPU kernel for this format assigns one thread per matrix row. This approach is efficient for matrices with near-constant row lengths but suffers from severe load imbalance and scattered memory accesses when row lengths vary widely. CUSPARSE, the widely used CUDA library, adopts a similar thread-per-row approach but adds tuned kernels for latency hiding and improved, though still non-ideal, memory coalescing. Neither approach fundamentally eliminates thread divergence and the resulting performance degradation on real-world, irregular matrices.
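To make the imbalance concrete, here is a minimal CPU sketch of the thread-per-row scheme, written in Python for readability. The function names and the `lockstep_efficiency` metric are illustrative assumptions, not part of the paper or of CUSPARSE. The metric models a warp as lanes that all run for as many iterations as the longest row in the warp.

```python
WARP_SIZE = 32  # number of lanes that execute in lockstep on CUDA hardware


def csr_spmv(values, col_idx, row_ptr, x):
    """y = A @ x for a CSR matrix; each outer iteration plays the role of one GPU thread."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def lockstep_efficiency(row_ptr, warp_size=WARP_SIZE):
    """Fraction of issued lane-iterations that do useful work (illustrative model)."""
    lengths = [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]
    issued = 0
    for start in range(0, len(lengths), warp_size):
        # Every lane of the warp waits for the longest row assigned to it.
        issued += max(lengths[start:start + warp_size]) * warp_size
    return sum(lengths) / issued if issued else 1.0


# 4x4 example: one long row followed by three short ones.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
col_idx = [0, 1, 2, 3, 1, 2, 3]
row_ptr = [0, 4, 5, 6, 7]
y = csr_spmv(values, col_idx, row_ptr, [1.0, 1.0, 1.0, 1.0])
eff = lockstep_efficiency(row_ptr, warp_size=4)  # toy warp of four lanes
```

With the toy warp of four lanes, only 7 of the 16 issued lane-iterations do useful work, because three lanes idle while the first one walks its four-element row.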
AR-CSR introduces multi-row grouping and intra-group parallel chunking. Consecutive rows are aggregated into groups, each processed by a CUDA block. Individual threads within a block process fixed-size "chunks," allowing long rows to be split among several threads and addressing both workload disparity and memory access inefficiency (Heller et al., 2012).
2. Data Structures and Memory Layout
The AR-CSR format partitions the rows of a sparse matrix into groups, with each group assigned to one CUDA block of threads. The primary data arrays and mapping structures are:
- Per-group metadata, encoding each group’s starting row, size, global offset in the data arrays, and chunk size.
- A prefix-sum array indicating, for each row, the cumulative count of threads assigned to all preceding rows.
- A real-valued array and an integer array of equal length, storing the grouped matrix entries and their column indices; padding slots are marked by a sentinel value in the column-index array.
Within each group, the memory layout is organized by chunk (thread) and by position within the chunk; the row to which each chunk contributes is defined by the thread-to-row mapping.
This organization enables fully coalesced memory accesses for both values and column indices within each block and distributes work among threads proportionally to the nnz-per-row profile (Heller et al., 2012).
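The following sketch packs a single group into one layout that is consistent with the coalescing claim. It assumes an interleaved index `j * T + t` for position `j` of thread `t` in a block of `T` threads; the source's exact index formula may differ. The function name `pack_group`, the contiguous near-equal split of a row among its threads, and the sentinel `PAD_COL = -1` are all illustrative assumptions.

```python
PAD_COL = -1  # hypothetical sentinel for padded column slots


def pack_group(rows, threads_per_row, pad_col=PAD_COL):
    """Pack one group of rows into interleaved value/column arrays.

    rows            -- list of rows, each a list of (column, value) pairs
    threads_per_row -- number of threads (>= 1) assigned to each row
    Returns (values, columns, chunk, thread_row).
    """
    pieces, thread_row = [], []
    for r, (row, n) in enumerate(zip(rows, threads_per_row)):
        per = -(-len(row) // n)  # ceiling division: elements per thread
        for p in range(n):
            pieces.append(row[p * per:(p + 1) * per])
            thread_row.append(r)  # thread-to-row mapping
    T = len(pieces)                        # threads used by this group
    chunk = max(len(p) for p in pieces)    # common chunk size after padding
    values = [0.0] * (chunk * T)
    columns = [pad_col] * (chunk * T)
    for t, piece in enumerate(pieces):
        for j, (col, val) in enumerate(piece):
            # At step j, threads 0..T-1 touch consecutive slots j*T .. j*T+T-1.
            values[j * T + t] = val
            columns[j * T + t] = col
    return values, columns, chunk, thread_row


rows = [
    [(0, 1.0), (1, 2.0), (2, 3.0), (3, 4.0)],  # long row, split over two threads
    [(1, 5.0)],
    [(2, 6.0)],
]
values, columns, chunk, thread_row = pack_group(rows, [2, 1, 1])
```

Here the four-element row is shared by threads 0 and 1, so the group needs a chunk size of 2 instead of 4, and only the two single-element rows carry one padded slot each.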
3. Conversion Algorithm from CSR to AR-CSR
AR-CSR construction from existing CSR-formatted data proceeds as follows:
- Partitioning Rows into Groups: Sequentially process rows, accumulating the local nnz and row count until the nnz exceeds the group capacity (desired chunk size times block size) or the row count reaches the maximum block size; this marks a group boundary.
- Thread Allocation within Groups: Initially assign one thread per row; then iteratively give each remaining thread (until all threads of the block are used) to the row whose per-thread nnz would drop the most from the additional assignment, aiming to minimize intra-group chunk size variation.
- Prefix-Sum Calculation: Compute exclusive prefix-sums of thread allocations within each group for efficient chunk-to-row mappings.
- Populating Data Arrays: For each group, thread, and chunk position, copy matrix elements from CSR (or pad with artificial zeros if needed) into the value and column-index arrays.
The time complexity is 1, and the space overhead is 2. Conversion typically incurs only a few milliseconds for large matrices and is amortized over many SpMV calls in iterative solvers (Heller et al., 2012).
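The first three conversion steps listed above can be sketched as follows. This is a simplified reading of the description, not the authors' implementation: `partition_rows` closes a group just before it would overflow, `allocate_threads` stops early once no row would benefit from another thread, and all function and parameter names are assumptions. Populating the data arrays is omitted.

```python
import heapq


def ceil_div(a, b):
    return -(-a // b)


def partition_rows(row_lengths, block_size, desired_chunk):
    """Split consecutive rows into groups; returns a list of (first_row, num_rows)."""
    capacity = block_size * desired_chunk  # nnz budget of one group
    groups, first, nnz = [], 0, 0
    for row, length in enumerate(row_lengths):
        in_group = row - first
        # Close the group if this row would overflow the nnz budget or the thread count.
        if in_group > 0 and (nnz + length > capacity or in_group == block_size):
            groups.append((first, in_group))
            first, nnz = row, 0
        nnz += length
    if first < len(row_lengths):
        groups.append((first, len(row_lengths) - first))
    return groups


def allocate_threads(lengths, block_size):
    """One thread per row, then hand spare threads to the row with the largest gain."""
    threads = [1] * len(lengths)

    def gain(r):
        # Reduction in per-thread nnz if row r received one more thread.
        return ceil_div(lengths[r], threads[r]) - ceil_div(lengths[r], threads[r] + 1)

    heap = [(-gain(r), r) for r in range(len(lengths))]  # max-heap via negation
    heapq.heapify(heap)
    spare = block_size - len(lengths)
    while spare > 0 and heap:
        neg_gain, r = heapq.heappop(heap)
        if neg_gain == 0:  # no row benefits from an extra thread
            break
        threads[r] += 1
        spare -= 1
        heapq.heappush(heap, (-gain(r), r))
    return threads


def exclusive_prefix_sum(counts):
    """Index of the first thread of each row within its group."""
    out, running = [], 0
    for c in counts:
        out.append(running)
        running += c
    return out


row_lengths = [8, 1, 1, 2, 9, 3, 3]
groups = partition_rows(row_lengths, block_size=8, desired_chunk=2)
first, count = groups[0]
threads = allocate_threads(row_lengths[first:first + count], block_size=8)
first_thread = exclusive_prefix_sum(threads)
```

In this example the first group holds rows 0 to 3; its eight-element row ends up with four threads, which brings the group's chunk size down from 8 to 2.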
4. SpMV Kernel Architecture and Execution Model
The SpMV operation in AR-CSR launches one CUDA block per group, with each thread of the block handling one chunk. Group metadata and thread-to-row mappings are accessed from shared memory. Each thread computes the partial sum of its chunk, using strided, coalesced accesses for good bandwidth utilization. Reductions across the threads of each group then combine the partial sums of rows spanned by multiple threads.
Key characteristics of the AR-CSR kernel include:
- Perfect load balancing: Each CUDA block handles exactly one chunk per thread; work is evenly distributed by adapting the chunk size and per-row thread count.
- Coalesced memory reads: By construction, threads within a warp access contiguous memory regions.
- Shared memory efficiency: Thread mappings and partial sums are kept in shared memory.
- Tunable parameters: Block size 7 (best at 8 on Tesla C2070) and desired chunk size 9; these allow optimization to matrix structure and hardware (Heller et al., 2012).
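A sequential emulation of the kernel's control flow may help to see how chunks, partial sums, and the per-row reduction fit together. It assumes the interleaved index `offset + j * T + t`, a padding sentinel of `-1`, and a dictionary-based group record; these are illustrative choices rather than the paper's data structures.

```python
PAD_COL = -1  # hypothetical padding sentinel in the column-index array


def arcsr_spmv(groups, values, columns, x, num_rows):
    """Emulate the block/thread structure of an AR-CSR-style SpMV on the CPU."""
    y = [0.0] * num_rows
    for g in groups:                      # one CUDA block per group
        T = len(g["thread_row"])          # threads (chunks) in this block
        off = g["offset"]                 # start of the group's data
        partial = [0.0] * T               # stands in for a shared-memory buffer
        for j in range(g["chunk"]):       # lockstep step j
            for t in range(T):            # threads read consecutive slots
                k = off + j * T + t
                if columns[k] != PAD_COL:
                    partial[t] += values[k] * x[columns[k]]
        for t in range(T):                # combine threads that share a row
            y[g["first_row"] + g["thread_row"][t]] += partial[t]
    return y


# Two hand-built groups for the matrix
#   [1 2 3 4]
#   [0 5 0 0]
#   [0 0 6 0]
#   [0 0 0 7]
values = [1.0, 3.0, 5.0, 6.0, 2.0, 4.0, 0.0, 0.0, 7.0]
columns = [0, 2, 1, 2, 1, 3, PAD_COL, PAD_COL, 3]
groups = [
    {"first_row": 0, "offset": 0, "chunk": 2, "thread_row": [0, 0, 1, 2]},
    {"first_row": 3, "offset": 8, "chunk": 1, "thread_row": [0]},
]
y = arcsr_spmv(groups, values, columns, [1.0, 2.0, 3.0, 4.0], num_rows=4)
```

On a GPU the final loop would be a shared-memory reduction among the threads mapped to the same row; the sequential accumulation here only shows which partial sums are combined.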
5. Performance Evaluation and Empirical Results
On a dataset of 0 matrices, AR-CSR was benchmarked against standard CPU-CSR and CUDA CUSPARSE implementations on a Tesla C2070 (144 GB/s, double precision), with the CPU baseline being CSR on an AMD Phenom II X6. For robust performance, 1 and 2 were used unless otherwise tuned.
- Peak observed performance: 3 GFLOPS on "Schenk_AFE" (structural problem) at 4; 5 GFLOPS for 6.
- Median observed performance: 7 GFLOPS with 8.
- Relative speed-ups:
  - Faster than CPU-CSR on 9 matrices
  - Faster than CUSPARSE on 0 matrices, with peak speed-ups up to 10× (e.g., "rajat23").
- Best-case matrices: High variance in row length (common in circuit simulation, e.g., "raj," "rajat," "IBM_EDA") and mixed patterns (very long rows among short ones).
- Worst-case/scenarios favoring alternatives: Near-constant row lengths ("mesh" matrices), smaller problems (2k rows), or regular sparsity patterns.
Summary of results:
| Format | # Matrices Faster | Median Speed-up | Peak GFLOPS |
|---|---|---|---|
| CPU-CSR | — | 3 | 4 |
| CUSPARSE | 5 | 6 | 7 |
| AR-CSR (8) | 9 | 0 | 1 |
[Table as in (Heller et al., 2012)]
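For reading GFLOPS figures such as these, it helps to recall the usual SpMV accounting of two floating-point operations (one multiply, one add) per stored non-zero. The sketch below also gives a back-of-envelope bandwidth ceiling. That ceiling is not a figure from the paper: it assumes 8-byte values and 4-byte column indices and ignores traffic for the input vector, the output vector, and format metadata.

```python
def spmv_gflops(nnz, seconds):
    """Achieved GFLOPS of one SpMV, counting 2 flops per non-zero."""
    return 2.0 * nnz / seconds / 1e9


def bandwidth_bound_gflops(bandwidth_gb_s, value_bytes=8, index_bytes=4):
    """Upper bound on SpMV GFLOPS if only matrix data had to be streamed."""
    bytes_per_nnz = value_bytes + index_bytes
    return 2.0 * bandwidth_gb_s / bytes_per_nnz


bound = bandwidth_bound_gflops(144.0)  # 144 GB/s, double precision
```

For the 144 GB/s quoted for the Tesla C2070, this simplified model caps double-precision SpMV at 24 GFLOPS, which gives a rough scale for judging measured throughput.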
6. Usage Guidelines, Limitations, and Practical Considerations
- Conversion and Storage: AR-CSR requires a one-time conversion from CSR, with computational cost 2. The storage overhead, including mapping structures and group metadata plus padding, is typically below 3.
- Suitability: Best applied when SpMV is invoked repeatedly, such as in iterative Krylov or multigrid methods. For highly variable row lengths, set 4; for more regular matrices, larger 5 (up to average nnz/row) may be optimal.
- Limitations: Not efficient for matrices with very regular sparsity or for very small matrices, where conversion overheads outweigh runtime benefits. The format must be rebuilt if the matrix is significantly altered.
- Parameter Tuning: Empirical selection of the block size and desired chunk size is recommended, as performance depends on the matrix profile and hardware characteristics.
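One inexpensive proxy to check while tuning is the share of padded slots in a group, since padding costs both storage and memory bandwidth. The helper below is a sketch under the assumption that every thread of a group is padded to the group's common chunk size; the function name is hypothetical, and padding is only one of several factors that affect runtime.

```python
def padding_fraction(row_lengths, threads_per_row):
    """Fraction of a group's stored slots that are padding."""
    # Chunk size is set by the most heavily loaded thread in the group.
    chunk = max(-(-n // t) for n, t in zip(row_lengths, threads_per_row))
    slots = chunk * sum(threads_per_row)
    return (slots - sum(row_lengths)) / slots


one_thread_per_row = padding_fraction([8, 1, 1, 2], [1, 1, 1, 1])
split_long_row = padding_fraction([8, 1, 1, 2], [4, 1, 1, 2])
```

For this group, one thread per row leaves 62.5% of the slots as padding, while splitting the long row over four threads and the two-element row over two threads reduces that to 25%.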
AR-CSR demonstrates that the trade-off of minor conversion and storage overhead for substantial runtime acceleration is highly advantageous in many real-world applications involving heterogeneous sparse matrices, routinely surpassing both classic CSR and tuned vendor-provided libraries like CUSPARSE for SpMV on GPUs (Heller et al., 2012).