BlockPerm-SJLT: GPU Sparse Sketching
- The paper introduces BlockPerm-SJLT, demonstrating a tunable trade-off between statistical embedding quality and GPU efficiency for large-scale applications.
- Methodologically, it leverages block-permutation with sparse SJLT blocks to achieve strong Oblivious Subspace Embedding guarantees and controlled error bounds.
- Empirically, the FlashSketch CUDA kernel delivers 4×–5× speedups on modern GPUs, optimizing the balance between sketching accuracy and runtime performance.
BlockPerm-SJLT is a family of sparse sketching matrices purposefully designed for efficient implementation on modern GPUs while maintaining strong approximation guarantees in the randomized numerical linear algebra (RandNLA) context. The construction generalizes the sparse Johnson-Lindenstrauss transform (SJLT) by introducing a tunable parameter that dictates the trade-off between statistical quality (mixing and embedding guarantees under the oblivious subspace embedding (OSE) framework) and systems efficiency (GPU locality and memory bandwidth usage). This design enables a high-performance CUDA kernel, FlashSketch, delivering improved speed–accuracy Pareto frontiers for large-scale applications (Dwaraknath et al., 2 Feb 2026).
1. Formal Construction of BlockPerm-SJLT
Given input dimension $n$ and sketch dimension $m$, select an integer $b$ dividing both, so that $n = b\,n_0$ and $m = b\,m_0$ for block sizes $n_0$ and $m_0$. The sketching matrix $S \in \mathbb{R}^{m \times n}$ is partitioned into $b \times b$ blocks, with each block $S_{ij} \in \mathbb{R}^{m_0 \times n_0}$. The block connectivity follows a parameterized family:
- Fix the block-permutation parameter, denoted here by $p \in \{1, \dots, b\}$ (Dwaraknath et al., 2 Feb 2026).
- Draw edge-disjoint permutations $\pi_1, \dots, \pi_p$ of $\{1, \dots, b\}$, with $\pi_k(i) \neq \pi_\ell(i)$ for $k \neq \ell$.
- For output block-row $i$, define a neighborhood $\mathcal{N}(i) = \{\pi_1(i), \dots, \pi_p(i)\}$.
- For each block $(i, j)$ with $j \in \mathcal{N}(i)$, place an independent small SJLT with exactly $s$ nonzeros per column (entries $\pm 1/\sqrt{s}$ at random row locations).
- Block entries elsewhere are set to zero, and each nonzero block is further scaled by $1/\sqrt{p}$.
- Overall, every column of $S$ has exactly $sp$ nonzeros of magnitude $1/\sqrt{sp}$.
This construction interpolates between the block-diagonal ($p = 1$) and fully-mixed ($p = b$) extremes.
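A minimal dense NumPy reference may help make the block layout concrete. The function name, the use of cyclic shifts as the edge-disjoint permutation family, and the parameter values are illustrative assumptions of this sketch, not the paper's implementation:

```python
import numpy as np

def blockperm_sjlt(n, m, b, p, s, rng):
    """Dense reference construction of a BlockPerm-SJLT sketch S (m x n).

    b must divide n and m; p is the block-permutation parameter
    (1 <= p <= b); s is the per-block SJLT sparsity. Cyclic shifts serve
    as the edge-disjoint permutation family in this sketch.
    """
    n0, m0 = n // b, m // b              # block sizes
    S = np.zeros((m, n))
    for i in range(b):                   # output block-row
        for k in range(p):
            j = (i + k) % b              # neighbor pi_k(i), edge-disjoint in k
            for col in range(n0):
                rows = rng.choice(m0, size=s, replace=False)
                signs = rng.choice([-1.0, 1.0], size=s)
                # SJLT entries +/- 1/sqrt(s), block scaled by 1/sqrt(p)
                S[i * m0 + rows, j * n0 + col] = signs / np.sqrt(s * p)
    return S

rng = np.random.default_rng(0)
S = blockperm_sjlt(n=64, m=32, b=4, p=2, s=2, rng=rng)
# Every column has exactly s*p nonzeros of magnitude 1/sqrt(s*p).
nnz_per_col = (S != 0).sum(axis=0)
assert (nnz_per_col == 2 * 2).all()
assert np.allclose(np.abs(S[S != 0]), 1 / np.sqrt(2 * 2))
```

With $p = 1$ the loop populates only the diagonal blocks $S_{ii}$, recovering the block-diagonal extreme; with $p = b$ every block is populated, recovering the fully-mixed extreme.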
2. Theoretical Guarantees: Oblivious Subspace Embedding (OSE)
BlockPerm-SJLT satisfies an OSE guarantee governed by a block-structure-induced "neighborhood coherence" $\nu$. For any fixed $U \in \mathbb{R}^{n \times d}$ with orthonormal columns, define
$$\nu(U) \;=\; \max_{1 \le i \le b} \bigl\| U_{\mathcal{N}(i)} \bigr\|_2^2,$$
where $U_{\mathcal{N}(i)}$ stacks the row-blocks $U_j$ for $j \in \mathcal{N}(i)$. With $\nu = \nu(U)$, the OSE theorem asserts the existence of absolute constants $c_1, c_2 > 0$ such that if the sketch dimension satisfies a condition of the form
$$ m \;\ge\; c_1\,\frac{\nu\,\bigl(d + \log(c_2/\delta)\bigr)}{\varepsilon^2}, $$
then with probability at least $1 - \delta$, the embedding property
$$ (1 - \varepsilon)\,\|Ux\|_2^2 \;\le\; \|SUx\|_2^2 \;\le\; (1 + \varepsilon)\,\|Ux\|_2^2 $$
holds for all $x \in \mathbb{R}^d$, i.e., $S$ is an $(\varepsilon, \delta)$-OSE for the column space of $U$ (Dwaraknath et al., 2 Feb 2026).
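The embedding property can be checked numerically: the worst-case distortion over the column space of $U$ equals $\max_i |\sigma_i(SU)^2 - 1|$ over the singular values of $SU$. The sketch below uses a compact version of the construction; the cyclic-shift permutation family and all dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def blockperm_sjlt(n, m, b, p, s):
    # Compact dense reference (illustrative; cyclic shifts as the
    # edge-disjoint permutation family).
    n0, m0 = n // b, m // b
    S = np.zeros((m, n))
    for i in range(b):                      # output block-row
        for k in range(p):
            j = (i + k) % b                 # neighbor pi_k(i)
            for col in range(n0):
                rows = rng.choice(m0, size=s, replace=False)
                signs = rng.choice([-1.0, 1.0], size=s)
                S[i * m0 + rows, j * n0 + col] = signs / np.sqrt(s * p)
    return S

n, d = 2048, 10
U, _ = np.linalg.qr(rng.standard_normal((n, d)))   # orthonormal columns
S = blockperm_sjlt(n, m=512, b=8, p=4, s=4)
sv = np.linalg.svd(S @ U, compute_uv=False)
eps = np.abs(sv**2 - 1).max()   # worst-case distortion over the subspace
print(f"subspace distortion eps = {eps:.3f}")
```

For an incoherent (here, uniformly random) subspace, the measured distortion is small even at modest sketch dimension, consistent with the OSE guarantee.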
3. Impact of Block-Permutation Parameter
The trade-off between sketch quality and GPU locality is governed by $p$:
- Large $p$: Promotes thorough mixing (lower $\nu$), making it possible to achieve the OSE for smaller $m$, but results in more global memory accesses as each output block aggregates multiple remote input blocks.
- Small $p$: Yields a highly local sketch (block-diagonal when $p = 1$), maximizing memory locality but requiring larger $m$ to attain the same embedding quality due to reduced inter-block mixing.
The mathematical derivation leverages the energy identity
$$ \|Sx\|_2^2 \;=\; \sum_{i=1}^{b} \Bigl\| \sum_{j \in \mathcal{N}(i)} S_{ij}\, x_j \Bigr\|_2^2 $$
and analyzes tail bounds inherited from the constituent SJLT blocks to relate $p$, $s$, and $m$ to distortion and concentration. Thus $p$ determines both the mixing among blocks (statistical quality, via $\nu$) and the per-column sparsity $sp$ (systems cost).
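The trade-off is easiest to see on a deliberately coherent input: a subspace concentrated in a single input block is the hard case for small $p$, since only one output block observes it when $p = 1$. A small NumPy experiment (illustrative parameters; cyclic shifts as the permutation family):

```python
import numpy as np

rng = np.random.default_rng(2)

def blockperm_sjlt(n, m, b, p, s):
    # Compact dense reference (illustrative; cyclic shifts as the
    # edge-disjoint permutation family).
    n0, m0 = n // b, m // b
    S = np.zeros((m, n))
    for i in range(b):
        for k in range(p):
            j = (i + k) % b
            for col in range(n0):
                rows = rng.choice(m0, size=s, replace=False)
                signs = rng.choice([-1.0, 1.0], size=s)
                S[i * m0 + rows, j * n0 + col] = signs / np.sqrt(s * p)
    return S

# Coherent subspace: all energy in the first input block -- the hard case
# for small p, since only output block 0 sees it when p = 1.
n, m, b, d = 1024, 256, 8, 8
U = np.zeros((n, d))
U[: n // b, :], _ = np.linalg.qr(rng.standard_normal((n // b, d)))

def distortion(p):
    sv = np.linalg.svd(blockperm_sjlt(n, m, b, p, s=2) @ U, compute_uv=False)
    return np.abs(sv**2 - 1).max()

eps_local = np.median([distortion(1) for _ in range(5)])  # block-diagonal
eps_mixed = np.median([distortion(b) for _ in range(5)])  # fully mixed
print(f"p=1: eps~{eps_local:.2f}   p=b: eps~{eps_mixed:.2f}")
```

At $p = 1$ the coherent subspace is sketched by a single $m_0$-row block, so the effective sketch dimension shrinks from $m$ to $m_0$ and the distortion grows; at $p = b$ every output block contributes and the distortion drops, as the OSE analysis via $\nu$ predicts.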
4. Efficient GPU Implementation via FlashSketch
FlashSketch is a CUDA kernel tailored to BlockPerm-SJLT's structural regularity:
- Tiling: Operates on a 2D grid of thread-blocks indexed by output block-row $i$ and input tile, leveraging shared memory for fast sub-matrix operations.
- On-the-fly computation: Computes the neighbors $\pi_k(i)$ using lightweight affine-mod-$b$ recurrences and injects sparse updates via fast 32-bit hashing, avoiding the storage of an explicit $S$.
- Atomic elimination: All atomic add operations are confined to shared memory within thread-blocks, avoiding expensive global atomics; final global writes are coalesced.
- Bandwidth and arithmetic complexity: Each input element is read $p$ times and touched by $sp$ multiply-adds, so memory traffic scales linearly in $p$ and arithmetic in $sp$; overall wall-time is governed by GPU occupancy and shared-memory bandwidth.
- Parameter tuning: $p$, $s$, and GPU tile sizes are tuned to optimize system throughput and accuracy (Dwaraknath et al., 2 Feb 2026).
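The on-the-fly indexing idea can be sketched in a few lines: neighbors come from closed-form affine maps, and nonzero placements are regenerated from a hash rather than stored. The affine family, the particular 32-bit mixer, and the tolerance of row collisions below are assumptions of this sketch, not FlashSketch's exact recurrences:

```python
def pi(k, i, b, a=5):
    """Affine neighbor map pi_k(i) = (a*i + k) mod b. Each pi_k is a
    permutation of {0..b-1} when gcd(a, b) = 1, and the family is
    edge-disjoint: pi_k(i) != pi_l(i) whenever k != l in {0..b-1}."""
    return (a * i + k) % b

def mix32(x):
    """Cheap 32-bit integer mixer (xorshift-multiply); a stand-in for the
    kernel's fast hashing, not the paper's exact hash."""
    x &= 0xFFFFFFFF
    x ^= x >> 16
    x = (x * 0x45D9F3B) & 0xFFFFFFFF
    x ^= x >> 16
    return x

def regen_nonzeros(i, j, col, s, m0):
    """(row, sign) pairs for column `col` of block (i, j), recomputed from
    the hash instead of stored. Unlike the exact construction, the s
    replicas may collide on a row -- a simplification in this sketch."""
    pairs = []
    for r in range(s):
        h = mix32((i << 20) ^ (j << 10) ^ (col * s + r))
        pairs.append((h % m0, -1.0 if h & 1 else 1.0))
    return pairs

b, p, s, m0 = 8, 4, 4, 64
# Determinism: recomputation replaces storage of an explicit S.
assert regen_nonzeros(3, pi(2, 3, b), 17, s, m0) == \
       regen_nonzeros(3, pi(2, 3, b), 17, s, m0)
# The affine family is edge-disjoint and each map is a permutation.
assert all(pi(k, i, b) != pi(l, i, b)
           for i in range(b) for k in range(p) for l in range(k))
assert all(sorted(pi(k, i, b) for i in range(b)) == list(range(b))
           for k in range(p))
```

Because every thread can recompute its indices deterministically, no random state or explicit sparse matrix ever touches global memory; only the input tiles and the accumulated output do.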
5. Empirical Behavior and Pareto-Optimality
BlockPerm-SJLT, via FlashSketch, demonstrates superior empirical performance on NVIDIA RTX 4090 and A6000 GPUs for key linear algebra and ML workloads:
- Benchmarks: Gram-matrix approximation, OSE spectral error, sketch-and-ridge regression, sketch-and-solve least squares, and GraSS ML data-attribution pipeline.
- Observations:
  - Adjusting $p$ traces a continuous Pareto frontier between sketching accuracy and runtime.
  - At moderate error targets, FlashSketch achieves 4×–5× speedups over previous GPU SJLT kernels, along with a favorable geometric-mean speedup across the full benchmark suite.
  - In several regimes, FlashSketch matches or outperforms dense Gaussian projections (cuBLAS) and FHT-based SRHT in speed and, at times, accuracy.
  - In GraSS pipelines, projection time per example is substantially reduced with unchanged downstream LDS metrics (Dwaraknath et al., 2 Feb 2026).
6. Comparative Summary and Scope of BlockPerm-SJLT
BlockPerm-SJLT unifies the design space for GPU-friendly sparse sketches with a single, interpretable parameter $p$ that interpolates between full locality and complete mixing. This enables both theoretical OSE guarantees, parameterized by $p$ through the neighborhood coherence $\nu$, and high-efficiency systems implementation. The FlashSketch kernel design capitalizes on this regularity for conflict-free parallelism, pushing the speed–accuracy boundary across key RandNLA tasks and scalable ML pipelines, as measured on state-of-the-art GPU hardware.
| Parameter | Effect on System | Effect on Statistics |
|---|---|---|
| Small $p$ | High locality, fast | Poor mixing, more distortion |
| Large $p$ | Higher bandwidth use | Excellent mixing, lower distortion |
The BlockPerm-SJLT framework provides a rigorous, tunable approach to large-scale sketching for contemporary GPU architectures while maintaining the theoretical embedding properties central to randomized linear algebra (Dwaraknath et al., 2 Feb 2026).