BlockPerm-SJLT: GPU Sparse Sketching
- The paper introduces BlockPerm-SJLT, demonstrating a tunable trade-off between statistical embedding quality and GPU efficiency for large-scale applications.
- Methodologically, it leverages block-permutation with sparse SJLT blocks to achieve strong Oblivious Subspace Embedding guarantees and controlled error bounds.
- Empirically, the FlashSketch CUDA kernel delivers 4×–5× speedups on modern GPUs, optimizing the balance between sketching accuracy and runtime performance.
BlockPerm-SJLT is a family of sparse sketching matrices purposefully designed for efficient implementation on modern GPUs while maintaining strong approximation guarantees in the randomized numerical linear algebra (RandNLA) context. The construction generalizes the sparse Johnson-Lindenstrauss transform (SJLT) by introducing a tunable parameter that dictates the trade-off between statistical quality (mixing and embedding guarantees under the oblivious subspace embedding (OSE) framework) and systems efficiency (GPU locality and memory bandwidth usage). This design enables a high-performance CUDA kernel, FlashSketch, delivering improved speed–accuracy Pareto frontiers for large-scale applications (Dwaraknath et al., 2 Feb 2026).
1. Formal Construction of BlockPerm-SJLT
Given input dimension $n$ and sketch dimension $m$, select an integer $b$ dividing both, so that $n = b\,n_0$ and $m = b\,m_0$ for block sizes $n_0$ and $m_0$. The sketching matrix $S \in \mathbb{R}^{m \times n}$ is partitioned into $b \times b$ blocks, with each block $S_{ij} \in \mathbb{R}^{m_0 \times n_0}$. The block connectivity follows a parameterized family:
- Fix the block-permutation parameter, denoted here by $p \in \{1, \dots, b\}$ (Dwaraknath et al., 2 Feb 2026).
- Draw edge-disjoint permutations $\pi_1, \dots, \pi_p$ of $\{1, \dots, b\}$, with $\pi_k(i) \neq \pi_\ell(i)$ for $k \neq \ell$.
- For output block-row $i$, define a neighborhood $\mathcal{N}(i) = \{\pi_1(i), \dots, \pi_p(i)\}$.
- For each block $(i, j)$ with $j \in \mathcal{N}(i)$, place an independent small SJLT with exactly $s$ nonzeros per column (entries $\pm 1/\sqrt{s}$ at random row locations).
- Block entries elsewhere are set to zero, and each nonzero block is further scaled by $1/\sqrt{p}$.
- Overall, every column of $S$ has exactly $sp$ nonzeros of magnitude $1/\sqrt{sp}$.
This construction interpolates between the block-diagonal ($p = 1$) and fully-mixed ($p = b$) extremes.
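A minimal dense NumPy reference may help make the block layout concrete. The function name, the use of cyclic shifts as the edge-disjoint permutation family, and the parameter values are illustrative assumptions of this sketch, not the paper's implementation:

```python
import numpy as np

def blockperm_sjlt(n, m, b, p, s, rng):
    """Dense reference construction of a BlockPerm-SJLT sketch S (m x n).

    b must divide n and m; p is the block-permutation parameter
    (1 <= p <= b); s is the per-block SJLT sparsity. Cyclic shifts serve
    as the edge-disjoint permutation family in this sketch.
    """
    n0, m0 = n // b, m // b              # block sizes
    S = np.zeros((m, n))
    for i in range(b):                   # output block-row
        for k in range(p):
            j = (i + k) % b              # neighbor pi_k(i), edge-disjoint in k
            for col in range(n0):
                rows = rng.choice(m0, size=s, replace=False)
                signs = rng.choice([-1.0, 1.0], size=s)
                # SJLT entries +/- 1/sqrt(s), block scaled by 1/sqrt(p)
                S[i * m0 + rows, j * n0 + col] = signs / np.sqrt(s * p)
    return S

rng = np.random.default_rng(0)
S = blockperm_sjlt(n=64, m=32, b=4, p=2, s=2, rng=rng)
# Every column has exactly s*p nonzeros of magnitude 1/sqrt(s*p).
nnz_per_col = (S != 0).sum(axis=0)
assert (nnz_per_col == 2 * 2).all()
assert np.allclose(np.abs(S[S != 0]), 1 / np.sqrt(2 * 2))
```

With $p = 1$ the loop populates only the diagonal blocks $S_{ii}$, recovering the block-diagonal extreme; with $p = b$ every block is populated, recovering the fully-mixed extreme.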
2. Theoretical Guarantees: Oblivious Subspace Embedding (OSE)
BlockPerm-SJLT satisfies an OSE guarantee governed by a block-structure-induced "neighborhood coherence" $\nu$. For any fixed $U \in \mathbb{R}^{n \times d}$ with orthonormal columns, define
$$\nu(U) \;=\; \max_{1 \le i \le b} \bigl\| U_{\mathcal{N}(i)} \bigr\|_2^2,$$
where $U_{\mathcal{N}(i)}$ stacks the row-blocks $U_j$ for $j \in \mathcal{N}(i)$. With $\nu = \nu(U)$, the OSE theorem asserts the existence of absolute constants $c_1, c_2 > 0$ such that if the sketch dimension satisfies a condition of the form
$$ m \;\ge\; c_1\,\frac{\nu\,\bigl(d + \log(c_2/\delta)\bigr)}{\varepsilon^2}, $$
then with probability at least $1 - \delta$, the embedding property
$$ (1 - \varepsilon)\,\|Ux\|_2^2 \;\le\; \|SUx\|_2^2 \;\le\; (1 + \varepsilon)\,\|Ux\|_2^2 $$
holds for all $x \in \mathbb{R}^d$, i.e., $S$ is an $(\varepsilon, \delta)$-OSE for the column space of $U$ (Dwaraknath et al., 2 Feb 2026).
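The embedding property can be checked numerically: the worst-case distortion over the column space of $U$ equals $\max_i |\sigma_i(SU)^2 - 1|$ over the singular values of $SU$. The sketch below uses a compact version of the construction; the cyclic-shift permutation family and all dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def blockperm_sjlt(n, m, b, p, s):
    # Compact dense reference (illustrative; cyclic shifts as the
    # edge-disjoint permutation family).
    n0, m0 = n // b, m // b
    S = np.zeros((m, n))
    for i in range(b):                      # output block-row
        for k in range(p):
            j = (i + k) % b                 # neighbor pi_k(i)
            for col in range(n0):
                rows = rng.choice(m0, size=s, replace=False)
                signs = rng.choice([-1.0, 1.0], size=s)
                S[i * m0 + rows, j * n0 + col] = signs / np.sqrt(s * p)
    return S

n, d = 2048, 10
U, _ = np.linalg.qr(rng.standard_normal((n, d)))   # orthonormal columns
S = blockperm_sjlt(n, m=512, b=8, p=4, s=4)
sv = np.linalg.svd(S @ U, compute_uv=False)
eps = np.abs(sv**2 - 1).max()   # worst-case distortion over the subspace
print(f"subspace distortion eps = {eps:.3f}")
```

For an incoherent (here, uniformly random) subspace, the measured distortion is small even at modest sketch dimension, consistent with the OSE guarantee.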
3. Impact of Block-Permutation Parameter
The trade-off between sketch quality and GPU locality is governed by $p$:
- Large $p$: Promotes thorough mixing (lower $\nu$), making it possible to achieve the OSE for smaller $m$, but results in more global memory accesses as each output block aggregates multiple remote input blocks.
- Small $p$: Yields a highly local sketch (block-diagonal when $p = 1$), maximizing memory locality but requiring larger $m$ to attain the same embedding quality due to reduced inter-block mixing.
The mathematical derivation leverages the energy identity
$$ \|Sx\|_2^2 \;=\; \sum_{i=1}^{b} \Bigl\| \sum_{j \in \mathcal{N}(i)} S_{ij}\, x_j \Bigr\|_2^2 $$
and analyzes tail bounds inherited from the constituent SJLT blocks to relate $p$, $s$, and $m$ to distortion and concentration. Thus $p$ determines both the mixing among blocks (statistical quality, via $\nu$) and the per-column sparsity $sp$ (systems cost).
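The trade-off is easiest to see on a deliberately coherent input: a subspace concentrated in a single input block is the hard case for small $p$, since only one output block observes it when $p = 1$. A small NumPy experiment (illustrative parameters; cyclic shifts as the permutation family):

```python
import numpy as np

rng = np.random.default_rng(2)

def blockperm_sjlt(n, m, b, p, s):
    # Compact dense reference (illustrative; cyclic shifts as the
    # edge-disjoint permutation family).
    n0, m0 = n // b, m // b
    S = np.zeros((m, n))
    for i in range(b):
        for k in range(p):
            j = (i + k) % b
            for col in range(n0):
                rows = rng.choice(m0, size=s, replace=False)
                signs = rng.choice([-1.0, 1.0], size=s)
                S[i * m0 + rows, j * n0 + col] = signs / np.sqrt(s * p)
    return S

# Coherent subspace: all energy in the first input block -- the hard case
# for small p, since only output block 0 sees it when p = 1.
n, m, b, d = 1024, 256, 8, 8
U = np.zeros((n, d))
U[: n // b, :], _ = np.linalg.qr(rng.standard_normal((n // b, d)))

def distortion(p):
    sv = np.linalg.svd(blockperm_sjlt(n, m, b, p, s=2) @ U, compute_uv=False)
    return np.abs(sv**2 - 1).max()

eps_local = np.median([distortion(1) for _ in range(5)])  # block-diagonal
eps_mixed = np.median([distortion(b) for _ in range(5)])  # fully mixed
print(f"p=1: eps~{eps_local:.2f}   p=b: eps~{eps_mixed:.2f}")
```

At $p = 1$ the coherent subspace is sketched by a single $m_0$-row block, so the effective sketch dimension shrinks from $m$ to $m_0$ and the distortion grows; at $p = b$ every output block contributes and the distortion drops, as the OSE analysis via $\nu$ predicts.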
4. Efficient GPU Implementation via FlashSketch
FlashSketch is a CUDA kernel tailored to BlockPerm-SJLT's structural regularity:
- Tiling: Operates on a 2D grid of thread-blocks indexed by output block-row $i$ and input tile, leveraging shared memory for fast sub-matrix operations.
- On-the-fly computation: Computes the neighbors $\pi_k(i)$ using lightweight affine-mod-$b$ recurrences and injects sparse updates via fast 32-bit hashing, avoiding the storage of an explicit $S$.
- Atomic elimination: All atomic add operations are confined to shared memory within thread-blocks, avoiding expensive global atomics; final global writes are coalesced.
- Bandwidth and arithmetic complexity: Each input element is read $p$ times and touched by $sp$ multiply-adds, so memory traffic scales linearly in $p$ and arithmetic in $sp$; overall wall-time is governed by GPU occupancy and shared-memory bandwidth.
- Parameter tuning: $p$, $s$, and GPU tile sizes are tuned to optimize system throughput and accuracy (Dwaraknath et al., 2 Feb 2026).
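The on-the-fly indexing idea can be sketched in a few lines: neighbors come from closed-form affine maps, and nonzero placements are regenerated from a hash rather than stored. The affine family, the particular 32-bit mixer, and the tolerance of row collisions below are assumptions of this sketch, not FlashSketch's exact recurrences:

```python
def pi(k, i, b, a=5):
    """Affine neighbor map pi_k(i) = (a*i + k) mod b. Each pi_k is a
    permutation of {0..b-1} when gcd(a, b) = 1, and the family is
    edge-disjoint: pi_k(i) != pi_l(i) whenever k != l in {0..b-1}."""
    return (a * i + k) % b

def mix32(x):
    """Cheap 32-bit integer mixer (xorshift-multiply); a stand-in for the
    kernel's fast hashing, not the paper's exact hash."""
    x &= 0xFFFFFFFF
    x ^= x >> 16
    x = (x * 0x45D9F3B) & 0xFFFFFFFF
    x ^= x >> 16
    return x

def regen_nonzeros(i, j, col, s, m0):
    """(row, sign) pairs for column `col` of block (i, j), recomputed from
    the hash instead of stored. Unlike the exact construction, the s
    replicas may collide on a row -- a simplification in this sketch."""
    pairs = []
    for r in range(s):
        h = mix32((i << 20) ^ (j << 10) ^ (col * s + r))
        pairs.append((h % m0, -1.0 if h & 1 else 1.0))
    return pairs

b, p, s, m0 = 8, 4, 4, 64
# Determinism: recomputation replaces storage of an explicit S.
assert regen_nonzeros(3, pi(2, 3, b), 17, s, m0) == \
       regen_nonzeros(3, pi(2, 3, b), 17, s, m0)
# The affine family is edge-disjoint and each map is a permutation.
assert all(pi(k, i, b) != pi(l, i, b)
           for i in range(b) for k in range(p) for l in range(k))
assert all(sorted(pi(k, i, b) for i in range(b)) == list(range(b))
           for k in range(p))
```

Because every thread can recompute its indices deterministically, no random state or explicit sparse matrix ever touches global memory; only the input tiles and the accumulated output do.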
5. Empirical Behavior and Pareto-Optimality
BlockPerm-SJLT, via FlashSketch, demonstrates superior empirical performance on NVIDIA RTX 4090 and A6000 GPUs for key linear algebra and ML workloads:
- Benchmarks: Gram-matrix approximation, OSE spectral error, sketch-and-ridge regression, sketch-and-solve least squares, and GraSS ML data-attribution pipeline.
- Observations:
  - Adjusting $p$ traces a continuous Pareto frontier between sketching accuracy and runtime.
  - At moderate error targets, FlashSketch achieves 4×–5× speedups over previous GPU SJLT kernels, along with a favorable geometric-mean speedup across the full benchmark suite.
  - In several regimes, FlashSketch matches or outperforms dense Gaussian projections (cuBLAS) and FHT-based SRHT in speed and, at times, accuracy.
  - In GraSS pipelines, projection time per example is substantially reduced with unchanged downstream LDS metrics (Dwaraknath et al., 2 Feb 2026).
6. Comparative Summary and Scope of BlockPerm-SJLT
BlockPerm-SJLT unifies the design space for GPU-friendly sparse sketches with a single, interpretable parameter $p$ that interpolates between full locality and complete mixing. This enables both theoretical OSE guarantees, parameterized by $p$ through the neighborhood coherence $\nu$, and high-efficiency systems implementation. The FlashSketch kernel design capitalizes on this regularity for conflict-free parallelism, pushing the speed–accuracy boundary across key RandNLA tasks and scalable ML pipelines, as measured on state-of-the-art GPU hardware.
| Parameter | Effect on System | Effect on Statistics |
|---|---|---|
| Small $p$ | High locality, fast | Poor mixing, more distortion |
| Large $p$ | Higher bandwidth use | Excellent mixing, lower distortion |
The BlockPerm-SJLT framework provides a rigorous, tunable approach to large-scale sketching for contemporary GPU architectures while maintaining the theoretical embedding properties central to randomized linear algebra (Dwaraknath et al., 2 Feb 2026).