
BlockPerm-SJLT: GPU Sparse Sketching

Updated 9 February 2026
  • The paper introduces BlockPerm-SJLT, demonstrating a tunable trade-off between statistical embedding quality and GPU efficiency for large-scale applications.
  • Methodologically, it leverages block-permutation with sparse SJLT blocks to achieve strong Oblivious Subspace Embedding guarantees and controlled error bounds.
  • Empirically, the FlashSketch CUDA kernel delivers 4×–5× speedups on modern GPUs, optimizing the balance between sketching accuracy and runtime performance.

BlockPerm-SJLT is a family of sparse sketching matrices purposefully designed for efficient implementation on modern GPUs while maintaining strong approximation guarantees in the randomized numerical linear algebra (RandNLA) context. The construction generalizes the sparse Johnson-Lindenstrauss transform (SJLT) by introducing a tunable parameter that dictates the trade-off between statistical quality (mixing and embedding guarantees under the oblivious subspace embedding (OSE) framework) and systems efficiency (GPU locality and memory bandwidth usage). This design enables a high-performance CUDA kernel, FlashSketch, delivering improved speed–accuracy Pareto frontiers for large-scale applications (Dwaraknath et al., 2 Feb 2026).

1. Formal Construction of BlockPerm-SJLT

Given input dimension $d$ and sketch dimension $k$, select an integer $M$ dividing both, so that $d = M B_c$ and $k = M B_r$ for block sizes $B_c$ and $B_r$. The sketching matrix $S \in \mathbb{R}^{k \times d}$ is partitioned into $M \times M$ blocks, with each block $S_{g,h} \in \mathbb{R}^{B_r \times B_c}$. The block connectivity follows a parameterized family:

  • Fix $b \geq 1$, the block-permutation parameter, denoted $\kappa$ in (Dwaraknath et al., 2 Feb 2026).
  • Draw $b$ edge-disjoint permutations $\{\pi_\ell : [M] \to [M]\}_{\ell=1}^{b}$ with $\pi_\ell(g) \neq \pi_{\ell'}(g)$ for $\ell \neq \ell'$.
  • For output block-row $g \in [M]$, define a neighborhood $\mathcal{N}(g) = \{\pi_1(g), \dots, \pi_b(g)\}$.
  • For each block $(g,h)$ with $h \in \mathcal{N}(g)$, place an independent small SJLT $\Phi_{g,h} \in \mathbb{R}^{B_r \times B_c}$ with exactly $s$ nonzeros per column ($\pm 1/\sqrt{s}$ at random row locations).
  • All other blocks are zero, and each nonzero block is further scaled by $1/\sqrt{b}$.
  • Overall, every column of $S$ has exactly $b\,s$ nonzeros of magnitude $1/\sqrt{b\,s}$.

This construction interpolates between block-diagonal ($b = 1$) and fully-mixed ($b = M$) architectures.
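
As a concrete illustration, the construction above can be sketched in NumPy. The cyclic-shift permutations, seeds, and dimensions below are illustrative choices, not the paper's (any edge-disjoint permutation family works), and the matrix is materialized densely only for clarity:

```python
import numpy as np

def blockperm_sjlt(d, k, M, b, s, rng):
    """Build a dense copy of a BlockPerm-SJLT matrix S in R^{k x d}.

    d and k must be divisible by M. The b edge-disjoint permutations are
    taken as cyclic shifts pi_l(g) = (g + offset_l) mod M, which satisfy
    pi_l(g) != pi_l'(g) for all g whenever the offsets are distinct.
    """
    assert d % M == 0 and k % M == 0 and 1 <= b <= M
    Bc, Br = d // M, k // M
    offsets = rng.choice(M, size=b, replace=False)  # distinct shifts
    S = np.zeros((k, d))
    for g in range(M):                    # output block-row
        for off in offsets:
            h = (g + off) % M             # input block h = pi_l(g) in N(g)
            # Small SJLT block: s nonzeros per column, values +-1/sqrt(s)
            # at random row positions, then scaled by 1/sqrt(b).
            Phi = np.zeros((Br, Bc))
            for c in range(Bc):
                rows = rng.choice(Br, size=s, replace=False)
                Phi[rows, c] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
            S[g*Br:(g+1)*Br, h*Bc:(h+1)*Bc] = Phi / np.sqrt(b)
    return S

rng = np.random.default_rng(0)
S = blockperm_sjlt(d=64, k=32, M=8, b=2, s=2, rng=rng)
# Every column has exactly b*s nonzeros, each of magnitude 1/sqrt(b*s).
nnz_per_col = (S != 0).sum(axis=0)
assert (nnz_per_col == 2 * 2).all()
assert np.allclose(np.abs(S[S != 0]), 1 / np.sqrt(2 * 2))
```

Setting `b=1` makes `S` block-diagonal; setting `b=M` connects every output block-row to every input block, recovering the fully-mixed extreme.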

2. Theoretical Guarantees: Oblivious Subspace Embedding (OSE)

BlockPerm-SJLT satisfies an OSE guarantee governed by block-structure-induced "neighborhood coherence" $\mu_{\mathrm{nbr}}$. For any fixed $U \in \mathbb{R}^{d \times r}$ with orthonormal columns, define

$$\mu_{\mathrm{nbr}}(U; \pi) = \frac{M}{b} \max_{g \in [M]} \left\| U_{\mathcal{N}(g)} \right\|_2^2$$

where $U_{\mathcal{N}(g)}$ stacks the row-blocks $U^{(h)}$ for $h \in \mathcal{N}(g)$. With $t = r + \log(1/\delta)$, the OSE theorem asserts the existence of absolute constants $C, c > 0$ such that if

$$k \geq C\,\frac{\mu_{\mathrm{nbr}}(U; \pi)}{\varepsilon^2}\,t \qquad \text{and} \qquad b\,s \geq C\,\frac{t}{\varepsilon},$$

then with probability at least $1 - \delta$, the embedding property

$$\| U^\top S^\top S U - I_r \|_2 \leq \varepsilon$$

holds, i.e., $S$ is an $(\varepsilon, \delta)$-OSE for the column space of $U$ (Dwaraknath et al., 2 Feb 2026).
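
The embedding property is easy to check numerically for a single random subspace. The sketch below uses the plain-SJLT corner of the family ($M = 1$, $b = 1$) with illustrative dimensions; the threshold in the final assertion is a loose sanity bound for one random draw, not the theorem's constant:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r, s = 2048, 256, 5, 8

# Plain SJLT (the M = 1, b = 1 corner of the family): s nonzeros per
# column, values +-1/sqrt(s) at random row positions.
S = np.zeros((k, d))
for c in range(d):
    rows = rng.choice(k, size=s, replace=False)
    S[rows, c] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)

# Random r-dimensional subspace with orthonormal basis U.
U, _ = np.linalg.qr(rng.standard_normal((d, r)))

# Spectral embedding error ||U^T S^T S U - I_r||_2; small values mean
# S nearly preserves the geometry of range(U).
err = np.linalg.norm(U.T @ S.T @ S @ U - np.eye(r), 2)
print(f"spectral embedding error: {err:.3f}")
assert err < 0.75  # loose sanity check for a single random draw
```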

3. Impact of Block-Permutation Parameter $b$

The trade-off between sketch quality and GPU locality is governed by $b$:

  • Large $b$: Promotes thorough mixing (lower $\mu_{\mathrm{nbr}}$), making it possible to achieve the OSE for smaller $k$, but results in more global memory accesses as each output block aggregates multiple remote input blocks.
  • Small $b$: Yields a highly local sketch (block-diagonal when $b = 1$), maximizing memory locality but requiring larger $k$ to attain the same embedding quality due to reduced inter-block mixing.

The mathematical derivation leverages the energy identity

$$\sum_{g=1}^{M} \| x_{\mathcal{N}(g)} \|_2^2 = b\, \| x \|_2^2$$

and analyzes tail bounds inherited from the constituent SJLT blocks to relate $k$, $b$, and $s$ to distortion and concentration. Thus $b$ determines both the "mixing" among blocks (statistical quality, via $\mu_{\mathrm{nbr}}$) and the per-column sparsity (systems cost).
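
The energy identity can be verified numerically; the snippet below again uses cyclic shifts as an illustrative edge-disjoint permutation family:

```python
import numpy as np

rng = np.random.default_rng(2)
M, Bc, b = 8, 4, 3
d = M * Bc

# b edge-disjoint permutations via distinct cyclic shifts.
offsets = rng.choice(M, size=b, replace=False)
x = rng.standard_normal(d)
blocks = x.reshape(M, Bc)          # x split into M input blocks

# sum_g ||x_{N(g)}||^2 with N(g) = {pi_1(g), ..., pi_b(g)}: because each
# permutation is a bijection, every input block appears in exactly b
# neighborhoods, so the sum counts ||x||^2 exactly b times.
total = sum(
    np.sum(blocks[(g + off) % M] ** 2)
    for g in range(M) for off in offsets
)
assert np.isclose(total, b * np.dot(x, x))
```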

4. Efficient GPU Implementation via FlashSketch

FlashSketch is a CUDA kernel tailored to BlockPerm-SJLT's structural regularity:

  • Tiling: Operates on a 2D grid of thread-blocks indexed by output row-block gg and input tile jj, leveraging shared memory for fast sub-matrix operations.
  • On-the-fly computation: Computes $\mathcal{N}(g)$ using lightweight affine-mod-$M$ recurrences and injects sparse updates via fast 32-bit hashing, avoiding the storage of an explicit $S$.
  • Atomic elimination: All atomic add operations are confined to shared memory within thread-blocks, avoiding expensive global atomics; final global writes are coalesced.
  • Bandwidth and arithmetic complexity: Each input element is read $b$ times; the memory traffic scales as $O(b\,d\,n)$, arithmetic as $O(s)$ per element, and overall wall-time is governed by GPU occupancy and shared-memory bandwidth.
  • Parameter tuning: $b$, $s$, and GPU tile sizes are tuned to optimize system throughput and accuracy (Dwaraknath et al., 2 Feb 2026).
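
A rough NumPy analogue of the on-the-fly strategy (not the CUDA kernel itself): block neighborhoods and nonzero positions are regenerated from a seeded counter-style RNG rather than stored, and a local accumulator stands in for shared memory. The function name, cyclic-shift rule, and seeding scheme here are illustrative stand-ins for the kernel's affine recurrences and 32-bit hashing:

```python
import numpy as np

def apply_sketch_on_the_fly(x, M, Br, b, s, seed=0):
    """Compute y = S x for a BlockPerm-SJLT without materializing S.

    Illustrative only: the real FlashSketch kernel tiles this over a
    2D grid of CUDA thread-blocks with shared-memory accumulation.
    """
    Bc = len(x) // M
    y = np.zeros(M * Br)
    base = np.random.default_rng(seed)
    offsets = base.choice(M, size=b, replace=False)   # edge-disjoint shifts
    for g in range(M):                  # one "thread-block" per row-block
        acc = np.zeros(Br)              # stands in for shared memory
        for off in offsets:
            h = (g + off) % M           # remote input block pi_l(g)
            xb = x[h*Bc:(h+1)*Bc]
            for c in range(Bc):
                # Per-(g, h, c) seeded stream replaces stored indices,
                # mimicking hash-based on-the-fly index generation.
                rg = np.random.default_rng((seed, g, h, c))
                rows = rg.choice(Br, size=s, replace=False)
                signs = rg.choice([-1.0, 1.0], size=s)
                acc[rows] += signs * xb[c] / np.sqrt(s * b)
        y[g*Br:(g+1)*Br] = acc          # coalesced global write
    return y

x = np.random.default_rng(7).standard_normal(64)
y = apply_sketch_on_the_fly(x, M=8, Br=4, b=2, s=2, seed=3)
# Sketching preserves the norm in expectation: E||y||^2 = ||x||^2.
print(np.linalg.norm(y), np.linalg.norm(x))
```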

5. Empirical Behavior and Pareto-Optimality

BlockPerm-SJLT, via FlashSketch, demonstrates superior empirical performance on NVIDIA RTX 4090 and A6000 GPUs for key linear algebra and ML workloads:

  • Benchmarks: Gram-matrix approximation, OSE spectral error, sketch-and-ridge regression, sketch-and-solve least squares, and GraSS ML data-attribution pipeline.
  • Observations:
    • Adjusting $b$ traces a continuous Pareto frontier between sketching accuracy and runtime.
    • At moderate error targets, FlashSketch achieves a $4\times$–$5\times$ speedup over previous GPU SJLT kernels and a global geometric-mean speedup of $\approx 1.7\times$.
    • For $k \ll d$, FlashSketch matches or outperforms dense Gaussian projections (cuBLAS) and FHT-based SRHT in speed and, at times, accuracy.
    • In GraSS pipelines, projection time per example is reduced by up to $3.2\times$ with unchanged downstream LDS metrics (Dwaraknath et al., 2 Feb 2026).

6. Comparative Summary and Scope of BlockPerm-SJLT

BlockPerm-SJLT unifies the design space for GPU-friendly sparse sketches with a single, interpretable parameter $b$ that interpolates between full locality and complete mixing. This enables both theoretical OSE guarantees parameterized by $\mu_{\mathrm{nbr}}$ and a high-efficiency systems implementation. The FlashSketch kernel design capitalizes on this regularity for conflict-free parallelism, pushing the speed–accuracy boundary across key RandNLA tasks and scalable ML pipelines, as measured on state-of-the-art GPU hardware.

Parameter  | Effect on System     | Effect on Statistics
Small $b$  | High locality, fast  | Poor mixing, more distortion
Large $b$  | Higher bandwidth use | Excellent mixing, lower distortion

The BlockPerm-SJLT framework provides a rigorous, tunable approach to large-scale sketching for contemporary GPU architectures while maintaining the theoretical embedding properties central to randomized linear algebra (Dwaraknath et al., 2 Feb 2026).
