
Block SRHT: Scalable Randomized Hadamard Transform

Updated 28 January 2026
  • Block SRHT is a structured random matrix technique that uses blockwise subsampled Hadamard transforms to achieve efficient, scalable dimension reduction.
  • It reduces communication and computational costs in distributed low-rank approximation, matching standard SRHT accuracy with improved resource efficiency.
  • Empirical results demonstrate up to 2–3× speedup and near-optimal (1+ε) accuracy in applications such as randomized SVD and Nyström approximation.

The Block Subsampled Randomized Hadamard Transform (Block SRHT) is a structured random matrix construction designed for efficient dimension reduction on distributed architectures. It is obtained by composing blockwise subsampled randomized Hadamard transforms (SRHTs), enabling scalability and resource efficiency in large-scale matrix computations such as randomized low-rank approximation. Block SRHT achieves accuracy guarantees comparable to those of the standard SRHT while substantially reducing communication and computational costs on distributed systems (Balabanov et al., 2022).

1. Standard Subsampled Randomized Hadamard Transform (SRHT)

The SRHT is a popular structured random projection matrix for fast dimension reduction. For an input dimension $n$ (a power of two) and sketch size $l \ll n$, the classical SRHT is defined as

$$\Omega = \left(\frac{n}{l}\right)^{1/2} R H D$$

where:

  • $D \in \mathbb{R}^{n \times n}$ is diagonal with i.i.d. Rademacher ($\pm 1$) entries,
  • $H \in \mathbb{R}^{n \times n}$ is the Walsh–Hadamard matrix scaled by $1/\sqrt{n}$,
  • $R \in \mathbb{R}^{l \times n}$ randomly samples $l$ rows of its input (with or without replacement).

Applying the SRHT to a vector requires $O(n \log n)$ flops. For any fixed $d$-dimensional subspace $V \subset \mathbb{R}^n$, if

$$l \geq C \epsilon^{-2} \left(\sqrt{d} + \sqrt{8\log(n/\delta)}\right)^2 \log(d/\delta)$$

for $C \sim 3$–$4$, then with probability at least $1 - \delta$,

$$\forall x \in V,\quad (1-\epsilon)\|x\|_2^2 \leq \|\Omega x\|_2^2 \leq (1+\epsilon)\|x\|_2^2.$$
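As an illustration, a dense SRHT can be assembled in a few lines of NumPy/SciPy. This is a sketch for exposition only (names and sizes are made up, and a practical implementation would apply a fast Walsh–Hadamard transform matrix-free rather than forming $H$ densely):

```python
import numpy as np
from scipy.linalg import hadamard

def srht(n, l, rng):
    """Dense l x n SRHT, Omega = sqrt(n/l) * R H D (illustrative only)."""
    d = rng.choice([-1.0, 1.0], size=n)         # Rademacher diagonal D
    H = hadamard(n) / np.sqrt(n)                # Walsh-Hadamard scaled by 1/sqrt(n)
    rows = rng.choice(n, size=l, replace=True)  # sampling operator R
    return np.sqrt(n / l) * H[rows] * d         # rows of H, columns sign-flipped by D

rng = np.random.default_rng(0)
n, l = 1024, 128
Omega = srht(n, l, rng)
x = rng.standard_normal(n)
# Norms are preserved in expectation; for a single draw the ratio is close to 1.
ratio = np.linalg.norm(Omega @ x) / np.linalg.norm(x)
```

The sign flips by $D$ followed by $H$ "flatten" the coordinates of $x$, which is what makes uniform row sampling by $R$ accurate.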

2. Formal Definition of Block SRHT

Suppose the $n$ coordinates are partitioned into $p$ contiguous blocks of size $r = n/p$ (where $r$ is a power of two, or the blocks are zero-padded). The Block SRHT $\Omega \in \mathbb{R}^{l \times n}$ is constructed as:

$$\Omega = [\, \Omega^{(1)} \;\; \Omega^{(2)} \;\; \ldots \;\; \Omega^{(p)} \,]$$

where each $\Omega^{(i)} \in \mathbb{R}^{l \times r}$ is itself an SRHT on $r$ coordinates, sharing the sampling step $R$ but with its own independent diagonal Rademacher matrices:

$$\Omega^{(i)} = \sqrt{\frac{r}{l}}\, \widetilde{D}^{(i)} R H D^{(i)}, \quad i = 1, \dots, p$$

with:

  • $D^{(i)} \in \mathbb{R}^{r \times r}$ and $\widetilde{D}^{(i)} \in \mathbb{R}^{l \times l}$ are independent diagonal Rademacher matrices,
  • $H \in \mathbb{R}^{r \times r}$ is the scaled Hadamard matrix,
  • $R \in \mathbb{R}^{l \times r}$ selects $l$ rows of its input with replacement.

This blockwise decomposition "splits" the $n \times n$ Hadamard transform into $p$ local $r \times r$ Hadamard transforms, with local sign flips before and after, and a shared sampling operator. The blocks are stacked horizontally to form the global matrix.

3. The Oblivious Subspace Embedding Property

A random $\Omega \in \mathbb{R}^{l \times n}$ is an $(\epsilon, \delta, d)$ oblivious subspace embedding (OSE) if, for every fixed $d$-dimensional subspace $V \subset \mathbb{R}^n$, with probability at least $1 - \delta$,

$$\forall x \in V,\quad \bigl|\,\|\Omega x\|_2^2 - \|x\|_2^2\,\bigr| \leq \epsilon \|x\|_2^2.$$

The main theorem (Balabanov et al., 2022) asserts that, under the block-SRHT construction above, if

$$l \geq 3.7\,\epsilon^{-2}\left(\sqrt{d} + 4\sqrt{\log(n/\delta)} + 6.3\right)^2 \log(5d/\delta)$$

then $\Omega$ is an $(\epsilon, \delta, d)$ OSE. This sketch-size requirement matches that of the standard SRHT up to constants and logarithmic factors. The proof combines a replacement trick for the sampling step, Rademacher–Lipschitz tail bounds establishing uniformity of row norms, and a matrix Chernoff argument showing that singular values are preserved under row sampling.
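The OSE property can be checked empirically: for an orthonormal basis $Q$ of a $d$-dimensional subspace, the distortion $\epsilon$ equals the largest deviation of the squared singular values of $\Omega Q$ from 1. A small dense NumPy/SciPy experiment with illustrative (made-up) sizes:

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(1)
n, p, l, d = 8192, 8, 256, 20
r = n // p
H = hadamard(r) / np.sqrt(r)
rows = rng.choice(r, size=l, replace=True)             # shared sampling operator R
Omega = np.hstack([np.sqrt(r / l)
                   * rng.choice([-1.0, 1.0], size=l)[:, None]  # diag of D~^(i)
                   * H[rows]
                   * rng.choice([-1.0, 1.0], size=r)           # diag of D^(i)
                   for _ in range(p)])                 # Block SRHT, shape (l, n)
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))       # random d-dim subspace basis
s = np.linalg.svd(Omega @ Q, compute_uv=False)
eps_hat = max(s.max()**2 - 1, 1 - s.min()**2)          # observed distortion epsilon
```

With $l \gg d$ the observed `eps_hat` is well below 1, i.e. all singular values of $\Omega Q$ stay close to 1, as the theorem predicts.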

4. Deployment in Randomized Matrix Algorithms

Block SRHT can be directly incorporated into distributed low-rank approximation algorithms.

4.1 Generation in Practice

Given $n, p, r, l$, construct $\Omega$ as follows:

  • For $i = 1, \dots, p$:
    • generate diagonal Rademacher matrices $D^{(i)} \in \{\pm 1\}^{r \times r}$ and $\widetilde{D}^{(i)} \in \{\pm 1\}^{l \times l}$,
    • build the scaled Hadamard matrix $H \in \mathbb{R}^{r \times r}$,
    • compute $\Omega^{(i)} = \sqrt{r/l}\, \widetilde{D}^{(i)} R H D^{(i)}$.

Stack horizontally to obtain Ω\Omega.
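This recipe translates directly into NumPy/SciPy. The sketch below is dense and for exposition only; production code would apply the Hadamard transform matrix-free and never materialize $\Omega$:

```python
import numpy as np
from scipy.linalg import hadamard

def block_srht(n, p, l, rng):
    """Dense Block SRHT: p blocks of size r = n/p, a shared sampling
    operator R, and independent Rademacher diagonals per block."""
    r = n // p
    H = hadamard(r) / np.sqrt(r)                 # scaled r x r Hadamard
    rows = rng.choice(r, size=l, replace=True)   # shared R: l rows, with replacement
    blocks = []
    for _ in range(p):
        D = rng.choice([-1.0, 1.0], size=r)      # diagonal of D^(i)
        Dt = rng.choice([-1.0, 1.0], size=l)     # diagonal of D~^(i)
        blocks.append(np.sqrt(r / l) * Dt[:, None] * H[rows] * D)
    return np.hstack(blocks)                     # [Omega^(1) ... Omega^(p)]

Omega = block_srht(n=4096, p=8, l=256, rng=np.random.default_rng(2))
```

Every entry of the resulting matrix has magnitude $1/\sqrt{l}$, so only the random signs and the sampled row indices need to be stored or regenerated on the fly.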

4.2 Distributed Application

If a tall matrix $V \in \mathbb{R}^{n \times d}$ is partitioned row-wise as $V = [V^{(1)}; \dots; V^{(p)}]$, then

$$\Omega V = \sum_{i=1}^p \Omega^{(i)} V^{(i)}$$

Each node computes its local sketch $\Omega^{(i)} V^{(i)}$, followed by a global sum-reduce with $O(\log p)$ latency and $O(dl \log p)$ total bandwidth.
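The identity above can be verified in a serial simulation in which each "node" holds one block $\Omega^{(i)}$ and one row-slice $V^{(i)}$ (hypothetical sizes; the Python `sum` stands in for an MPI all-reduce):

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(3)
n, p, l, d = 2048, 4, 128, 16
r = n // p
H = hadamard(r) / np.sqrt(r)
rows = rng.choice(r, size=l, replace=True)            # shared sampling operator R
blocks = [np.sqrt(r / l)
          * rng.choice([-1.0, 1.0], size=l)[:, None]  # D~^(i)
          * H[rows]
          * rng.choice([-1.0, 1.0], size=r)           # D^(i)
          for _ in range(p)]                          # Omega^(i), one per node

V = rng.standard_normal((n, d))
local = [blocks[i] @ V[i * r:(i + 1) * r] for i in range(p)]  # per-node sketches
sketch = sum(local)                                   # sum-reduce across nodes
ok = np.allclose(sketch, np.hstack(blocks) @ V)       # equals the global product
```

No node ever needs the full $V$ or the full $\Omega$; only the $l \times d$ partial sketches are communicated.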

4.3 Example: Randomized SVD

Given $A \in \mathbb{R}^{m \times n}$, target rank $k$, and sketch size $l$, the steps are:

  1. $Y = A \Omega^\top$ (first distributed pass),
  2. orthonormalize $Y \to Q$,
  3. $Z = Q^\top A$ (second pass),
  4. $[P, R] = \mathrm{QR}(Z^\top)$,
  5. SVD: $R^\top = \widetilde{U} \widetilde{\Sigma} \widetilde{V}^\top$,
  6. output $U_k = Q \widetilde{U}_k$, $\Sigma_k = \widetilde{\Sigma}_k$, $V_k = P \widetilde{V}_k$.

By the OSE property, this yields quasi-optimal $(1 + O(\epsilon))$-accuracy with high probability.
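The six steps can be sketched in NumPy. For brevity a Gaussian test matrix stands in for the sketch here; any OSE, including a Block SRHT, can be substituted without changing the rest of the code:

```python
import numpy as np

def rand_svd(A, k, l, rng):
    """Two-pass randomized SVD following the steps above."""
    m, n = A.shape
    Omega = rng.standard_normal((l, n)) / np.sqrt(l)  # stand-in for Block SRHT
    Y = A @ Omega.T                                   # step 1: first pass
    Q, _ = np.linalg.qr(Y)                            # step 2: orthonormalize
    Z = Q.T @ A                                       # step 3: second pass
    P, R = np.linalg.qr(Z.T)                          # step 4
    Uh, S, Vth = np.linalg.svd(R.T)                   # step 5
    return Q @ Uh[:, :k], S[:k], P @ Vth[:k].T        # step 6: U_k, Sigma_k, V_k

rng = np.random.default_rng(4)
A = rng.standard_normal((300, 40)) @ rng.standard_normal((40, 200))  # rank 40
U, S, V = rand_svd(A, k=40, l=60, rng=rng)
err = np.linalg.norm(A - U @ (S[:, None] * V.T)) / np.linalg.norm(A)
```

Because the test matrix here has exact rank 40 and $l > 40$, the sketch captures the full range of $A$ and the reconstruction error is at the level of round-off.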

4.4 Example: Nyström Approximation

For $A \in \mathbb{R}^{n \times n}$ with $A \succeq 0$, target rank $k$, and sketch $\Omega$:

  1. $Y = A \Omega^\top$,
  2. Cholesky factorization $\Omega Y = C C^\top$ (note $\Omega Y = \Omega A \Omega^\top \succeq 0$),
  3. $Z = Y C^{-\top}$ (triangular solve),
  4. $[Q_z, R] = \mathrm{QR}(Z)$,
  5. SVD: $R = \widetilde{U} \Sigma \widetilde{V}^\top$,
  6. $\widehat{U}_k = Q_z \widetilde{U}_k = Z \widetilde{V}_k \Sigma_k^{-1}$,
  7. $B_k = \widehat{U}_k \Sigma_k^2 \widehat{U}_k^\top$.

Since $Z Z^\top = Y (\Omega Y)^{-1} Y^\top$ is exactly the Nyström approximation of $A$, truncating its eigendecomposition yields $B_k$, which achieves the $(1+\epsilon)$ relative trace-norm guarantee.
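A serial NumPy sketch of these steps, with a Gaussian stand-in for $\Omega$ and a synthetic PSD matrix. No regularization shift is used, so this assumes $\Omega Y$ is well-conditioned; production implementations add a small shift before the Cholesky factorization:

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, l = 200, 10, 30
W = rng.standard_normal((n, k))
A = W @ W.T + 1e-2 * np.eye(n)           # PSD test matrix, effective rank ~ k
Omega = rng.standard_normal((l, n)) / np.sqrt(l)  # stand-in for Block SRHT

Y = A @ Omega.T                          # step 1: sketch
C = np.linalg.cholesky(Omega @ Y)        # step 2: Omega Y = C C^T, C lower-triangular
Z = np.linalg.solve(C, Y.T).T            # step 3: Z = Y C^{-T}
Qz, R = np.linalg.qr(Z)                  # step 4
Ut, S, Vt = np.linalg.svd(R)             # step 5
Uk = Qz @ Ut[:, :k]                      # step 6
Bk = (Uk * S[:k]**2) @ Uk.T              # step 7: rank-k Nystrom approximation
rel_err = np.linalg.norm(A - Bk) / np.linalg.norm(A)
```

Because the test matrix has a sharp spectral drop after rank $k$, the rank-$k$ Nyström approximation recovers $A$ up to the small residual term.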

5. Complexity, Communication, and Memory Analysis

For given $d$ (columns of the matrix being sketched), $l$ (sketch rows), and $r = n/p$ (local block size):

  • Computational cost per node: $O(rd\log r)$ flops for the local Hadamard transforms, plus $O(dl\log p)$ for the sum-reduce.
  • Communication: a single all-reduce of $d \times l$ matrices, i.e. $O(\log p)$ messages and $O(dl\log p)$ bytes.
  • Memory footprint: each node stores only its local $V^{(i)}$ ($r \times d$) and local $\Omega^{(i)}$ ($l \times r$), which can often be generated on the fly; Hadamard matrices need not be stored explicitly.

A comparison is summarized below:

| Method | Local flops | Communication | Memory per node |
| --- | --- | --- | --- |
| Gaussian | $O(rdl)$ | $O(dl\log p)$ | dense $r \times l$ block stored |
| Block SRHT | $O(rd\log r)$ | $O(dl\log p)$ | typically $\leq \frac{1}{2}$ that of Gaussian |
| Standard SRHT | $O(n\log n)$ (global) | $O(n\log n)$ (global butterfly) | requires the full global matrix |

Block SRHT achieves far better scalability once $p \gtrsim 8$–$16$, since it replaces the global Hadamard communication pattern entirely with local computations and a flat sum-reduce.

6. Large-Scale Empirical Evaluation

Experiments using Julia implementations with 32 cores per node, performed on kernel matrices $A \in \mathbb{R}^{65536 \times 65536}$ (e.g., MNIST, YearPredictionMSD), yield the following observations:

  • Both Gaussian and Block SRHT sketches produce essentially identical trace-norm relative errors, with error decaying rapidly in $k$ and matching the tail of the full SVD.
  • Runtime: for $l = 2000$, Gaussian sketching requires ${\sim}12.7$ s, while Block SRHT requires ${\sim}4.8$ s (a ${\sim}2.5\times$ speedup); Gaussian runtime grows linearly in $l$, while Block SRHT growth is sublinear, dominated by the communication phase.
  • Strong scaling ($n = 10^7$, $l = 2000$, $d = 200$): Block SRHT achieves perfect local scaling up to $p = 384$ cores, after which all-reduce costs increase; Gaussian methods have higher local costs and encounter memory issues at larger $p$.
  • For $n = 10^8$: Block SRHT remains memory-efficient even for large $r$, whereas Gaussian sketches can exhaust node memory.
  • Weak scaling: Block SRHT retains a $2$–$3\times$ speedup up to $p = 1536$, with flat local costs and gradually increasing communication.

On practical clusters, Block SRHT matches Gaussian accuracy and outperforms both Gaussian and standard SRHT by up to $2$–$3\times$ in sketch time, scaling to thousands of cores (Balabanov et al., 2022).

7. Parameter Tuning and Best Practices

Parameter selection guidelines for block SRHT on real clusters:

  • Number of blocks $p$: choose $p$ such that $r = n/p$ is a power of two (or can be padded) and $r \gtrsim 10d$, to keep the constants in the $\epsilon$-bounds small.
  • Sketch rows $l$: for subspace dimension $d$ (e.g., $d = k + 10$–$20$ with oversampling), set $l \sim O(\epsilon^{-2}(d + \log(n/\delta))\log(d/\delta))$.
  • Oversampling $l - d$: typically $l - d \sim 20$–$50$ suffices for $\epsilon \sim 0.1$–$0.2$.
  • Block size $r$ vs. $l$: if $l \leq r$, Block SRHT reduces communication by a factor $r/l$ relative to standard SRHT; aim for $l/r \sim 0.05$–$0.2$.
  • Accuracy parameters: for many ML tasks, $\epsilon \approx 0.1$ and $\delta \approx 10^{-6}$ suffice.

A plausible implication is that, by carefully tuning these parameters, one can balance local computation, memory, and inter-node communication to achieve near-optimal performance for large-scale randomized linear algebra.
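For reference, the theoretical sketch-size bound from Section 3 can be evaluated directly. Note that it is conservative; the much smaller oversampling heuristics above are what practitioners typically use:

```python
import math

def sketch_rows(n, d, eps=0.1, delta=1e-6):
    """Sketch size from the OSE bound:
    l >= 3.7 eps^-2 (sqrt(d) + 4 sqrt(log(n/delta)) + 6.3)^2 log(5 d / delta)."""
    return math.ceil(3.7 / eps**2
                     * (math.sqrt(d) + 4 * math.sqrt(math.log(n / delta)) + 6.3) ** 2
                     * math.log(5 * d / delta))

# Worst-case theoretical l for the strong-scaling setting (n = 1e7, d = 200);
# far larger than the l = 2000 that suffices empirically.
l_theory = sketch_rows(n=10**7, d=200)
```

The gap between `l_theory` and the empirical $l = 2000$ illustrates why the bound should be read as a scaling law in $d$, $\epsilon$, and $\delta$ rather than a practical recipe.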

Conclusion

Block SRHT attains the same strong embedding guarantees as the standard SRHT in terms of sketch size and accuracy, but eliminates the need for global Hadamard transforms and communication-intensive butterfly reductions. It leverages independent local transforms and a simple global sum-reduce to achieve $2$–$3\times$ speed improvements over Gaussian projection and orders-of-magnitude better scalability than standard SRHT, while maintaining identical accuracy in applications such as randomized SVD and Nyström approximation. Choosing $p$ and $l$ so that $r = n/p \gg d$ and $l \ll r$ ensures minimal communication and memory usage, along with a provable $(1 \pm \epsilon)$ isometry for arbitrary $d$-dimensional subspaces with high probability (Balabanov et al., 2022).
