
Block SRHT: Scalable Randomized Hadamard Transform

Updated 28 January 2026
  • Block SRHT is a structured random matrix technique that uses blockwise subsampled Hadamard transforms to achieve efficient, scalable dimension reduction.
  • It reduces communication and computational costs in distributed low-rank approximation, matching standard SRHT accuracy with improved resource efficiency.
  • Empirical results demonstrate up to 2–3× speedup and near-optimal (1+ε) accuracy in applications such as randomized SVD and Nyström approximation.

The Block Subsampled Randomized Hadamard Transform (Block SRHT) is a structured random matrix construction designed for efficient dimension reduction on distributed architectures. It is obtained by composing blockwise subsampled randomized Hadamard transforms (SRHTs), enabling scalability and resource efficiency in large-scale matrix computations such as randomized low-rank approximation. Block SRHT achieves accuracy guarantees comparable to those of the standard SRHT while substantially reducing communication and computational costs on distributed systems (Balabanov et al., 2022).

1. Standard Subsampled Randomized Hadamard Transform (SRHT)

The SRHT is a popular structured random projection matrix for fast dimension reduction. For an input dimension $n$ (a power of two) and sketch size $l \ll n$, the classical SRHT is defined as

$$\Omega = \left(\frac{n}{l}\right)^{1/2} R H D$$

where:

  • $D \in \mathbb{R}^{n \times n}$ is diagonal with i.i.d. Rademacher ($\pm 1$) entries,
  • $H \in \mathbb{R}^{n \times n}$ is the Walsh–Hadamard matrix scaled by $1/\sqrt{n}$,
  • $R \in \mathbb{R}^{l \times n}$ randomly samples $l$ rows of its input (with or without replacement).

Applying the SRHT to a vector requires $O(n \log n)$ flops. For any fixed $d$-dimensional subspace $V \subset \mathbb{R}^n$, if

$$l \geq C \epsilon^{-2} \left(\sqrt{d} + \sqrt{8\log(n/\delta)}\right)^2 \log(d/\delta)$$

for $C \sim 3$–$4$, then with probability at least $1 - \delta$,

$$\forall x \in V,\quad (1-\epsilon)\|x\|_2^2 \leq \|\Omega x\|_2^2 \leq (1+\epsilon)\|x\|_2^2.$$
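As an illustration, a dense SRHT can be assembled in a few lines of NumPy/SciPy. This is a sketch for exposition only (names and sizes are made up, and a practical implementation would apply a fast Walsh–Hadamard transform matrix-free rather than forming $H$ densely):

```python
import numpy as np
from scipy.linalg import hadamard

def srht(n, l, rng):
    """Dense l x n SRHT, Omega = sqrt(n/l) * R H D (illustrative only)."""
    d = rng.choice([-1.0, 1.0], size=n)         # Rademacher diagonal D
    H = hadamard(n) / np.sqrt(n)                # Walsh-Hadamard scaled by 1/sqrt(n)
    rows = rng.choice(n, size=l, replace=True)  # sampling operator R
    return np.sqrt(n / l) * H[rows] * d         # rows of H, columns sign-flipped by D

rng = np.random.default_rng(0)
n, l = 1024, 128
Omega = srht(n, l, rng)
x = rng.standard_normal(n)
# Norms are preserved in expectation; for a single draw the ratio is close to 1.
ratio = np.linalg.norm(Omega @ x) / np.linalg.norm(x)
```

The sign flips by $D$ followed by $H$ "flatten" the coordinates of $x$, which is what makes uniform row sampling by $R$ accurate.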

2. Formal Definition of Block SRHT

Suppose the $n$ coordinates are partitioned into $p$ contiguous blocks of size $r = n/p$ (where $r$ is a power of two, or the blocks are zero-padded). The Block SRHT $\Omega \in \mathbb{R}^{l \times n}$ is constructed as:

$$\Omega = [\, \Omega^{(1)} \;\; \Omega^{(2)} \;\; \ldots \;\; \Omega^{(p)} \,]$$

where each $\Omega^{(i)} \in \mathbb{R}^{l \times r}$ is itself an SRHT on $r$ coordinates, sharing the sampling step $R$ but with its own independent diagonal Rademacher matrices:

$$\Omega^{(i)} = \sqrt{\frac{r}{l}}\, \widetilde{D}^{(i)} R H D^{(i)}, \quad i = 1, \dots, p$$

with:

  • $D^{(i)} \in \mathbb{R}^{r \times r}$ and $\widetilde{D}^{(i)} \in \mathbb{R}^{l \times l}$ are independent diagonal Rademacher matrices,
  • $H \in \mathbb{R}^{r \times r}$ is the scaled Hadamard matrix,
  • $R \in \mathbb{R}^{l \times r}$ selects $l$ rows of its input with replacement.

This blockwise decomposition "splits" the $n \times n$ Hadamard transform into $p$ local $r \times r$ Hadamard transforms, with local sign flips before and after, and a shared sampling operator. The blocks are stacked horizontally to form the global matrix.

3. The Oblivious Subspace Embedding Property

A random $\Omega \in \mathbb{R}^{l \times n}$ is an $(\epsilon, \delta, d)$ oblivious subspace embedding (OSE) if, for every fixed $d$-dimensional subspace $V \subset \mathbb{R}^n$, with probability at least $1 - \delta$,

$$\forall x \in V,\quad \bigl|\,\|\Omega x\|_2^2 - \|x\|_2^2\,\bigr| \leq \epsilon \|x\|_2^2.$$

The main theorem (Balabanov et al., 2022) asserts that, under the block-SRHT construction above, if

$$l \geq 3.7\,\epsilon^{-2}\left(\sqrt{d} + 4\sqrt{\log(n/\delta)} + 6.3\right)^2 \log(5d/\delta)$$

then $\Omega$ is an $(\epsilon, \delta, d)$ OSE. This sketch-size requirement matches that of the standard SRHT up to constants and logarithmic factors. The proof combines a replacement trick for the sampling step, Rademacher–Lipschitz tail bounds establishing uniformity of row norms, and a matrix Chernoff argument showing that singular values are preserved under row sampling.
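The OSE property can be checked empirically: for an orthonormal basis $Q$ of a $d$-dimensional subspace, the distortion $\epsilon$ equals the largest deviation of the squared singular values of $\Omega Q$ from 1. A small dense NumPy/SciPy experiment with illustrative (made-up) sizes:

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(1)
n, p, l, d = 8192, 8, 256, 20
r = n // p
H = hadamard(r) / np.sqrt(r)
rows = rng.choice(r, size=l, replace=True)             # shared sampling operator R
Omega = np.hstack([np.sqrt(r / l)
                   * rng.choice([-1.0, 1.0], size=l)[:, None]  # diag of D~^(i)
                   * H[rows]
                   * rng.choice([-1.0, 1.0], size=r)           # diag of D^(i)
                   for _ in range(p)])                 # Block SRHT, shape (l, n)
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))       # random d-dim subspace basis
s = np.linalg.svd(Omega @ Q, compute_uv=False)
eps_hat = max(s.max()**2 - 1, 1 - s.min()**2)          # observed distortion epsilon
```

With $l \gg d$ the observed `eps_hat` is well below 1, i.e. all singular values of $\Omega Q$ stay close to 1, as the theorem predicts.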

4. Deployment in Randomized Matrix Algorithms

Block SRHT can be directly incorporated into distributed low-rank approximation algorithms.

4.1 Generation in Practice

Given $n, p, r, l$, construct $\Omega$ as follows:

  • For $i = 1, \dots, p$:
    • generate diagonal Rademacher matrices $D^{(i)} \in \{\pm 1\}^{r \times r}$ and $\widetilde{D}^{(i)} \in \{\pm 1\}^{l \times l}$,
    • build the scaled Hadamard matrix $H \in \mathbb{R}^{r \times r}$,
    • compute $\Omega^{(i)} = \sqrt{r/l}\, \widetilde{D}^{(i)} R H D^{(i)}$.

Stack horizontally to obtain Ω\Omega.
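This recipe translates directly into NumPy/SciPy. The sketch below is dense and for exposition only; production code would apply the Hadamard transform matrix-free and never materialize $\Omega$:

```python
import numpy as np
from scipy.linalg import hadamard

def block_srht(n, p, l, rng):
    """Dense Block SRHT: p blocks of size r = n/p, a shared sampling
    operator R, and independent Rademacher diagonals per block."""
    r = n // p
    H = hadamard(r) / np.sqrt(r)                 # scaled r x r Hadamard
    rows = rng.choice(r, size=l, replace=True)   # shared R: l rows, with replacement
    blocks = []
    for _ in range(p):
        D = rng.choice([-1.0, 1.0], size=r)      # diagonal of D^(i)
        Dt = rng.choice([-1.0, 1.0], size=l)     # diagonal of D~^(i)
        blocks.append(np.sqrt(r / l) * Dt[:, None] * H[rows] * D)
    return np.hstack(blocks)                     # [Omega^(1) ... Omega^(p)]

Omega = block_srht(n=4096, p=8, l=256, rng=np.random.default_rng(2))
```

Every entry of the resulting matrix has magnitude $1/\sqrt{l}$, so only the random signs and the sampled row indices need to be stored or regenerated on the fly.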

4.2 Distributed Application

If a tall matrix $V \in \mathbb{R}^{n \times d}$ is partitioned row-wise as $V = [V^{(1)}; \dots; V^{(p)}]$, then

$$\Omega V = \sum_{i=1}^p \Omega^{(i)} V^{(i)}$$

Each node computes its local sketch $\Omega^{(i)} V^{(i)}$, followed by a global sum-reduce with $O(\log p)$ latency and $O(dl \log p)$ total bandwidth.
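The identity above can be verified in a serial simulation in which each "node" holds one block $\Omega^{(i)}$ and one row-slice $V^{(i)}$ (hypothetical sizes; the Python `sum` stands in for an MPI all-reduce):

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(3)
n, p, l, d = 2048, 4, 128, 16
r = n // p
H = hadamard(r) / np.sqrt(r)
rows = rng.choice(r, size=l, replace=True)            # shared sampling operator R
blocks = [np.sqrt(r / l)
          * rng.choice([-1.0, 1.0], size=l)[:, None]  # D~^(i)
          * H[rows]
          * rng.choice([-1.0, 1.0], size=r)           # D^(i)
          for _ in range(p)]                          # Omega^(i), one per node

V = rng.standard_normal((n, d))
local = [blocks[i] @ V[i * r:(i + 1) * r] for i in range(p)]  # per-node sketches
sketch = sum(local)                                   # sum-reduce across nodes
ok = np.allclose(sketch, np.hstack(blocks) @ V)       # equals the global product
```

No node ever needs the full $V$ or the full $\Omega$; only the $l \times d$ partial sketches are communicated.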

4.3 Example: Randomized SVD

Given $A \in \mathbb{R}^{m \times n}$, target rank $k$, and sketch size $l$, the steps are:

  1. $Y = A \Omega^\top$ (first distributed pass),
  2. orthonormalize $Y \to Q$,
  3. $Z = Q^\top A$ (second pass),
  4. $[P, R] = \mathrm{QR}(Z^\top)$,
  5. SVD: $R^\top = \widetilde{U} \widetilde{\Sigma} \widetilde{V}^\top$,
  6. output $U_k = Q \widetilde{U}_k$, $\Sigma_k = \widetilde{\Sigma}_k$, $V_k = P \widetilde{V}_k$.

By the OSE property, this yields quasi-optimal $(1 + O(\epsilon))$-accuracy with high probability.
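The six steps can be sketched in NumPy. For brevity a Gaussian test matrix stands in for the sketch here; any OSE, including a Block SRHT, can be substituted without changing the rest of the code:

```python
import numpy as np

def rand_svd(A, k, l, rng):
    """Two-pass randomized SVD following the steps above."""
    m, n = A.shape
    Omega = rng.standard_normal((l, n)) / np.sqrt(l)  # stand-in for Block SRHT
    Y = A @ Omega.T                                   # step 1: first pass
    Q, _ = np.linalg.qr(Y)                            # step 2: orthonormalize
    Z = Q.T @ A                                       # step 3: second pass
    P, R = np.linalg.qr(Z.T)                          # step 4
    Uh, S, Vth = np.linalg.svd(R.T)                   # step 5
    return Q @ Uh[:, :k], S[:k], P @ Vth[:k].T        # step 6: U_k, Sigma_k, V_k

rng = np.random.default_rng(4)
A = rng.standard_normal((300, 40)) @ rng.standard_normal((40, 200))  # rank 40
U, S, V = rand_svd(A, k=40, l=60, rng=rng)
err = np.linalg.norm(A - U @ (S[:, None] * V.T)) / np.linalg.norm(A)
```

Because the test matrix here has exact rank 40 and $l > 40$, the sketch captures the full range of $A$ and the reconstruction error is at the level of round-off.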

4.4 Example: Nyström Approximation

For $A \in \mathbb{R}^{n \times n}$ with $A \succeq 0$, target rank $k$, and sketch $\Omega$:

  1. $Y = A \Omega^\top$,
  2. Cholesky factorization $\Omega Y = C C^\top$ (note $\Omega Y = \Omega A \Omega^\top \succeq 0$),
  3. $Z = Y C^{-\top}$ (triangular solve),
  4. $[Q_z, R] = \mathrm{QR}(Z)$,
  5. SVD: $R = \widetilde{U} \Sigma \widetilde{V}^\top$,
  6. $\widehat{U}_k = Q_z \widetilde{U}_k = Z \widetilde{V}_k \Sigma_k^{-1}$,
  7. $B_k = \widehat{U}_k \Sigma_k^2 \widehat{U}_k^\top$.

Since $Z Z^\top = Y (\Omega Y)^{-1} Y^\top$ is exactly the Nyström approximation of $A$, truncating its eigendecomposition yields $B_k$, which achieves the $(1+\epsilon)$ relative trace-norm guarantee.
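A serial NumPy sketch of these steps, with a Gaussian stand-in for $\Omega$ and a synthetic PSD matrix. No regularization shift is used, so this assumes $\Omega Y$ is well-conditioned; production implementations add a small shift before the Cholesky factorization:

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, l = 200, 10, 30
W = rng.standard_normal((n, k))
A = W @ W.T + 1e-2 * np.eye(n)           # PSD test matrix, effective rank ~ k
Omega = rng.standard_normal((l, n)) / np.sqrt(l)  # stand-in for Block SRHT

Y = A @ Omega.T                          # step 1: sketch
C = np.linalg.cholesky(Omega @ Y)        # step 2: Omega Y = C C^T, C lower-triangular
Z = np.linalg.solve(C, Y.T).T            # step 3: Z = Y C^{-T}
Qz, R = np.linalg.qr(Z)                  # step 4
Ut, S, Vt = np.linalg.svd(R)             # step 5
Uk = Qz @ Ut[:, :k]                      # step 6
Bk = (Uk * S[:k]**2) @ Uk.T              # step 7: rank-k Nystrom approximation
rel_err = np.linalg.norm(A - Bk) / np.linalg.norm(A)
```

Because the test matrix has a sharp spectral drop after rank $k$, the rank-$k$ Nyström approximation recovers $A$ up to the small residual term.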

5. Complexity, Communication, and Memory Analysis

For given $d$ (columns of the matrix being sketched), $l$ (sketch rows), and $r = n/p$ (local block size):

  • Computational cost per node: $O(rd\log r)$ flops for the local Hadamard transforms, plus $O(dl\log p)$ for the sum-reduce.
  • Communication: a single all-reduce of $d \times l$ matrices, i.e. $O(\log p)$ messages and $O(dl\log p)$ bytes.
  • Memory footprint: each node stores only its local $V^{(i)}$ ($r \times d$) and local $\Omega^{(i)}$ ($l \times r$), which can often be generated on the fly; Hadamard matrices need not be stored explicitly.

A comparison is summarized below:

| Method | Local flops | Communication | Memory per node |
| --- | --- | --- | --- |
| Gaussian | $O(rdl)$ | $O(dl\log p)$ | dense $r \times l$ block stored |
| Block SRHT | $O(rd\log r)$ | $O(dl\log p)$ | typically $\leq \frac{1}{2}$ that of Gaussian |
| Standard SRHT | $O(n\log n)$ (global) | $O(n\log n)$ (global butterfly) | requires the full global matrix |

Block SRHT achieves far better scalability once $p \gtrsim 8$–$16$, since it replaces the global Hadamard communication pattern entirely with local computations and a flat sum-reduce.

6. Large-Scale Empirical Evaluation

Experiments using Julia implementations with 32 cores per node, performed on kernel matrices $A \in \mathbb{R}^{65536 \times 65536}$ (e.g., MNIST, YearPredictionMSD), yield the following observations:

  • Both Gaussian and Block SRHT sketches produce essentially identical trace-norm relative errors, with error decaying rapidly in $k$ and matching the tail of the full SVD.
  • Runtime: for $l = 2000$, Gaussian sketching requires ${\sim}12.7$ s, while Block SRHT requires ${\sim}4.8$ s (a ${\sim}2.5\times$ speedup); Gaussian runtime grows linearly in $l$, while Block SRHT growth is sublinear, dominated by the communication phase.
  • Strong scaling ($n = 10^7$, $l = 2000$, $d = 200$): Block SRHT achieves perfect local scaling up to $p = 384$ cores, after which all-reduce costs increase; Gaussian methods have higher local costs and encounter memory issues at larger $p$.
  • For $n = 10^8$: Block SRHT remains memory-efficient even for large $r$, whereas Gaussian sketches can exhaust node memory.
  • Weak scaling: Block SRHT retains a $2$–$3\times$ speedup up to $p = 1536$, with flat local costs and gradually increasing communication.

On practical clusters, Block SRHT matches Gaussian accuracy and outperforms both Gaussian and standard SRHT by up to $2$–$3\times$ in sketch time, scaling to thousands of cores (Balabanov et al., 2022).

7. Parameter Tuning and Best Practices

Parameter selection guidelines for block SRHT on real clusters:

  • Number of blocks $p$: choose $p$ such that $r = n/p$ is a power of two (or can be padded) and $r \gtrsim 10d$, to keep the constants in the $\epsilon$-bounds small.
  • Sketch rows $l$: for subspace dimension $d$ (e.g., $d = k + 10$–$20$ with oversampling), set $l \sim O(\epsilon^{-2}(d + \log(n/\delta))\log(d/\delta))$.
  • Oversampling $l - d$: typically $l - d \sim 20$–$50$ suffices for $\epsilon \sim 0.1$–$0.2$.
  • Block size $r$ vs. $l$: if $l \leq r$, Block SRHT reduces communication by a factor $r/l$ relative to standard SRHT; aim for $l/r \sim 0.05$–$0.2$.
  • Accuracy parameters: for many ML tasks, $\epsilon \approx 0.1$ and $\delta \approx 10^{-6}$ suffice.

A plausible implication is that, by carefully tuning these parameters, one can balance local computation, memory, and inter-node communication to achieve near-optimal performance for large-scale randomized linear algebra.
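For reference, the theoretical sketch-size bound from Section 3 can be evaluated directly. Note that it is conservative; the much smaller oversampling heuristics above are what practitioners typically use:

```python
import math

def sketch_rows(n, d, eps=0.1, delta=1e-6):
    """Sketch size from the OSE bound:
    l >= 3.7 eps^-2 (sqrt(d) + 4 sqrt(log(n/delta)) + 6.3)^2 log(5 d / delta)."""
    return math.ceil(3.7 / eps**2
                     * (math.sqrt(d) + 4 * math.sqrt(math.log(n / delta)) + 6.3) ** 2
                     * math.log(5 * d / delta))

# Worst-case theoretical l for the strong-scaling setting (n = 1e7, d = 200);
# far larger than the l = 2000 that suffices empirically.
l_theory = sketch_rows(n=10**7, d=200)
```

The gap between `l_theory` and the empirical $l = 2000$ illustrates why the bound should be read as a scaling law in $d$, $\epsilon$, and $\delta$ rather than a practical recipe.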

Conclusion

Block SRHT attains the same strong embedding guarantees as the standard SRHT in terms of sketch size and accuracy, but eliminates the need for global Hadamard transforms and communication-intensive butterfly reductions. It leverages independent local transforms and a simple global sum-reduce to achieve $2$–$3\times$ speed improvements over Gaussian projection and orders-of-magnitude better scalability than standard SRHT, while maintaining identical accuracy in applications such as randomized SVD and Nyström approximation. Choosing $p$ and $l$ so that $r = n/p \gg d$ and $l \ll r$ ensures minimal communication and memory usage, along with a provable $(1 \pm \epsilon)$ isometry for arbitrary $d$-dimensional subspaces with high probability (Balabanov et al., 2022).
