Block SRHT: Scalable Randomized Hadamard Transform
- Block SRHT is a structured random matrix technique that uses blockwise subsampled Hadamard transforms to achieve efficient, scalable dimension reduction.
- It reduces communication and computational costs in distributed low-rank approximation, matching standard SRHT accuracy with improved resource efficiency.
- Empirical results demonstrate up to 2–3× speedup and near-optimal (1+ε) accuracy in applications such as randomized SVD and Nyström approximation.
The Block Subsampled Randomized Hadamard Transform (Block SRHT) is a structured random matrix construction designed for efficient dimension reduction on distributed architectures. It is obtained by composing blockwise subsampled randomized Hadamard transforms (SRHTs), enabling scalability and resource efficiency in large-scale matrix computations such as randomized low-rank approximation. Block SRHT achieves accuracy guarantees comparable to those of standard SRHT while substantially improving communication and computational costs on distributed systems (Balabanov et al., 2022).
1. Standard Subsampled Randomized Hadamard Transform (SRHT)
The SRHT is a popular structured random projection for fast dimension reduction. For an input dimension $n$ (a power of two) and sketch size $\ell \le n$, the classical SRHT is defined as

$$\Theta = \sqrt{\tfrac{n}{\ell}}\, P H D \in \mathbb{R}^{\ell \times n},$$

where:
- $D \in \mathbb{R}^{n \times n}$ is diagonal with i.i.d. Rademacher ($\pm 1$) entries,
- $H \in \mathbb{R}^{n \times n}$ is the Walsh–Hadamard matrix scaled by $n^{-1/2}$,
- $P \in \mathbb{R}^{\ell \times n}$ randomly samples $\ell$ rows of its input (with or without replacement).

Application of the SRHT to a vector requires $O(n \log n)$ flops via the fast Walsh–Hadamard transform. For any fixed $d$-dimensional subspace $V \subset \mathbb{R}^n$, if

$$\ell \geq O\!\left(\varepsilon^{-2}\,(d + \log(n/\delta))\,\log(d/\delta)\right)$$

for $0 < \varepsilon < 1$, then with probability at least $1 - \delta$,

$$(1 - \varepsilon)\,\|x\|_2^2 \;\leq\; \|\Theta x\|_2^2 \;\leq\; (1 + \varepsilon)\,\|x\|_2^2 \quad \text{for all } x \in V.$$
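The fast transform is what makes the SRHT cheap to apply. A minimal sketch in NumPy, assuming $n$ is a power of two (the helper names `fwht` and `srht` are illustrative, not from the source):

```python
import numpy as np

def fwht(x):
    """In-place-style fast Walsh-Hadamard transform (unnormalized), O(n log n)."""
    x = x.copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b          # butterfly: sums
            x[i + h:i + 2 * h] = a - b  # butterfly: differences
        h *= 2
    return x

def srht(x, ell, rng):
    """Apply Theta = sqrt(n/ell) * P * H * D to a vector x (n a power of two)."""
    n = len(x)
    d = rng.choice([-1.0, 1.0], size=n)   # Rademacher diagonal D
    y = fwht(d * x) / np.sqrt(n)          # orthonormal scaled Hadamard H
    rows = rng.integers(0, n, size=ell)   # P: sample ell rows with replacement
    return np.sqrt(n / ell) * y[rows]

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
sx = srht(x, 256, rng)
# E[||Theta x||_2^2] = ||x||_2^2: the sketch preserves the norm in expectation
```

Since the Hadamard step is a sequence of butterflies, no $n \times n$ matrix is ever formed.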
2. Formal Definition of Block SRHT
Suppose $n = pr$ coordinates are partitioned into $p$ contiguous blocks of size $r$ (where $r$ is a power of two, or the blocks can be zero-padded). The Block SRHT $\Theta \in \mathbb{R}^{\ell \times n}$ is constructed as

$$\Theta = \left[\Theta^{(1)}, \Theta^{(2)}, \ldots, \Theta^{(p)}\right],$$

where each $\Theta^{(i)} \in \mathbb{R}^{\ell \times r}$ is itself an SRHT on $r$ coordinates, but sharing the sampling step and each having its own independent diagonal Rademacher matrices:

$$\Theta^{(i)} = \sqrt{\tfrac{r}{\ell}}\, D_L^{(i)}\, P\, H\, D_R^{(i)},$$

with:
- $D_L^{(i)} \in \mathbb{R}^{\ell \times \ell}$ and $D_R^{(i)} \in \mathbb{R}^{r \times r}$, independent diagonal Rademacher matrices,
- $H \in \mathbb{R}^{r \times r}$ the scaled Hadamard matrix,
- $P \in \mathbb{R}^{\ell \times r}$ selecting $\ell$ rows of its input with replacement (the same $P$ for all blocks).

This blockwise decomposition "splits" the global Hadamard transform into $p$ local Hadamard transforms, with local sign flips before and after, and a shared sampling operator. The blocks are stacked horizontally to construct the global matrix.
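The scaling $\sqrt{r/\ell}$ makes the sketch correct in expectation. A short check (an added consistency argument, using $\mathbb{E}[P^\top P] = \tfrac{\ell}{r} I_r$ for sampling with replacement, and $(D_L^{(i)})^2 = I_\ell$): for each block,

$$\mathbb{E}\!\left[(\Theta^{(i)})^\top \Theta^{(i)}\right] = \tfrac{r}{\ell}\, D_R^{(i)} H^\top\, \mathbb{E}\!\left[P^\top P\right] H D_R^{(i)} = D_R^{(i)} H^\top H D_R^{(i)} = I_r,$$

while for $i \neq j$ the cross terms $\mathbb{E}\big[(\Theta^{(i)})^\top \Theta^{(j)}\big]$ vanish because the zero-mean matrices $D_L^{(i)}$ and $D_L^{(j)}$ are independent. Hence $\mathbb{E}[\Theta^\top \Theta] = I_n$, which is also why the left sign flips $D_L^{(i)}$ are needed even though they do not change row norms.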
3. The Oblivious Subspace Embedding Property
A random $\Theta \in \mathbb{R}^{\ell \times n}$ is an $(\varepsilon, \delta, d)$ oblivious subspace embedding (OSE) if for every fixed $d$-dimensional subspace $V \subset \mathbb{R}^n$, with probability at least $1 - \delta$,

$$(1 - \varepsilon)\,\|x\|_2^2 \;\leq\; \|\Theta x\|_2^2 \;\leq\; (1 + \varepsilon)\,\|x\|_2^2 \quad \text{for all } x \in V.$$

The main theorem (Balabanov et al., 2022) asserts that, under the block-SRHT construction above, if

$$\ell \geq O\!\left(\varepsilon^{-2}\,(d + \log(n/\delta))\,\log(d/\delta)\right),$$

then $\Theta$ is an $(\varepsilon, \delta, d)$ OSE. This lower bound on the sketch size matches the standard SRHT up to constants and logarithmic factors. The proof involves a replacement trick for sampling, Rademacher–Lipschitz tail bounds to show uniformity of row norms, and a matrix-Chernoff argument for preservation of singular values under row sampling.
4. Deployment in Randomized Matrix Algorithms
Block SRHT can be directly incorporated into distributed low-rank approximation algorithms.
4.1 Generation in Practice
Given $p$, $r$, and $\ell$, construct $\Theta$ by:
- Draw a shared sampling operator $P$ ($\ell$ row indices sampled with replacement from $\{1, \ldots, r\}$).
- For $i = 1, \ldots, p$:
  - Generate diagonal Rademacher matrices $D_R^{(i)} \in \mathbb{R}^{r \times r}$ and $D_L^{(i)} \in \mathbb{R}^{\ell \times \ell}$,
  - Build the scaled Hadamard matrix $H \in \mathbb{R}^{r \times r}$ (applied implicitly via the fast transform),
  - Compute $\Theta^{(i)} = \sqrt{r/\ell}\, D_L^{(i)} P H D_R^{(i)}$.
- Stack horizontally to obtain $\Theta = [\Theta^{(1)}, \ldots, \Theta^{(p)}]$.
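The recipe above can be sketched as a toy dense construction (real implementations never materialize $H$ and apply each block via the fast transform; the helper name `block_srht` and all sizes below are illustrative):

```python
import numpy as np
from scipy.linalg import hadamard  # builds the r x r Walsh-Hadamard matrix

def block_srht(p, r, ell, rng):
    """Dense Theta = [Theta^(1), ..., Theta^(p)] of shape (ell, p*r)."""
    H = hadamard(r) / np.sqrt(r)               # scaled Hadamard, H^T H = I
    rows = rng.integers(0, r, size=ell)        # shared sampling P, with replacement
    blocks = []
    for _ in range(p):
        d_r = rng.choice([-1.0, 1.0], size=r)    # right Rademacher D_R^(i)
        d_l = rng.choice([-1.0, 1.0], size=ell)  # left Rademacher D_L^(i)
        # Theta^(i) = sqrt(r/ell) * D_L * P * H * D_R
        blocks.append(np.sqrt(r / ell) * d_l[:, None] * (H[rows, :] * d_r))
    return np.hstack(blocks)

rng = np.random.default_rng(1)
p, r, ell, d = 4, 256, 192, 10
theta = block_srht(p, r, ell, rng)             # shape (192, 1024)
# embedding check: singular values of Theta @ Q stay close to 1
Q, _ = np.linalg.qr(rng.standard_normal((p * r, d)))
s = np.linalg.svd(theta @ Q, compute_uv=False)
```

Only the sign vectors and sampled indices need to be stored or communicated, which is what makes the construction memory-light.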
4.2 Distributed Application
If a tall matrix $A \in \mathbb{R}^{n \times m}$ is partitioned row-wise as $A = [A_1^\top, \ldots, A_p^\top]^\top$ with $A_i \in \mathbb{R}^{r \times m}$, then

$$\Theta A = \sum_{i=1}^{p} \Theta^{(i)} A_i.$$

Each node computes its local sketch $\Theta^{(i)} A_i$, followed by a global sum-reduce with $O(\log p)$ latency and $O(\ell m)$ total bandwidth per node.
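The row-partitioned identity can be checked directly. In this small stand-in, generic random matrices play the role of the $\Theta^{(i)}$ and a plain Python `sum` replaces the MPI all-reduce:

```python
import numpy as np

rng = np.random.default_rng(2)
p, r, ell, m = 4, 64, 32, 8
A_blocks = [rng.standard_normal((r, m)) for _ in range(p)]    # A_i stored on node i
T_blocks = [rng.standard_normal((ell, r)) for _ in range(p)]  # Theta^(i) on node i
local = [T @ Ai for T, Ai in zip(T_blocks, A_blocks)]         # local sketches, no comms
sketch = sum(local)                                           # stand-in for sum-reduce
# equals the global product Theta @ A
assert np.allclose(sketch, np.hstack(T_blocks) @ np.vstack(A_blocks))
```

Note that the only inter-node traffic is the final $\ell \times m$ reduction, independent of $n$.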
4.3 Example: Randomized SVD
Given $A \in \mathbb{R}^{n \times m}$, target rank $k$, and sketch size $\ell \geq k$ (with $\Theta \in \mathbb{R}^{\ell \times m}$ sketching the $m$-dimensional row space), the steps are:
- $Y = A\Theta^\top$ (1st distributed pass),
- Orthonormalize: $Q = \operatorname{orth}(Y)$,
- $B = Q^\top A$ (2nd pass),
- $B \in \mathbb{R}^{\ell \times m}$,
- Compute the SVD of $B = \tilde{U} \Sigma V^\top$,
- Output $U = Q\tilde{U}$, $\Sigma$, $V$.

By the OSE property, this yields quasi-optimal $(1+\varepsilon)$-accuracy with high probability.
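A single-node sketch of the two-pass procedure above, with a Gaussian test matrix standing in for $\Theta^\top$ (any OSE works here; the helper name `randomized_svd` and the test problem are illustrative):

```python
import numpy as np

def randomized_svd(A, k, ell, rng):
    """Two-pass randomized SVD: rank-k factors from an ell-column sketch."""
    n, m = A.shape
    Omega = rng.standard_normal((m, ell))   # stand-in for Theta^T
    Y = A @ Omega                           # 1st pass over A
    Q, _ = np.linalg.qr(Y)                  # orthonormal range basis
    B = Q.T @ A                             # 2nd pass, small ell x m matrix
    U_t, S, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_t[:, :k], S[:k], Vt[:k, :]

rng = np.random.default_rng(3)
# test matrix with a sharp rank-5 signal plus a tiny tail
A = rng.standard_normal((500, 60)) @ np.diag([10, 8, 6, 4, 2] + [0.01] * 55)
U, S, Vt = randomized_svd(A, k=5, ell=15, rng=rng)
err = np.linalg.norm(A - U @ np.diag(S) @ Vt) / np.linalg.norm(A)
# err is close to the optimal rank-5 residual
```

In a distributed run, the matrix products over $A$ would use the block-SRHT sum-reduce pattern from Section 4.2.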
4.4 Example: Nyström Approximation
For a PSD matrix $A \in \mathbb{R}^{n \times n}$ and sketch $\Theta \in \mathbb{R}^{\ell \times n}$:
- $Y = A\Theta^\top$ (one distributed pass),
- $W = \Theta Y = \Theta A \Theta^\top$,
- Cholesky or SVD: $W = LL^\top$,
- $Z = Y L^{-\top}$,
- $\mathrm{SVD}(Z) = U \tilde{\Sigma} V^\top$,
- $\Lambda = \tilde{\Sigma}^2$,
- $\hat{A} = U \Lambda U^\top$.

This achieves a relative trace-norm guarantee of the form $\|A - \hat{A}\|_* \leq (1+\varepsilon)\,\|A - A_k\|_*$ for an appropriate sketch size, where $A_k$ is the best rank-$k$ approximation.
5. Complexity, Communication, and Memory Analysis
For given $m$ (columns of the sketched matrix), $\ell$ (sketch rows), and $r$ (local block size):
- Computational cost per node: $O(r m \log r)$ flops for the local Hadamard transforms, plus $O(\ell m)$ for sampling, sign flips, and the node's share of the sum-reduce.
- Communication: a single all-reduce of $\ell \times m$ matrices, i.e. $O(\log p)$ messages and $O(\ell m)$ bytes per node.
- Memory footprint: only the local $D_R^{(i)}$ ($r$ signs) and $D_L^{(i)}$ ($\ell$ signs) need be stored, and they can often be generated on the fly; Hadamard matrices do not require explicit storage.
A comparison is summarized below:
| Method | Local Flops | Communication | Memory per Node |
|---|---|---|---|
| Gaussian | $O(\ell r m)$ | one all-reduce of $\ell \times m$ | $O(\ell r)$ (store dense sketch matrix) |
| Block SRHT | $O(r m \log r)$, typically a small fraction of Gaussian | one all-reduce of $\ell \times m$ | $O(r + \ell)$ (sign vectors) |
| Standard SRHT | $O(n m \log n)$ (global) | global butterfly pattern | requires full global matrix |

Block SRHT achieves far better scalability once the number of nodes reaches roughly $8$–$16$, since it completely replaces the global Hadamard communication pattern with local computations and a flat sum-reduce.
6. Large-Scale Empirical Evaluation
Experiments with Julia implementations, run with 32 cores per node on kernel matrices (e.g., MNIST, YearPredictionMSD), yield the following observations:
- Gaussian and block SRHT sketches produce essentially identical trace-norm relative errors, with error decaying rapidly in the sketch size $\ell$ and matching the singular-value tails of the full SVD.
- Runtime: Gaussian sketching time grows linearly in $\ell$, while block-SRHT time grows sublinearly and is dominated by the communication phase, giving roughly a $2$–$3\times$ speedup at moderate sketch sizes.
- Strong scaling: block SRHT achieves near-perfect local scaling up to hundreds of cores, after which all-reduce costs begin to dominate; Gaussian methods have higher local costs and encounter memory issues at larger scales.
- Block SRHT remains memory-efficient even for large sketch sizes, whereas Gaussian sketches can exhaust node memory.
- Weak scaling: block SRHT retains a $2$–$3\times$ speedup as problem and machine size grow together, with flat local costs and gradually increasing communication.
On practical clusters, block SRHT matches Gaussian accuracy and outperforms both Gaussian and standard SRHT by up to $2$–$3\times$ in sketching time, scaling to thousands of cores (Balabanov et al., 2022).
7. Parameter Tuning and Best Practices
Parameter selection guidelines for block SRHT on real clusters:
- Number of blocks $p$: choose $p$ so that the local block size $r = n/p$ is a power of two (or can be zero-padded), keeping $r$ large enough to maintain small constants in the embedding bounds.
- Sketch rows $\ell$: for a target subspace dimension $d$ (the rank $k$ plus oversampling of up to $20$), set $\ell = O(\varepsilon^{-2}(d + \log(n/\delta))\log(d/\delta))$.
- Oversampling: typically up to $50$ extra sketch rows suffice for $\varepsilon$ up to $0.2$.
- Block size $r$ vs. $\ell$: keeping the ratio $\ell/r$ small (at most about $0.2$) preserves the communication advantage over standard SRHT.
- Accuracy parameters: for many ML tasks, moderate values of $\varepsilon$ and failure probability $\delta$ suffice.
A plausible implication is that, by carefully tuning these parameters, one can balance local computation, memory, and inter-node communication to achieve near-optimal performance for large-scale randomized linear algebra.
Conclusion
Block SRHT attains the same strong embedding guarantees as standard SRHT in terms of sketch size and accuracy, but eliminates the need for global Hadamard transforms and communication-intensive butterfly reductions. It leverages independent local transforms and a simple global sum-reduce to achieve 2–3× speed improvements over Gaussian projection and orders-of-magnitude better scalability than standard SRHT, while maintaining identical accuracy in applications such as randomized SVD and Nyström approximation. Choosing $p$ and $\ell$ so that $r = n/p$ is a power of two and $\ell = O(\varepsilon^{-2}(d + \log(n/\delta))\log(d/\delta))$ ensures low communication and memory costs together with a provable $(1 \pm \varepsilon)$ isometry for arbitrary $d$-dimensional subspaces with high probability (Balabanov et al., 2022).