Fast Walsh–Hadamard Transform (FWHT)
- FWHT is a divide-and-conquer algorithm that computes the Walsh–Hadamard transform using solely addition and subtraction, achieving O(N log N) complexity.
- It employs in-place butterfly operations and exploits memory locality and parallelism for efficiency on modern CPU and GPU architectures.
- FWHT underpins diverse applications in digital communications, quantum information processing, and compressed matrix multiplication with significant speedups.
The Fast Walsh–Hadamard Transform (FWHT) is an in-place, divide-and-conquer algorithm for multiplying an input vector by the $N \times N$ Walsh–Hadamard matrix $H_N$, where $N$ is a power of two. The algorithm's simplicity, reliance only on addition and subtraction, and $O(N \log N)$ arithmetic complexity have made it a staple in domains ranging from digital communications and compressed linear algebra to quantum information processing.
1. Mathematical Definition and Structure
Let $N = 2^n$. The (unnormalized) Walsh–Hadamard matrix $H_N$ is defined recursively as:

$$H_1 = \begin{pmatrix} 1 \end{pmatrix}, \qquad H_{2M} = \begin{pmatrix} H_M & H_M \\ H_M & -H_M \end{pmatrix}.$$
Given $x \in \mathbb{R}^N$, the FWHT computes the transform $y = H_N x$, i.e., $y_k = \sum_{j=0}^{N-1} (H_N)_{kj}\, x_j$. Equivalently, its entries satisfy $(H_N)_{ij} = (-1)^{\langle i, j \rangle}$, where $\langle i, j \rangle$ is the bitwise dot product modulo 2 of the binary expansions of $i$ and $j$. The normalized form, $\tilde{H}_N = H_N / \sqrt{N}$, renders the transform orthonormal: $\tilde{H}_N^\top \tilde{H}_N = I$, so $\tilde{H}_N^{-1} = \tilde{H}_N$ (Andersson et al., 14 Jan 2026, Huang et al., 31 Dec 2025).
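As a concrete check of these definitions, the following sketch (plain Python, no external dependencies) builds $H_N$ from the recursion and verifies both the entrywise sign formula and the self-inverse property:

```python
# Minimal illustration of the recursive definition of H_N and the
# entrywise formula (H_N)_{ij} = (-1)^{<i,j>} (bitwise dot product mod 2).

def hadamard(N):
    """Build the unnormalized Walsh-Hadamard matrix recursively."""
    if N == 1:
        return [[1]]
    H = hadamard(N // 2)
    top = [row + row for row in H]
    bot = [row + [-v for v in row] for row in H]
    return top + bot

def sign(i, j):
    # (-1)^{popcount(i & j)}
    return -1 if bin(i & j).count("1") % 2 else 1

N = 8
H = hadamard(N)
assert all(H[i][j] == sign(i, j) for i in range(N) for j in range(N))

# Self-inverse up to scaling: H_N H_N = N I
HH = [[sum(H[i][k] * H[k][j] for k in range(N)) for j in range(N)]
      for i in range(N)]
assert all(HH[i][j] == (N if i == j else 0)
           for i in range(N) for j in range(N))
```

The recursion makes the $O(N \log N)$ butterfly structure visible: each doubling step only copies and negates existing blocks.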
2. Core Algorithms and Fast Implementations
Standard In-Place “Butterfly” FWHT
The classical FWHT, also known as Yates’ algorithm, employs butterfly operations:
- For each stage $s$ (from $0$ to $n-1$), pair elements whose indices differ only in bit $s$.
- For each such pair $(u, v)$, update it to $(u + v, u - v)$.
Formally, for an array $F$ of length $N$:
```python
def FWHT_inplace(F):
    N = len(F)
    half = 1
    while half < N:
        for i in range(0, N, 2 * half):
            for j in range(half):
                u = F[i + j]
                v = F[i + j + half]
                F[i + j] = u + v
                F[i + j + half] = u - v
        half *= 2
```
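A quick sanity check of the butterfly routine (repeated here so the snippet is self-contained): applying the unnormalized transform twice multiplies the input by $N$, reflecting the self-inverse property discussed below.

```python
def FWHT_inplace(F):
    # Unnormalized in-place FWHT via butterfly stages
    N = len(F)
    half = 1
    while half < N:
        for i in range(0, N, 2 * half):
            for j in range(half):
                u, v = F[i + j], F[i + j + half]
                F[i + j] = u + v
                F[i + j + half] = u - v
        half *= 2
    return F

x = [1, 0, 1, 0, 0, 1, 1, 0]
y = FWHT_inplace(list(x))
# Coefficient 0 is the plain sum of the input
assert y[0] == sum(x)
# Applying the transform twice recovers N * x (self-inverse up to scaling)
z = FWHT_inplace(list(y))
assert z == [len(x) * v for v in x]
```

Note that each stage reads and writes only two strided positions per pair, which is what makes the algorithm in-place and cache-friendly.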
Improved Arithmetic and Bit Complexities
Recent work has established theoretical improvements:
- Operation count reduction: Using a decomposition of $H_N$ as the sum of a rank-$1$ and a sparse matrix, it is possible to reduce the leading constant of the FWHT strictly below the folklore $N \log_2 N$ arithmetic operation count (Alman et al., 2022).
- Lookup tables for bit operations: Over constant-sized finite fields, precomputing and using lookup tables for small blocks yields an FWHT with asymptotically fewer bit operations than the word-level $O(N \log N)$ bound (Alman, 2022).
3. Practical Optimizations and Parallelization
Efficient implementations leverage modern computer architectures:
- In-place permutation and memory locality: Butterfly operations touch only contiguous memory, ensuring cache efficiency; no scatter/gather is required. For multi-dimensional applications, storing data so that axes are transformed in contiguous memory regions maximizes bandwidth (Georges et al., 2024, Huang et al., 31 Dec 2025, Andersson et al., 14 Jan 2026).
- Multi-threaded and SIMD: The independence of different butterfly operations at each stage enables vectorization and threading. On GPUs, butterfly stages map naturally to SIMT warps/blocks. FWHT can be parallelized at both the butterfly and batch level (Agarwal et al., 2024, Huang et al., 31 Dec 2025, Andersson et al., 14 Jan 2026).
- Hardware acceleration: Algorithms such as HadaCore exploit GPU Tensor Cores by restructuring the FWHT to map onto hardware matrix-multiply-accumulate primitives, giving substantial speedups for size-256 vectors. Larger transforms are tiled and organized via in-register transposes and shared-memory staging (Agarwal et al., 2024).
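The mapping to matrix-multiply hardware rests on the Kronecker identity $H_{MN} = H_M \otimes H_N$: reshaping a length-$MN$ vector into an $M \times N$ matrix $X$ turns the transform into two small matrix multiplies, $Y = H_M X H_N^{\top}$. A minimal NumPy sketch of this blocked form (illustrative only; HadaCore itself runs the small multiplies on FP16/BF16 Tensor Core MMA kernels):

```python
import numpy as np

def fwht(x):
    # Reference unnormalized FWHT via butterflies
    x = np.asarray(x, dtype=np.int64).copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def hadamard(n):
    H = np.array([[1]], dtype=np.int64)
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

rng = np.random.default_rng(0)
x = rng.integers(-4, 5, size=256)

# Blocked form: view x as a 16x16 matrix (row index = high 4 bits of i),
# then H_256 @ x  ==  vec(H_16 @ X @ H_16.T)
H16 = hadamard(16)
X = x.reshape(16, 16)
Y = H16 @ X @ H16.T

assert np.array_equal(Y.reshape(-1), fwht(x))
```

Two dense $16 \times 16$ multiplies replace eight butterfly stages, which is exactly the granularity Tensor Core MMA instructions accelerate.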
4. Statistical and Group-Theoretic Properties
The FWHT admits several crucial mathematical properties:
- Self-inverse (up to scaling): $H_N H_N = N I$, hence $H_N^{-1} = \frac{1}{N} H_N$.
- Energy preservation: $\|H_N x\|_2^2 = N \|x\|_2^2$, so the normalized transform is isometric.
- Group convolution: On $\mathbb{Z}_2^n$, the FWHT diagonalizes the XOR-convolution $(f * g)(z) = \sum_{x \oplus y = z} f(x)\, g(y)$:
under the transform it becomes a pointwise product, $\widehat{f * g} = \hat{f} \cdot \hat{g}$, enabling $O(N \log N)$ convolution for functions on Boolean cubes (Huang et al., 31 Dec 2025, Andersson et al., 14 Jan 2026).
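The diagonalization property can be demonstrated directly: XOR-convolving two arrays by brute force agrees with transforming, multiplying pointwise, and transforming back (inverse = FWHT divided by $N$). A minimal plain-Python sketch:

```python
def fwht(F):
    # Unnormalized FWHT (butterflies), returns a new list
    F = list(F)
    half = 1
    while half < len(F):
        for i in range(0, len(F), 2 * half):
            for j in range(half):
                u, v = F[i + j], F[i + j + half]
                F[i + j] = u + v
                F[i + j + half] = u - v
        half *= 2
    return F

def xor_convolve(f, g):
    # Direct O(N^2) XOR-convolution: (f*g)(z) = sum_{x ^ y = z} f(x) g(y)
    N = len(f)
    out = [0] * N
    for x in range(N):
        for y in range(N):
            out[x ^ y] += f[x] * g[y]
    return out

N = 8
f = [3, 1, 4, 1, 5, 9, 2, 6]
g = [2, 7, 1, 8, 2, 8, 1, 8]

# Transform, multiply pointwise, then inverse-transform
prod = [a * b for a, b in zip(fwht(f), fwht(g))]
via_fwht = [v // N for v in fwht(prod)]
assert via_fwht == xor_convolve(f, g)
```

This reduces the $O(N^2)$ direct convolution to $O(N \log N)$, the mechanism behind the compressed-multiplication application below.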
5. Major Application Domains
Quantum Information Processing
- Pauli decomposition: All Pauli string coefficients of an arbitrary $2^n \times 2^n$ matrix can be computed, with little extra memory, via an FWHT-based batched algorithm that applies an in-place XOR-permutation, a row-wise FWHT, and a phase correction (Georges et al., 2024).
- Stabilizer Rényi entropy: The computation of the second-order stabilizer Rényi entropy is reduced from brute-force enumeration over all Pauli strings to batched FWHTs, exploiting natural parallelism and in-place operations (Huang et al., 31 Dec 2025).
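The XOR-permutation plus row-wise FWHT structure can be illustrated in the $X^x Z^z$ basis (the phase correction converting $X^x Z^z$ strings to standard Pauli strings with $Y$ factors is omitted here for brevity). Since $(X^x Z^z)|j\rangle = (-1)^{\langle z, j\rangle}|j \oplus x\rangle$, the coefficient of $X^x Z^z$ in a matrix $A$ is $c_{x,z} = 2^{-n} \sum_j (-1)^{\langle z, j\rangle} A_{j \oplus x,\, j}$: an XOR-permutation of the entries followed by an FWHT over $j$. A minimal sketch, verified for one qubit:

```python
def fwht(F):
    F = list(F)
    half = 1
    while half < len(F):
        for i in range(0, len(F), 2 * half):
            for j in range(half):
                u, v = F[i + j], F[i + j + half]
                F[i + j] = u + v
                F[i + j + half] = u - v
        half *= 2
    return F

def xz_coefficients(A):
    """Coefficients c[x][z] of A = sum_{x,z} c[x][z] X^x Z^z.

    Step 1: XOR-permute entries, B_x[j] = A[j ^ x][j].
    Step 2: a row-wise FWHT over j produces the (-1)^{<z,j>} sums.
    """
    N = len(A)
    return [[v / N for v in fwht([A[j ^ x][j] for j in range(N)])]
            for x in range(N)]

# Single-qubit example
A = [[1, 2], [3, 4]]
c = xz_coefficients(A)

# Reconstruct A from the X^x Z^z basis matrices to check the decomposition
I_, Z = [[1, 0], [0, 1]], [[1, 0], [0, -1]]
X, XZ = [[0, 1], [1, 0]], [[0, -1], [1, 0]]
recon = [[sum(co * P[i][j] for co, P in
              zip((c[0][0], c[0][1], c[1][0], c[1][1]), (I_, Z, X, XZ)))
          for j in range(2)] for i in range(2)]
assert recon == A
```

The `x`-loop over rows is embarrassingly parallel, which is the batching the cited algorithm exploits.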
Compressed Matrix Computation
- Compressed multiplication via sketching: Pagh's algorithm for compressed matrix multiplication can substitute the FFT with the FWHT, preserving unbiasedness and variance bounds. In practice, the FWHT-based variant is markedly faster than FFT-based sketching and can yield speedups over dense DGEMM (Intel MKL) when the product matrix is output-sparse (Andersson et al., 14 Jan 2026).
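A minimal sketch of the FWHT substitution in Pagh-style compressed multiplication, for a single outer product $a b^{\top}$: each factor is hashed into $B$ buckets with random signs, the two sketches are XOR-convolved via FWHT, and each product entry is read back from bucket $h_1(i) \oplus h_2(j)$. The hand-picked hashes below (disjoint bit ranges, an assumption for illustration) make recovery exact; real implementations use random hashes and accept collisions in exchange for compression.

```python
import random

def fwht(F):
    F = list(F)
    half = 1
    while half < len(F):
        for i in range(0, len(F), 2 * half):
            for j in range(half):
                u, v = F[i + j], F[i + j + half]
                F[i + j] = u + v
                F[i + j + half] = u - v
        half *= 2
    return F

B = 16                      # number of buckets (power of two)
a = [3, -1, 4, 1]           # a column of the left matrix
b = [2, 0, -2, 5]           # a row of the right matrix
# Illustrative hashes: disjoint bit ranges make h1(i) ^ h2(j) injective here
h1 = lambda i: i << 2
h2 = lambda j: j
s1 = [random.choice((-1, 1)) for _ in a]   # random signs
s2 = [random.choice((-1, 1)) for _ in b]

pa = [0] * B
pb = [0] * B
for i, v in enumerate(a):
    pa[h1(i)] += s1[i] * v
for j, v in enumerate(b):
    pb[h2(j)] += s2[j] * v

# XOR-convolve the two sketches via FWHT (inverse = FWHT / B)
c = [v // B for v in fwht([x * y for x, y in zip(fwht(pa), fwht(pb))])]

# Each outer-product entry is recovered from its hashed bucket
for i in range(4):
    for j in range(4):
        assert s1[i] * s2[j] * c[h1(i) ^ h2(j)] == a[i] * b[j]
```

Summing such sketches over all rank-1 terms gives a compressed, unbiased sketch of the full product $AB$ in one pass.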
Digital Communications
- Hadamard coded modulation (HCM): The FWHT enables efficient mapping of data onto Hadamard codewords, resulting in transmission schemes with low peak-to-average-power ratios (PAPR) that outperform OFDM in high-power and nonlinear regimes. Interleaving mitigates inter-symbol interference (ISI) in dispersive channels. HCM achieves a small, bounded PAPR, whereas OFDM's PAPR grows with the number of subcarriers (Noshad et al., 2014).
Sparse Signal Processing
- Sparse FWHT (SparseFHT): For signals that are $K$-sparse in the Hadamard domain, a graph-code-inspired algorithm computes the WHT with time and sample complexities that scale with $K$ rather than $N$, via subsampling with random hashing and a belief-propagation (peeling) decoder. This is strictly sub-linear in $N$ whenever $K$ grows sub-linearly with $N$ (Scheibler et al., 2013).
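The subsampling-with-hashing core of SparseFHT rests on an aliasing identity: subsampling the signal in the time domain hashes WHT coefficients into buckets by summation. Concretely, with stride subsampling $x_{\mathrm{sub}}[j] = x[j \cdot N/B]$, the length-$B$ WHT of $x_{\mathrm{sub}}$ equals, up to scaling, the block sums of the length-$N$ WHT coefficients. A minimal plain-Python check (this bit convention is one of several possible subsampling choices):

```python
def fwht(F):
    F = list(F)
    half = 1
    while half < len(F):
        for i in range(0, len(F), 2 * half):
            for j in range(half):
                u, v = F[i + j], F[i + j + half]
                F[i + j] = u + v
                F[i + j + half] = u - v
        half *= 2
    return F

N, B = 32, 8                # signal length and number of buckets
stride = N // B

x = [(7 * i * i + 3) % 11 - 5 for i in range(N)]   # arbitrary test signal
X = fwht(x)                                         # full length-N WHT

# Subsample with stride N/B, then take a length-B WHT
x_sub = [x[j * stride] for j in range(B)]
X_sub = fwht(x_sub)

# Aliasing identity: bucket k holds (1/stride) * the sum of the
# coefficients X[k*stride : (k+1)*stride]
for k in range(B):
    assert stride * X_sub[k] == sum(X[k * stride:(k + 1) * stride])
```

When at most one large coefficient lands in a bucket, its value and index can be peeled off from a few such cheap length-$B$ transforms, which is what yields the sub-linear complexity.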
6. Algorithmic Extensions and Specialized Methods
Table: Notable FWHT Variants and Theoretical Innovations
| Paper | Main Innovation | Complexity Improvement |
|---|---|---|
| (Alman et al., 2022) | Low-rank + sparse decomposition for FWHT | Leading constant below $N \log_2 N$ ops |
| (Alman, 2022) | Lookup tables for finite-field WHT | Fewer bit operations |
| (Andersson et al., 14 Jan 2026) | FWHT for compressed matrix sketching | Faster than FFT-based sketching |
| (Noshad et al., 2014) | FWHT in HCM for PAPR reduction in communications | Bounded PAPR vs. growing OFDM PAPR |
| (Scheibler et al., 2013) | SparseFHT for $K$-sparse signals | Sub-linear time and sample complexity |
Algorithmic improvements include blockwise recursion with higher-radix kernels, leveraging matrix non-rigidity, tailored bit-block algorithms for finite fields, and the use of hardware-specific kernels (e.g., Tensor Core blocked FWHTs) (Agarwal et al., 2024, Alman et al., 2022, Alman, 2022).
7. Numerical Benchmarks and Implementation Comparisons
Empirical comparisons on CPUs (AMD EPYC, 64-core nodes) and GPUs (NVIDIA A100/H100) reveal that:
- FWHT-based implementations outperform FFT-based ones for integer-valued and real transforms, since only additions and subtractions are required.
- Blocked, in-place, and hardware-optimized FWHT kernels yield gains over previous libraries (e.g., Dao fast-hadamard-transform), with minimal numerical error even in low-precision FP16/BF16 (Agarwal et al., 2024, Andersson et al., 14 Jan 2026).
- In Pauli decomposition for $n$-qubit matrices, the FWHT method is substantially faster than previous tensorized and C++ FWHT implementations (Georges et al., 2024).
References
- "Pauli Decomposition via the Fast Walsh–Hadamard Transform" (Georges et al., 2024)
- "Engineering Compressed Matrix Multiplication with the Fast Walsh–Hadamard Transform" (Andersson et al., 14 Jan 2026)
- "A fast and exact algorithm for stabilizer Rényi entropy via XOR-FWHT" (Huang et al., 31 Dec 2025)
- "HadaCore: Tensor Core Accelerated Hadamard Transform Kernel" (Agarwal et al., 2024)
- "Hadamard Coded Modulation: An Alternative to OFDM for Optical Wireless Communications" (Noshad et al., 2014)
- "Faster Walsh-Hadamard and Discrete Fourier Transforms From Matrix Non-Rigidity" (Alman et al., 2022)
- "Faster Walsh-Hadamard Transform and Matrix Multiplication over Finite Fields using Lookup Tables" (Alman, 2022)
- "A Fast Hadamard Transform for Signals with Sub-linear Sparsity in the Transform Domain" (Scheibler et al., 2013)