Fast Walsh–Hadamard Transform (FWHT)
- FWHT is a divide-and-conquer algorithm that computes the Walsh–Hadamard transform using solely addition and subtraction, achieving O(N log N) complexity.
- It employs in-place butterfly operations and exploits memory locality and parallelism for efficiency on modern CPU and GPU architectures.
- FWHT underpins diverse applications in digital communications, quantum information processing, and compressed matrix multiplication with significant speedups.
The Fast Walsh–Hadamard Transform (FWHT) is an in-place, divide-and-conquer algorithm for multiplying an input vector by the $N \times N$ Walsh–Hadamard matrix $H_N$, where $N$ is a power of two. The algorithm's simplicity, reliance only on addition and subtraction, and $O(N \log N)$ arithmetic complexity have made it a staple in domains ranging from digital communications and compressed linear algebra to quantum information processing.
1. Mathematical Definition and Structure
Let $N = 2^n$. The (unnormalized) Walsh–Hadamard matrix $H_N$ is defined recursively as:

$$H_1 = \begin{pmatrix} 1 \end{pmatrix}, \qquad H_{2M} = \begin{pmatrix} H_M & H_M \\ H_M & -H_M \end{pmatrix}.$$
Given $x \in \mathbb{R}^N$, the FWHT computes the transform $y = H_N x$, i.e., $y_k = \sum_{j=0}^{N-1} (H_N)_{kj}\, x_j$. Equivalently, its entries satisfy $(H_N)_{ij} = (-1)^{\langle i, j \rangle}$, where $\langle i, j \rangle$ is the bitwise dot product modulo 2 of the binary expansions of $i$ and $j$. The normalized form, $\tilde{H}_N = H_N / \sqrt{N}$, renders the transform orthonormal: $\tilde{H}_N^\top \tilde{H}_N = I$, so $\tilde{H}_N^{-1} = \tilde{H}_N$ (Andersson et al., 14 Jan 2026, Huang et al., 31 Dec 2025).
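As a concrete check of these definitions, the following sketch (plain Python, no external dependencies) builds $H_N$ from the recursion and verifies both the entrywise sign formula and the self-inverse property:

```python
# Minimal illustration of the recursive definition of H_N and the
# entrywise formula (H_N)_{ij} = (-1)^{<i,j>} (bitwise dot product mod 2).

def hadamard(N):
    """Build the unnormalized Walsh-Hadamard matrix recursively."""
    if N == 1:
        return [[1]]
    H = hadamard(N // 2)
    top = [row + row for row in H]
    bot = [row + [-v for v in row] for row in H]
    return top + bot

def sign(i, j):
    # (-1)^{popcount(i & j)}
    return -1 if bin(i & j).count("1") % 2 else 1

N = 8
H = hadamard(N)
assert all(H[i][j] == sign(i, j) for i in range(N) for j in range(N))

# Self-inverse up to scaling: H_N H_N = N I
HH = [[sum(H[i][k] * H[k][j] for k in range(N)) for j in range(N)]
      for i in range(N)]
assert all(HH[i][j] == (N if i == j else 0)
           for i in range(N) for j in range(N))
```

The recursion makes the $O(N \log N)$ butterfly structure visible: each doubling step only copies and negates existing blocks.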
2. Core Algorithms and Fast Implementations
Standard In-Place “Butterfly” FWHT
The classical FWHT, also known as Yates’ algorithm, employs butterfly operations:
- For each stage $s$ (from $0$ to $n-1$), pair elements whose indices differ only in bit $s$.
- For each such pair $(u, v)$, update it to $(u + v, u - v)$.
Formally, for an array $F$ of length $N$:
```python
def FWHT_inplace(F):
    N = len(F)
    half = 1
    while half < N:
        for i in range(0, N, 2 * half):
            for j in range(half):
                u = F[i + j]
                v = F[i + j + half]
                F[i + j] = u + v
                F[i + j + half] = u - v
        half *= 2
```
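A quick sanity check of the butterfly routine (repeated here so the snippet is self-contained): applying the unnormalized transform twice multiplies the input by $N$, reflecting the self-inverse property discussed below.

```python
def FWHT_inplace(F):
    # Unnormalized in-place FWHT via butterfly stages
    N = len(F)
    half = 1
    while half < N:
        for i in range(0, N, 2 * half):
            for j in range(half):
                u, v = F[i + j], F[i + j + half]
                F[i + j] = u + v
                F[i + j + half] = u - v
        half *= 2
    return F

x = [1, 0, 1, 0, 0, 1, 1, 0]
y = FWHT_inplace(list(x))
# Coefficient 0 is the plain sum of the input
assert y[0] == sum(x)
# Applying the transform twice recovers N * x (self-inverse up to scaling)
z = FWHT_inplace(list(y))
assert z == [len(x) * v for v in x]
```

Note that each stage reads and writes only two strided positions per pair, which is what makes the algorithm in-place and cache-friendly.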
Improved Arithmetic and Bit Complexities
Recent work has established theoretical improvements:
- Operation count reduction: Using a decomposition of $H_N$ as the sum of a rank-$1$ and a sparse matrix, it is possible to reduce the leading constant of the FWHT strictly below the folklore $N \log_2 N$ arithmetic operation count (Alman et al., 2022).
- Lookup tables for bit operations: Over constant-sized finite fields, precomputing and using lookup tables for small blocks yields an FWHT with asymptotically fewer bit operations than the word-level $O(N \log N)$ bound (Alman, 2022).
3. Practical Optimizations and Parallelization
Efficient implementations leverage modern computer architectures:
- In-place permutation and memory locality: Butterfly operations touch only contiguous memory, ensuring cache efficiency; no scatter/gather is required. For multi-dimensional applications, storing data so that axes are transformed in contiguous memory regions maximizes bandwidth (Georges et al., 2024, Huang et al., 31 Dec 2025, Andersson et al., 14 Jan 2026).
- Multi-threaded and SIMD: The independence of different butterfly operations at each stage enables vectorization and threading. On GPUs, butterfly stages map naturally to SIMT warps/blocks. FWHT can be parallelized at both the butterfly and batch level (Agarwal et al., 2024, Huang et al., 31 Dec 2025, Andersson et al., 14 Jan 2026).
- Hardware acceleration: Algorithms such as HadaCore exploit GPU Tensor Cores by restructuring the FWHT to map onto hardware matrix-multiply-accumulate primitives, giving substantial speedups for size-256 vectors. Larger transforms are tiled and organized via in-register transposes and shared-memory staging (Agarwal et al., 2024).
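The mapping to matrix-multiply hardware rests on the Kronecker identity $H_{MN} = H_M \otimes H_N$: reshaping a length-$MN$ vector into an $M \times N$ matrix $X$ turns the transform into two small matrix multiplies, $Y = H_M X H_N^{\top}$. A minimal NumPy sketch of this blocked form (illustrative only; HadaCore itself runs the small multiplies on FP16/BF16 Tensor Core MMA kernels):

```python
import numpy as np

def fwht(x):
    # Reference unnormalized FWHT via butterflies
    x = np.asarray(x, dtype=np.int64).copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def hadamard(n):
    H = np.array([[1]], dtype=np.int64)
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

rng = np.random.default_rng(0)
x = rng.integers(-4, 5, size=256)

# Blocked form: view x as a 16x16 matrix (row index = high 4 bits of i),
# then H_256 @ x  ==  vec(H_16 @ X @ H_16.T)
H16 = hadamard(16)
X = x.reshape(16, 16)
Y = H16 @ X @ H16.T

assert np.array_equal(Y.reshape(-1), fwht(x))
```

Two dense $16 \times 16$ multiplies replace eight butterfly stages, which is exactly the granularity Tensor Core MMA instructions accelerate.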
4. Statistical and Group-Theoretic Properties
The FWHT admits several crucial mathematical properties:
- Self-inverse (up to scaling): $H_N H_N = N I$, hence $H_N^{-1} = \frac{1}{N} H_N$.
- Energy preservation: $\|H_N x\|_2^2 = N \|x\|_2^2$, so the normalized transform is isometric.
- Group convolution: On $\mathbb{Z}_2^n$, the FWHT diagonalizes the XOR-convolution $(f * g)(z) = \sum_{x \oplus y = z} f(x)\, g(y)$:
under the transform it becomes a pointwise product, $\widehat{f * g} = \hat{f} \cdot \hat{g}$, enabling $O(N \log N)$ convolution for functions on Boolean cubes (Huang et al., 31 Dec 2025, Andersson et al., 14 Jan 2026).
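The diagonalization property can be demonstrated directly: XOR-convolving two arrays by brute force agrees with transforming, multiplying pointwise, and transforming back (inverse = FWHT divided by $N$). A minimal plain-Python sketch:

```python
def fwht(F):
    # Unnormalized FWHT (butterflies), returns a new list
    F = list(F)
    half = 1
    while half < len(F):
        for i in range(0, len(F), 2 * half):
            for j in range(half):
                u, v = F[i + j], F[i + j + half]
                F[i + j] = u + v
                F[i + j + half] = u - v
        half *= 2
    return F

def xor_convolve(f, g):
    # Direct O(N^2) XOR-convolution: (f*g)(z) = sum_{x ^ y = z} f(x) g(y)
    N = len(f)
    out = [0] * N
    for x in range(N):
        for y in range(N):
            out[x ^ y] += f[x] * g[y]
    return out

N = 8
f = [3, 1, 4, 1, 5, 9, 2, 6]
g = [2, 7, 1, 8, 2, 8, 1, 8]

# Transform, multiply pointwise, then inverse-transform
prod = [a * b for a, b in zip(fwht(f), fwht(g))]
via_fwht = [v // N for v in fwht(prod)]
assert via_fwht == xor_convolve(f, g)
```

This reduces the $O(N^2)$ direct convolution to $O(N \log N)$, the mechanism behind the compressed-multiplication application below.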
5. Major Application Domains
Quantum Information Processing
- Pauli decomposition: All Pauli string coefficients of an arbitrary $2^n \times 2^n$ matrix can be computed, with little extra memory, via an FWHT-based batched algorithm that applies an in-place XOR-permutation, a row-wise FWHT, and a phase correction (Georges et al., 2024).
- Stabilizer Rényi entropy: The computation of the second-order stabilizer Rényi entropy is reduced from brute-force enumeration over all Pauli strings to batched FWHTs, exploiting natural parallelism and in-place operations (Huang et al., 31 Dec 2025).
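The XOR-permutation plus row-wise FWHT structure can be illustrated in the $X^x Z^z$ basis (the phase correction converting $X^x Z^z$ strings to standard Pauli strings with $Y$ factors is omitted here for brevity). Since $(X^x Z^z)|j\rangle = (-1)^{\langle z, j\rangle}|j \oplus x\rangle$, the coefficient of $X^x Z^z$ in a matrix $A$ is $c_{x,z} = 2^{-n} \sum_j (-1)^{\langle z, j\rangle} A_{j \oplus x,\, j}$: an XOR-permutation of the entries followed by an FWHT over $j$. A minimal sketch, verified for one qubit:

```python
def fwht(F):
    F = list(F)
    half = 1
    while half < len(F):
        for i in range(0, len(F), 2 * half):
            for j in range(half):
                u, v = F[i + j], F[i + j + half]
                F[i + j] = u + v
                F[i + j + half] = u - v
        half *= 2
    return F

def xz_coefficients(A):
    """Coefficients c[x][z] of A = sum_{x,z} c[x][z] X^x Z^z.

    Step 1: XOR-permute entries, B_x[j] = A[j ^ x][j].
    Step 2: a row-wise FWHT over j produces the (-1)^{<z,j>} sums.
    """
    N = len(A)
    return [[v / N for v in fwht([A[j ^ x][j] for j in range(N)])]
            for x in range(N)]

# Single-qubit example
A = [[1, 2], [3, 4]]
c = xz_coefficients(A)

# Reconstruct A from the X^x Z^z basis matrices to check the decomposition
I_, Z = [[1, 0], [0, 1]], [[1, 0], [0, -1]]
X, XZ = [[0, 1], [1, 0]], [[0, -1], [1, 0]]
recon = [[sum(co * P[i][j] for co, P in
              zip((c[0][0], c[0][1], c[1][0], c[1][1]), (I_, Z, X, XZ)))
          for j in range(2)] for i in range(2)]
assert recon == A
```

The `x`-loop over rows is embarrassingly parallel, which is the batching the cited algorithm exploits.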
Compressed Matrix Computation
- Compressed multiplication via sketching: Pagh's algorithm for compressed matrix multiplication can substitute the FFT with the FWHT, preserving unbiasedness and variance bounds. In practice, the FWHT-based variant is markedly faster than FFT-based sketching and can yield speedups over dense DGEMM (Intel MKL) when the product matrix is output-sparse (Andersson et al., 14 Jan 2026).
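A minimal sketch of the FWHT substitution in Pagh-style compressed multiplication, for a single outer product $a b^{\top}$: each factor is hashed into $B$ buckets with random signs, the two sketches are XOR-convolved via FWHT, and each product entry is read back from bucket $h_1(i) \oplus h_2(j)$. The hand-picked hashes below (disjoint bit ranges, an assumption for illustration) make recovery exact; real implementations use random hashes and accept collisions in exchange for compression.

```python
import random

def fwht(F):
    F = list(F)
    half = 1
    while half < len(F):
        for i in range(0, len(F), 2 * half):
            for j in range(half):
                u, v = F[i + j], F[i + j + half]
                F[i + j] = u + v
                F[i + j + half] = u - v
        half *= 2
    return F

B = 16                      # number of buckets (power of two)
a = [3, -1, 4, 1]           # a column of the left matrix
b = [2, 0, -2, 5]           # a row of the right matrix
# Illustrative hashes: disjoint bit ranges make h1(i) ^ h2(j) injective here
h1 = lambda i: i << 2
h2 = lambda j: j
s1 = [random.choice((-1, 1)) for _ in a]   # random signs
s2 = [random.choice((-1, 1)) for _ in b]

pa = [0] * B
pb = [0] * B
for i, v in enumerate(a):
    pa[h1(i)] += s1[i] * v
for j, v in enumerate(b):
    pb[h2(j)] += s2[j] * v

# XOR-convolve the two sketches via FWHT (inverse = FWHT / B)
c = [v // B for v in fwht([x * y for x, y in zip(fwht(pa), fwht(pb))])]

# Each outer-product entry is recovered from its hashed bucket
for i in range(4):
    for j in range(4):
        assert s1[i] * s2[j] * c[h1(i) ^ h2(j)] == a[i] * b[j]
```

Summing such sketches over all rank-1 terms gives a compressed, unbiased sketch of the full product $AB$ in one pass.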
Digital Communications
- Hadamard coded modulation (HCM): The FWHT enables efficient mapping of data onto Hadamard codewords, resulting in transmission schemes with low peak-to-average-power ratios (PAPR) that outperform OFDM in high-power and nonlinear regimes. Interleaving mitigates inter-symbol interference (ISI) in dispersive channels. HCM achieves a small, bounded PAPR, whereas OFDM's PAPR grows with the number of subcarriers (Noshad et al., 2014).
Sparse Signal Processing
- Sparse FWHT (SparseFHT): For signals that are $K$-sparse in the Hadamard domain, a graph-code-inspired algorithm computes the WHT with time and sample complexities that scale with $K$ rather than $N$, via subsampling with random hashing and a belief-propagation (peeling) decoder. This is strictly sub-linear in $N$ whenever $K$ grows sub-linearly with $N$ (Scheibler et al., 2013).
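The subsampling-with-hashing core of SparseFHT rests on an aliasing identity: subsampling the signal in the time domain hashes WHT coefficients into buckets by summation. Concretely, with stride subsampling $x_{\mathrm{sub}}[j] = x[j \cdot N/B]$, the length-$B$ WHT of $x_{\mathrm{sub}}$ equals, up to scaling, the block sums of the length-$N$ WHT coefficients. A minimal plain-Python check (this bit convention is one of several possible subsampling choices):

```python
def fwht(F):
    F = list(F)
    half = 1
    while half < len(F):
        for i in range(0, len(F), 2 * half):
            for j in range(half):
                u, v = F[i + j], F[i + j + half]
                F[i + j] = u + v
                F[i + j + half] = u - v
        half *= 2
    return F

N, B = 32, 8                # signal length and number of buckets
stride = N // B

x = [(7 * i * i + 3) % 11 - 5 for i in range(N)]   # arbitrary test signal
X = fwht(x)                                         # full length-N WHT

# Subsample with stride N/B, then take a length-B WHT
x_sub = [x[j * stride] for j in range(B)]
X_sub = fwht(x_sub)

# Aliasing identity: bucket k holds (1/stride) * the sum of the
# coefficients X[k*stride : (k+1)*stride]
for k in range(B):
    assert stride * X_sub[k] == sum(X[k * stride:(k + 1) * stride])
```

When at most one large coefficient lands in a bucket, its value and index can be peeled off from a few such cheap length-$B$ transforms, which is what yields the sub-linear complexity.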
6. Algorithmic Extensions and Specialized Methods
Table: Notable FWHT Variants and Theoretical Innovations
| Paper | Main Innovation | Complexity Improvement |
|---|---|---|
| (Alman et al., 2022) | Low-rank + sparse decomposition for FWHT | Leading constant below $N \log_2 N$ ops |
| (Alman, 2022) | Lookup tables for finite-field WHT | Fewer bit operations |
| (Andersson et al., 14 Jan 2026) | FWHT for compressed matrix sketching | Faster than FFT-based sketching |
| (Noshad et al., 2014) | FWHT in HCM for PAPR reduction in communications | Bounded PAPR vs. growing OFDM PAPR |
| (Scheibler et al., 2013) | SparseFHT for $K$-sparse signals | Sub-linear time and sample complexity |
Algorithmic improvements include blockwise recursion with higher-radix kernels, leveraging matrix non-rigidity, tailored bit-block algorithms for finite fields, and the use of hardware-specific kernels (e.g., Tensor Core blocked FWHTs) (Agarwal et al., 2024, Alman et al., 2022, Alman, 2022).
7. Numerical Benchmarks and Implementation Comparisons
Empirical comparisons on CPUs (AMD EPYC, 64-core nodes) and GPUs (NVIDIA A100/H100) reveal that:
- FWHT-based implementations outperform FFT-based ones for integer-valued and real transforms, since only additions and subtractions are required.
- Blocked, in-place, and hardware-optimized FWHT kernels yield gains over previous libraries (e.g., Dao fast-hadamard-transform), with minimal numerical error even in low-precision FP16/BF16 (Agarwal et al., 2024, Andersson et al., 14 Jan 2026).
- In Pauli decomposition for $n$-qubit matrices, the FWHT method is substantially faster than previous tensorized and C++ FWHT implementations (Georges et al., 2024).
References
- "Pauli Decomposition via the Fast Walsh–Hadamard Transform" (Georges et al., 2024)
- "Engineering Compressed Matrix Multiplication with the Fast Walsh–Hadamard Transform" (Andersson et al., 14 Jan 2026)
- "A fast and exact algorithm for stabilizer Rényi entropy via XOR-FWHT" (Huang et al., 31 Dec 2025)
- "HadaCore: Tensor Core Accelerated Hadamard Transform Kernel" (Agarwal et al., 2024)
- "Hadamard Coded Modulation: An Alternative to OFDM for Optical Wireless Communications" (Noshad et al., 2014)
- "Faster Walsh-Hadamard and Discrete Fourier Transforms From Matrix Non-Rigidity" (Alman et al., 2022)
- "Faster Walsh-Hadamard Transform and Matrix Multiplication over Finite Fields using Lookup Tables" (Alman, 2022)
- "A Fast Hadamard Transform for Signals with Sub-linear Sparsity in the Transform Domain" (Scheibler et al., 2013)