
Fast Walsh–Hadamard Transform (FWHT)

Updated 28 January 2026
  • FWHT is a divide-and-conquer algorithm that computes the Walsh–Hadamard transform using solely addition and subtraction, achieving O(N log N) complexity.
  • It employs in-place butterfly operations and exploits memory locality and parallelism for efficiency on modern CPU and GPU architectures.
  • FWHT underpins diverse applications in digital communications, quantum information processing, and compressed matrix multiplication with significant speedups.

The Fast Walsh–Hadamard Transform (FWHT) is an in-place, divide-and-conquer algorithm for multiplying an input vector by the Walsh–Hadamard matrix $H_N$, where $N$ is a power of two. The algorithm's simplicity, reliance only on addition/subtraction, and optimal $O(N\log N)$ arithmetic complexity have made it a staple in domains ranging from digital communications and compressed linear algebra to quantum information processing.

1. Mathematical Definition and Structure

Let $N = 2^n$. The (unnormalized) Walsh–Hadamard matrix $H_N \in \{\pm1\}^{N\times N}$ is defined recursively as:

  • $H_1 = [1]$
  • $H_{2n} = \begin{pmatrix} H_n & H_n \\ H_n & -H_n \end{pmatrix}$

Given $x \in \mathbb{R}^N$, the FWHT computes the transform $y = H_N x$, i.e., $y_i = \sum_{j=0}^{N-1} (H_N)_{i,j}\,x_j$. Equivalently, the entries satisfy $y_k = \sum_{j=0}^{N-1} (-1)^{\langle k,j\rangle} x_j$, where $\langle k,j\rangle$ is the bitwise dot product modulo 2. The normalized form $(1/\sqrt{N})\,H_N$ is orthonormal: $H_N H_N^\top = N I_N$, so $H_N^{-1} = (1/N)H_N$ (Andersson et al., 14 Jan 2026; Huang et al., 31 Dec 2025).
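The two definitions above (the recursive block construction and the bitwise-dot-product formula) can be checked against each other in a few lines of NumPy. This is an illustration, not code from any of the cited papers; the function names are our own.

```python
import numpy as np

def hadamard(n):
    """Unnormalized Walsh-Hadamard matrix H_N for N = 2**n, built recursively."""
    H = np.array([[1]])
    for _ in range(n):
        H = np.block([[H, H], [H, -H]])
    return H

def hadamard_entry(k, j):
    """Entry (H_N)_{k,j} = (-1)^<k,j>, where <k,j> is the bitwise dot product mod 2."""
    return -1 if bin(k & j).count("1") % 2 else 1

n = 3
N = 2 ** n
H = hadamard(n)
# The recursive and bitwise definitions agree entrywise.
assert all(H[k, j] == hadamard_entry(k, j) for k in range(N) for j in range(N))
# Orthogonality: H_N H_N^T = N I_N, hence H_N^{-1} = (1/N) H_N.
assert np.array_equal(H @ H.T, N * np.eye(N, dtype=int))
```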

2. Core Algorithms and Fast Implementations

Standard In-Place “Butterfly” FWHT

The classical FWHT, also known as Yates’ algorithm, employs butterfly operations:

  • For each stage $\ell$ (from $0$ to $n-1$), pair elements whose indices differ at bit $\ell$.
  • For each such pair $(a, b)$, update to $(a+b,\, a-b)$.

Formally, for an array $F \in \mathbb{C}^N$:

def FWHT_inplace(F):
    """In-place unnormalized Walsh-Hadamard transform (Yates' algorithm)."""
    N = len(F)
    half = 1
    while half < N:                    # log2(N) butterfly stages
        for i in range(0, N, 2*half):  # independent blocks of size 2*half
            for j in range(half):
                u = F[i+j]
                v = F[i+j+half]
                F[i+j] = u + v         # butterfly: (u, v) -> (u+v, u-v)
                F[i+j+half] = u - v
        half *= 2

This routine requires $O(N\log N)$ additions/subtractions and no multiplications beyond $\pm1$ (Huang et al., 31 Dec 2025; Noshad et al., 2014). The transform is entirely in-place, using $O(1)$ auxiliary space.
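A quick sanity check of the routine above (repeated here so the snippet is self-contained): since $H_N^{-1} = (1/N)H_N$, applying the transform twice and dividing by $N$ must recover the input.

```python
def FWHT_inplace(F):
    """In-place unnormalized Walsh-Hadamard transform (as in the listing above)."""
    N = len(F)
    half = 1
    while half < N:
        for i in range(0, N, 2*half):
            for j in range(half):
                u, v = F[i+j], F[i+j+half]
                F[i+j], F[i+j+half] = u + v, u - v
        half *= 2

F = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
original = list(F)
FWHT_inplace(F)  # y = H_N x
FWHT_inplace(F)  # H_N y = N x
assert [v / len(F) for v in F] == original  # self-inverse up to scaling by N
```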

Improved Arithmetic and Bit Complexities

Recent work has established theoretical improvements:

  • Operation count reduction: Using a decomposition of $H_8$ as the sum of a rank-$1$ and a sparse matrix, the leading constant of the FWHT can be reduced to $(23/24)N\log_2 N + O(N)$ arithmetic operations, compared to the folklore $N\log_2 N$ (Alman et al., 2022).
  • Lookup tables for bit operations: Over constant-sized fields, precomputing lookup tables for small blocks enables an FWHT in $O(N\log N/\log\log N)$ bit operations (Alman, 2022).

3. Practical Optimizations and Parallelization

Efficient implementations leverage modern computer architectures:

  • In-place permutation and memory locality: Butterfly operations touch only contiguous memory, ensuring cache efficiency; no scatter/gather is required. For multi-dimensional applications, storing data so that axes are transformed in contiguous memory regions maximizes bandwidth (Georges et al., 2024; Huang et al., 31 Dec 2025; Andersson et al., 14 Jan 2026).
  • Multi-threaded and SIMD execution: The independence of butterfly operations within each stage enables vectorization and threading. On GPUs, butterfly stages map naturally to SIMT warps/blocks, and the FWHT can be parallelized at both the butterfly and the batch level (Agarwal et al., 2024; Huang et al., 31 Dec 2025; Andersson et al., 14 Jan 2026).
  • Hardware acceleration: Algorithms such as HadaCore exploit GPU Tensor Cores by restructuring the FWHT to map onto hardware matrix-multiply-accumulate primitives, giving speedups of up to $3.5\times$ for size-256 vectors. Larger transforms are tiled and organized via in-register transposes and shared-memory staging (Agarwal et al., 2024).
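The vectorization opportunity described above comes from the fact that every butterfly within a stage is independent. A minimal NumPy sketch (our own illustration, not code from the cited papers) expresses each stage as one batched add/subtract over the whole array, which is the same structure SIMD and GPU kernels exploit:

```python
import numpy as np

def fwht_batch(X):
    """Unnormalized FWHT along the last axis of a batch of vectors.

    Each butterfly stage is a single vectorized add/sub over the entire
    batch; the per-element loops of the scalar version disappear.
    """
    X = np.asarray(X, dtype=np.float64).copy()
    N = X.shape[-1]
    h = 1
    while h < N:
        # View the last axis as (blocks, pair-half, h): positions i and i+h
        # within each block of size 2h form a butterfly pair.
        Y = X.reshape(*X.shape[:-1], N // (2 * h), 2, h)
        a = Y[..., 0, :].copy()
        b = Y[..., 1, :].copy()
        Y[..., 0, :] = a + b
        Y[..., 1, :] = a - b
        h *= 2
    return X
```

Because `Y` is a view of `X`, the butterflies still happen in place; only one stage-sized temporary (`a`, `b`) is needed.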

4. Statistical and Group-Theoretic Properties

The FWHT admits several crucial mathematical properties:

  • Self-inverse (up to scaling): $H_N^{-1} = (1/N)H_N$.
  • Energy preservation: $\|H_N x\|_2^2 = N\|x\|_2^2$, so the normalized transform is an isometry.
  • Group convolution: On $\mathbb{Z}_2^n$, the FWHT diagonalizes the XOR-convolution

$$(f \oplus g)(w) = \sum_{x \oplus y = w} f(x)\,g(y)$$

since $\widehat{f \oplus g} = \hat{f} \cdot \hat{g}$ (pointwise product), enabling $O(N\log N)$ convolution for functions on Boolean cubes (Huang et al., 31 Dec 2025; Andersson et al., 14 Jan 2026).
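The convolution theorem above translates directly into code: transform both functions, multiply pointwise, transform back, and divide by $N$. A short self-contained check against the $O(N^2)$ definition (our own illustration):

```python
import numpy as np

def fwht(x):
    """Unnormalized in-place FWHT, returned as a new array."""
    x = np.asarray(x, dtype=np.float64).copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i+h].copy()
            b = x[i+h:i+2*h].copy()
            x[i:i+h], x[i+h:i+2*h] = a + b, a - b
        h *= 2
    return x

def xor_convolve(f, g):
    """XOR (dyadic) convolution via the FWHT diagonalization."""
    N = len(f)
    return fwht(fwht(f) * fwht(g)) / N

# Compare with the direct O(N^2) definition (f + g)(w) = sum_{x XOR y = w} f(x) g(y).
rng = np.random.default_rng(0)
N = 8
f, g = rng.standard_normal(N), rng.standard_normal(N)
naive = np.zeros(N)
for x in range(N):
    for y in range(N):
        naive[x ^ y] += f[x] * g[y]
assert np.allclose(xor_convolve(f, g), naive)
```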

5. Major Application Domains

Quantum Information Processing

  • Pauli decomposition: All Pauli string coefficients of an arbitrary $2^n \times 2^n$ matrix $A$ can be computed in $O(N^2\log N)$ time and $O(1)$ extra memory via an FWHT-based batched algorithm, which applies an in-place XOR-permutation, a row-wise FWHT, and a phase correction (Georges et al., 2024).
  • Stabilizer Rényi entropy: The computation of the second-order stabilizer entropy is reduced from $O(8^N)$ to $O(N4^N)$ using $2^N$ FWHTs of length $2^N$, exploiting natural parallelism and in-place operations (Huang et al., 31 Dec 2025).
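For context on what the Pauli-decomposition algorithm computes: the coefficients are $c_P = \mathrm{Tr}(P A)/2^n$ over all $4^n$ Pauli strings $P$. The sketch below is a naive reference implementation of that definition (our own illustration, not the batched FWHT algorithm of Georges et al., which computes the same coefficients far faster):

```python
import numpy as np
from itertools import product

# Single-qubit Pauli matrices.
PAULIS = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def pauli_string(labels):
    """Kronecker product of single-qubit Paulis, e.g. 'XZ' -> X (x) Z."""
    P = np.array([[1]], dtype=complex)
    for s in labels:
        P = np.kron(P, PAULIS[s])
    return P

def pauli_coefficients(A):
    """Naive reference: c_P = Tr(P A) / 2^n for every n-qubit Pauli string P."""
    n = int(np.log2(A.shape[0]))
    return {"".join(lbl): np.trace(pauli_string(lbl) @ A) / 2**n
            for lbl in product("IXYZ", repeat=n)}

# The coefficients reconstruct A, since Pauli strings form an orthogonal basis.
rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
recon = sum(c * pauli_string(lbl) for lbl, c in pauli_coefficients(M).items())
assert np.allclose(recon, M)
```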

Compressed Matrix Computation

  • Compressed multiplication via sketching: Pagh's algorithm for compressed matrix multiplication can substitute the FFT with the FWHT, preserving unbiasedness and variance bounds. In practice, the FWHT-based variant is up to $4\times$ faster than FFT-based sketching and can yield $40\times$ speedups over dense DGEMM (Intel MKL) when the product matrix is output-sparse (Andersson et al., 14 Jan 2026).

Digital Communications

  • Hadamard coded modulation (HCM): The FWHT enables efficient mapping of data onto Hadamard codewords, yielding transmission schemes with a low peak-to-average power ratio (PAPR) that outperform OFDM in high-power and nonlinear regimes. Interleaving mitigates inter-symbol interference (ISI) in dispersive channels. HCM achieves PAPR $= 2$, whereas OFDM's PAPR grows with $\log N$ (Noshad et al., 2014).

Sparse Signal Processing

  • Sparse FWHT (SparseFHT): For $K$-sparse signals in the Hadamard domain, a graph-code-inspired algorithm computes the WHT in $O(K\log K\,\log(N/K))$ time using $O(K\log(N/K))$ samples, via subsampling with random hashing and a belief-propagation (peeling) decoder. This is strictly sub-linear in $N$ for $K \ll N$ (Scheibler et al., 2013).

6. Algorithmic Extensions and Specialized Methods

Table: Notable FWHT Variants and Theoretical Innovations

| Paper | Main Innovation | Complexity Improvement |
|---|---|---|
| (Alman et al., 2022) | Low-rank + sparse decomposition for FWHT | $(23/24)\,N\log N$ ops |
| (Alman, 2022) | Lookup tables for $\mathbb{F}_q$-valued WHT | $O(N\log N/\log\log N)$ bit ops |
| (Andersson et al., 14 Jan 2026) | FWHT for compressed matrix sketching | $2\text{--}4\times$ speedup over FFT |
| (Noshad et al., 2014) | FWHT in HCM for PAPR reduction | PAPR $=2$ vs. OFDM |
| (Scheibler et al., 2013) | SparseFHT for $K$-sparse signals | $O(K\log K\,\log(N/K))$ |

Algorithmic improvements include blockwise recursion with higher-radix kernels, leveraging matrix non-rigidity, tailored bit-block algorithms for finite fields, and the use of hardware-specific kernels (e.g., Tensor Core blocked FWHTs) (Agarwal et al., 2024, Alman et al., 2022, Alman, 2022).

7. Numerical Benchmarks and Implementation Comparisons

Empirical comparisons on CPUs (AMD EPYC, 64-core nodes) and GPUs (NVIDIA A100/H100) reveal that:

  • FWHT-based implementations outperform the FFT for integer-valued and real transforms where only $\pm1$ arithmetic is needed.
  • Blocked, in-place, and hardware-optimized FWHT kernels yield $1.1\text{--}3.6\times$ gains over previous libraries (e.g., Dao fast-hadamard-transform), with minimal numerical error even in low-precision FP16/BF16 (Agarwal et al., 2024; Andersson et al., 14 Jan 2026).
  • In Pauli decomposition for $n \geq 5$ qubits, the FWHT method is $1.4\times$ to $3.6\times$ faster than previous tensorized and C++ FWHT implementations (Georges et al., 2024).

References

  • "Pauli Decomposition via the Fast Walsh–Hadamard Transform" (Georges et al., 2024)
  • "Engineering Compressed Matrix Multiplication with the Fast Walsh–Hadamard Transform" (Andersson et al., 14 Jan 2026)
  • "A fast and exact algorithm for stabilizer Rényi entropy via XOR-FWHT" (Huang et al., 31 Dec 2025)
  • "HadaCore: Tensor Core Accelerated Hadamard Transform Kernel" (Agarwal et al., 2024)
  • "Hadamard Coded Modulation: An Alternative to OFDM for Optical Wireless Communications" (Noshad et al., 2014)
  • "Faster Walsh-Hadamard and Discrete Fourier Transforms From Matrix Non-Rigidity" (Alman et al., 2022)
  • "Faster Walsh-Hadamard Transform and Matrix Multiplication over Finite Fields using Lookup Tables" (Alman, 2022)
  • "A Fast Hadamard Transform for Signals with Sub-linear Sparsity in the Transform Domain" (Scheibler et al., 2013)
