Fast Walsh–Hadamard Transform (FWHT)

Updated 28 January 2026
  • FWHT is a divide-and-conquer algorithm that computes the Walsh–Hadamard transform using solely addition and subtraction, achieving O(N log N) complexity.
  • It employs in-place butterfly operations and exploits memory locality and parallelism for efficiency on modern CPU and GPU architectures.
  • FWHT underpins diverse applications in digital communications, quantum information processing, and compressed matrix multiplication with significant speedups.

The Fast Walsh–Hadamard Transform (FWHT) is an in-place, divide-and-conquer algorithm for multiplying an input vector by the Walsh–Hadamard matrix $H_N$, where $N$ is a power of two. The algorithm’s simplicity, reliance only on addition/subtraction, and optimal $O(N\log N)$ arithmetic complexity have made it a staple in domains ranging from digital communications and compressed linear algebra to quantum information processing.

1. Mathematical Definition and Structure

Let $N = 2^n$. The (unnormalized) Walsh–Hadamard matrix $H_N \in \{\pm1\}^{N\times N}$ is defined recursively as:

  • $H_1 = [1]$
  • $H_{2M} = \begin{pmatrix} H_M & H_M \\ H_M & -H_M \end{pmatrix}$ for $M = 1, 2, 4, \dots, N/2$

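Given $x\in \mathbb{R}^N$, the FWHT computes the transform $y = H_N x$, i.e., $y_i = \sum_{j=0}^{N-1} (H_N)_{i,j}x_j$. Equivalently, its entries satisfy $(H_N)_{i,j} = (-1)^{\langle i, j\rangle}$, where $\langle i, j\rangle$ is the bitwise dot product modulo 2 of the binary expansions of $i$ and $j$. The normalized form, $\tilde{H}_N = H_N/\sqrt{N}$, renders the transform orthonormal: $\tilde{H}_N \tilde{H}_N^{\top} = I_N$, so $\tilde{H}_N^{-1} = \tilde{H}_N$ (Andersson et al., 14 Jan 2026, Huang et al., 31 Dec 2025).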
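As a small check of the definition, the following Python sketch builds the matrix with the recursive block rule and compares every entry against the sign rule $(-1)^{\langle i,j\rangle}$, where $\langle i,j\rangle$ is the parity of the bitwise AND of $i$ and $j$. The helper names `hadamard` and `sign_entry` are illustrative and not taken from any cited paper; the dense construction needs $O(N^2)$ memory and is meant only for verification at small sizes.

```python
def hadamard(n_bits):
    """Unnormalized Walsh-Hadamard matrix of size 2**n_bits, as nested lists."""
    H = [[1]]
    for _ in range(n_bits):
        top = [row + row for row in H]                    # [ H  H ]
        bottom = [row + [-v for v in row] for row in H]   # [ H -H ]
        H = top + bottom
    return H


def sign_entry(i, j):
    """(-1)**<i, j>, with <i, j> the parity of the bitwise AND of i and j."""
    return -1 if bin(i & j).count("1") % 2 else 1


n_bits = 3
N = 1 << n_bits
H = hadamard(n_bits)

# The recursive definition agrees with the closed-form sign rule.
assert all(H[i][j] == sign_entry(i, j) for i in range(N) for j in range(N))

# Rows are mutually orthogonal: H H^T = N I, so H / sqrt(N) is orthonormal.
gram = [[sum(H[i][k] * H[j][k] for k in range(N)) for j in range(N)]
        for i in range(N)]
assert gram == [[N if i == j else 0 for j in range(N)] for i in range(N)]
```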

2. Core Algorithms and Fast Implementations

Standard In-Place “Butterfly” FWHT

The classical FWHT, also known as Yates’ algorithm, employs butterfly operations:

  • For each stage $s$ (from $0$ to $n-1$), pair elements whose indices differ only at bit $s$.
  • For each such pair $(a, b)$, update to $(a+b,\ a-b)$.

Formally, for array $x[0..N-1]$, stage $s$ applies the following update for every index $i$ whose bit $s$ is zero:

$$(x_i,\ x_{i+2^s}) \leftarrow (x_i + x_{i+2^s},\ x_i - x_{i+2^s}).$$

This routine requires $N\log_2 N$ additions/subtractions and no multiplications apart from an optional final normalization (Huang et al., 31 Dec 2025, Noshad et al., 2014). The transform is entirely in-place, using $O(1)$ auxiliary space.
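The butterfly routine can be sketched in a few lines of Python. This is a minimal, unnormalized reference version under the assumption that the input is a mutable list whose length is a power of two; the function name `fwht_inplace` is illustrative.

```python
def fwht_inplace(x):
    """Unnormalized in-place FWHT of a list whose length is a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1  # h = 2**s is the distance between paired indices at stage s
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = x[i], x[i + h]
                x[i], x[i + h] = a + b, a - b   # butterfly: (a, b) -> (a+b, a-b)
        h *= 2
    return x


data = [1, 0, 1, 0, 0, 1, 1, 0]
fwht_inplace(data)
print(data)  # [4, 2, 0, -2, 0, 2, 0, 2]
```

Applying the routine twice returns the original vector scaled by its length, which gives a convenient self-test for any implementation.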

Improved Arithmetic and Bit Complexities

Recent work has established theoretical improvements:

  • Operation count reduction: Using a decomposition of a small Walsh–Hadamard matrix as the sum of a low-rank and a sparse matrix, it is possible to reduce the leading constant for the FWHT to $\tfrac{23}{24}N\log_2 N + O(N)$ arithmetic operations, compared to the folklore $N\log_2 N$ (Alman et al., 2022).
  • Lookup tables for bit operations: Over constant-sized fields, precomputing and using lookup tables for small blocks enables an FWHT in $O(N\log N/\log\log N)$ bit operations (Alman, 2022).

3. Practical Optimizations and Parallelization

Efficient implementations leverage modern computer architectures:

  • In-place permutation and memory locality: Each butterfly stage reads and writes pairs of contiguous blocks separated by a fixed power-of-two stride, so access patterns are regular and cache-friendly; no scatter/gather is required. For multi-dimensional applications, storing data so that axes are transformed in contiguous memory regions maximizes bandwidth (Georges et al., 2024, Huang et al., 31 Dec 2025, Andersson et al., 14 Jan 2026).
  • Multi-threaded and SIMD: The independence of different butterfly operations at each stage enables vectorization and threading. On GPUs, butterfly stages map naturally to SIMT warps/blocks. FWHT can be parallelized at both the butterfly and batch level (Agarwal et al., 2024, Huang et al., 31 Dec 2025, Andersson et al., 14 Jan 2026).
  • Hardware acceleration: Algorithms such as HadaCore exploit GPU Tensor Cores by restructuring the FWHT to map to hardware matrix-multiply-accumulate primitives, giving speedups up to […] for size-256 vectors. Larger transforms are tiled and organized via in-register transposes and shared-memory staging (Agarwal et al., 2024).
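The two levels of parallelism mentioned above can be made concrete with a short Python sketch. The helper `stage_pairs` lists the index pairs combined in one stage; because the pairs are disjoint, the butterflies of a stage can run in any order or concurrently. `fwht_batch` then transforms independent vectors through a thread pool. All names are illustrative, input lengths are assumed to be powers of two, and CPython threads illustrate only the structure here (the interpreter lock prevents a real speedup for this pure-Python loop).

```python
from concurrent.futures import ThreadPoolExecutor


def stage_pairs(n, s):
    """Index pairs (i, i + 2**s) combined at stage s of a length-n FWHT."""
    h = 1 << s
    return [(i, i + h) for i in range(n) if not i & h]


def fwht(vec):
    """Out-of-place unnormalized FWHT; len(vec) is assumed to be a power of two."""
    x = list(vec)
    s = 0
    while (1 << s) < len(x):
        # Pairs within a stage are disjoint, so this loop is order-independent.
        for i, j in stage_pairs(len(x), s):
            x[i], x[j] = x[i] + x[j], x[i] - x[j]
        s += 1
    return x


def fwht_batch(rows, workers=4):
    """Batch-level parallelism: each row is transformed independently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fwht, rows))


# Every index appears in exactly one pair of a given stage.
assert sorted(k for pair in stage_pairs(8, 1) for k in pair) == list(range(8))
print(fwht_batch([[1, 0, 1, 0, 0, 1, 1, 0], [1] * 8]))
```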

4. Statistical and Group-Theoretic Properties

The FWHT admits several crucial mathematical properties:

  • Self-inverse (up to scaling): $H_N H_N = N I_N$, so $H_N^{-1} = \tfrac{1}{N} H_N$.
  • Energy preservation: $\|H_N x\|_2^2 = N\,\|x\|_2^2$, so the normalized transform is isometric.
  • Group convolution: On $\mathbb{Z}_2^n$, the FWHT diagonalizes the XOR-convolution:

$$(f * g)(k) = \sum_{j \in \mathbb{Z}_2^n} f(j)\, g(k \oplus j)$$

which, with $\hat{f} = H_N f$, becomes $\widehat{f * g} = \hat{f} \cdot \hat{g}$ under the transform (pointwise product), enabling $O(N\log N)$ convolution for functions on Boolean cubes (Huang et al., 31 Dec 2025, Andersson et al., 14 Jan 2026).
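These properties are easy to verify numerically. The sketch below, with illustrative function names, computes an XOR-convolution through three transforms and a pointwise product, then compares it with the direct double loop. It assumes integer inputs so that the final division by the length is exact.

```python
def fwht(vec):
    """Out-of-place unnormalized FWHT; len(vec) is assumed to be a power of two."""
    x = list(vec)
    h = 1
    while h < len(x):
        for start in range(0, len(x), 2 * h):
            for i in range(start, start + h):
                x[i], x[i + h] = x[i] + x[i + h], x[i] - x[i + h]
        h *= 2
    return x


def xor_convolve(f, g):
    """XOR-convolution via the transform: H(f * g) = H(f) . H(g) pointwise."""
    n = len(f)
    prod = [a * b for a, b in zip(fwht(f), fwht(g))]
    # The inverse transform is the forward transform divided by n.
    return [v // n for v in fwht(prod)]


def xor_convolve_naive(f, g):
    """Direct O(N^2) evaluation of (f * g)(k) = sum_j f(j) g(k xor j)."""
    out = [0] * len(f)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i ^ j] += fi * gj
    return out


f, g = [1, 2, 3, 4], [5, 6, 7, 8]
assert xor_convolve(f, g) == xor_convolve_naive(f, g)

# Self-inverse up to scaling, and energy scaling by N.
assert fwht(fwht(f)) == [len(f) * v for v in f]
assert sum(v * v for v in fwht(f)) == len(f) * sum(v * v for v in f)
```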

5. Major Application Domains

Quantum Information Processing

  • Pauli decomposition: All Pauli string coefficients of an arbitrary $N \times N$ matrix $A$ (with $N = 2^n$) can be computed in $O(N^2\log N)$ time and $O(1)$ extra memory via an FWHT-based batched algorithm, which applies an in-place XOR-permutation, row-wise FWHT, and phase correction (Georges et al., 2024).
  • Stabilizer Rényi entropy: For $n$ qubits, the computation of the second-order stabilizer entropy is reduced from $O(8^n)$ to $O(n\,4^n)$ using $2^n$ FWHTs of length $2^n$, exploiting natural parallelism and in-place operations (Huang et al., 31 Dec 2025).
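The three steps of the Pauli decomposition (XOR-permutation, FWHT, phase correction) can be sketched as follows. This is an illustrative out-of-place version, not the in-place algorithm of the cited paper. It assumes the convention that a Pauli string is indexed by bit masks `(x, z)` with $P(x,z) = i^{|x \wedge z|} X^x Z^z$, so that per qubit the bit pairs (0,0), (1,0), (0,1), (1,1) denote I, X, Z, Y, and that the most significant bit corresponds to the first Kronecker factor. Under that convention the coefficient is $c(x,z) = \frac{i^{|x\wedge z|}}{N}\sum_c (-1)^{\langle z,c\rangle} A_{c,\,c\oplus x}$.

```python
PHASES = (1, 1j, -1, -1j)  # powers of the imaginary unit: i**0 .. i**3


def fwht(vec):
    """Out-of-place unnormalized FWHT; len(vec) is assumed to be a power of two."""
    x = list(vec)
    h = 1
    while h < len(x):
        for start in range(0, len(x), 2 * h):
            for i in range(start, start + h):
                x[i], x[i + h] = x[i] + x[i + h], x[i] - x[i + h]
        h *= 2
    return x


def pauli_coefficients(A):
    """Map (x, z) -> coefficient c with A = sum of c * P(x, z) over all masks."""
    N = len(A)
    coeffs = {}
    for x in range(N):
        # XOR-permutation: gather the generalized diagonal A[c][c ^ x].
        diag = [A[c][c ^ x] for c in range(N)]
        spectrum = fwht(diag)  # sums over c with signs (-1)**<z, c>
        for z in range(N):
            phase = PHASES[bin(x & z).count("1") % 4]  # one factor i per Y
            coeffs[(x, z)] = phase * spectrum[z] / N
    return coeffs


def pauli_label(x, z, n_qubits):
    """Human-readable Pauli string for masks (x, z), first factor leftmost."""
    letters = {(0, 0): "I", (1, 0): "X", (0, 1): "Z", (1, 1): "Y"}
    return "".join(letters[((x >> k) & 1, (z >> k) & 1)]
                   for k in reversed(range(n_qubits)))


Y = [[0, -1j], [1j, 0]]
c = pauli_coefficients(Y)
assert abs(c[(1, 1)] - 1) < 1e-12 and pauli_label(1, 1, 1) == "Y"
```

The loop performs $N$ transforms of length $N$, which matches the $O(N^2\log N)$ cost stated above; the extra `diag` buffer is a simplification that an in-place variant would avoid.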

Compressed Matrix Computation

  • Compressed multiplication via sketching: Pagh’s algorithm for compressed matrix multiplication can substitute FFT with FWHT, preserving unbiasedness and variance bounds. In practice, the FWHT-based variant is up to […] faster than FFT-based sketching and can yield […] speedups over dense DGEMM (Intel MKL) when the product matrix is output-sparse (Andersson et al., 14 Jan 2026).
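The substitution of FWHT for FFT can be illustrated with a simplified sketch. It assumes the bucket of entry $(i, j)$ is `h1[i] ^ h2[j]`, so the sketch of each outer product is an XOR-convolution of two count sketches and can be accumulated in the transform domain. The function names, the use of Python's `random` module as a stand-in for pairwise-independent hash families, and the single-sketch estimator are simplifying assumptions, not the cited implementation; the bucket count `b` must be a power of two.

```python
import random


def fwht(vec):
    """Out-of-place unnormalized FWHT; len(vec) is assumed to be a power of two."""
    x = list(vec)
    h = 1
    while h < len(x):
        for start in range(0, len(x), 2 * h):
            for i in range(start, start + h):
                x[i], x[i + h] = x[i] + x[i + h], x[i] - x[i + h]
        h *= 2
    return x


def count_sketch(vec, hashes, signs, b):
    """Signed bucket sums of vec using the given hash and sign tables."""
    out = [0] * b
    for idx, v in enumerate(vec):
        out[hashes[idx]] += signs[idx] * v
    return out


def compressed_product_sketch(A, B, b, seed=0):
    """Sketch of A @ B in b buckets without forming the product."""
    rng = random.Random(seed)
    n_rows, inner, n_cols = len(A), len(B), len(B[0])
    h1 = [rng.randrange(b) for _ in range(n_rows)]
    h2 = [rng.randrange(b) for _ in range(n_cols)]
    s1 = [rng.choice((-1, 1)) for _ in range(n_rows)]
    s2 = [rng.choice((-1, 1)) for _ in range(n_cols)]
    acc = [0] * b
    for k in range(inner):  # A @ B = sum over k of (column k of A)(row k of B)
        pa = fwht(count_sketch([A[i][k] for i in range(n_rows)], h1, s1, b))
        pb = fwht(count_sketch(B[k], h2, s2, b))
        for t in range(b):
            acc[t] += pa[t] * pb[t]  # pointwise product = XOR-convolution
    sketch = [v / b for v in fwht(acc)]  # one inverse transform at the end
    return sketch, (h1, h2, s1, s2)


def estimate_entry(sketch, params, i, j):
    """Estimate of (A @ B)[i][j]; exact only if no other entry shares the bucket."""
    h1, h2, s1, s2 = params
    return s1[i] * s2[j] * sketch[h1[i] ^ h2[j]]
```

In practice several independent sketches are combined (for example by taking a median) to control the error from bucket collisions.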

Digital Communications

  • Hadamard coded modulation (HCM): The FWHT enables efficient mapping of data onto Hadamard codewords, resulting in transmission schemes with low peak-to-average-power ratios (PAPR), outperforming OFDM in high-power and nonlinear regimes. Interleaving mitigates inter-symbol interference (ISI) in dispersive channels. HCM achieves PAPR […], whereas OFDM’s PAPR grows with […] (Noshad et al., 2014).

Sparse Signal Processing

  • Sparse FWHT (SparseFHT): For $K$-sparse signals in the Hadamard domain, a graph-code-inspired algorithm computes the WHT using $O(K\log_2 K\log_2(N/K))$ time and $O(K\log_2(N/K))$ samples, via subsampling with random hashing and a belief-propagation (peeling) decoder. This is strictly sub-linear in $N$ for $K = O(N^{\alpha})$ with $0 < \alpha < 1$ (Scheibler et al., 2013).

6. Algorithmic Extensions and Specialized Methods

Table: Notable FWHT Variants and Theoretical Innovations

| Paper | Main Innovation | Complexity Improvement |
|---|---|---|
| (Alman et al., 2022) | Low-rank + sparse decomposition for FWHT | $\tfrac{23}{24}N\log_2 N + O(N)$ ops |
| (Alman, 2022) | Lookup tables for finite-field-valued WHT | $O(N\log N/\log\log N)$ bits |
| (Andersson et al., 14 Jan 2026) | FWHT for compressed matrix sketching | […] speedup over FFT |
| (Noshad et al., 2014) | FWHT in HCM for PAPR reduction in communications | PAPR […] vs. OFDM |
| (Scheibler et al., 2013) | SparseFHT for $K$-sparse signals | $O(K\log_2 K\log_2(N/K))$ |

Algorithmic improvements include blockwise recursion with higher-radix kernels, leveraging matrix non-rigidity, tailored bit-block algorithms for finite fields, and the use of hardware-specific kernels (e.g., Tensor Core blocked FWHTs) (Agarwal et al., 2024, Alman et al., 2022, Alman, 2022).

7. Numerical Benchmarks and Implementation Comparisons

Empirical comparisons on CPUs (AMD EPYC, 64-core nodes) and GPUs (NVIDIA A100/H100) reveal that:

  • FWHT-based implementations outperform FFT for integer-valued and real transforms where only $\pm1$ (add/subtract) arithmetic is needed.
  • Blocked, in-place, and hardware-optimized FWHT kernels yield […] gains over previous libraries (e.g., Dao fast-hadamard-transform), with minimal numerical error even in low-precision FP16/BF16 (Agarwal et al., 2024, Andersson et al., 14 Jan 2026).
  • In Pauli decomposition for […] qubits, the FWHT method is […] to […] faster than previous tensorized and C++ FWHT implementations (Georges et al., 2024).

References

  • "Pauli Decomposition via the Fast Walsh–Hadamard Transform" (Georges et al., 2024)
  • "Engineering Compressed Matrix Multiplication with the Fast Walsh–Hadamard Transform" (Andersson et al., 14 Jan 2026)
  • "A fast and exact algorithm for stabilizer Rényi entropy via XOR-FWHT" (Huang et al., 31 Dec 2025)
  • "HadaCore: Tensor Core Accelerated Hadamard Transform Kernel" (Agarwal et al., 2024)
  • "Hadamard Coded Modulation: An Alternative to OFDM for Optical Wireless Communications" (Noshad et al., 2014)
  • "Faster Walsh-Hadamard and Discrete Fourier Transforms From Matrix Non-Rigidity" (Alman et al., 2022)
  • "Faster Walsh-Hadamard Transform and Matrix Multiplication over Finite Fields using Lookup Tables" (Alman, 2022)
  • "A Fast Hadamard Transform for Signals with Sub-linear Sparsity in the Transform Domain" (Scheibler et al., 2013)
