All-Prefix-Sum Algorithms

Updated 20 November 2025

All-prefix-sum algorithms compute the running partial aggregates of a sequence under an associative binary operation and serve as a core primitive in parallel computing.
They underpin diverse applications such as high-performance databases, AI accelerators, and dynamic programming, driving both theoretical analysis and practical optimizations.
Research explores variants including SIMD, GPU, and distributed methods, emphasizing hardware mapping, asymptotic optimality, and empirical cost trade-offs.

All-prefix-sum algorithms, also known as parallel scan algorithms, compute the sequence of partial aggregates (sums or, more generally, results of any associative binary operation) of an input array or distributed collection. These algorithms are central to parallel programming and underpin a wide range of primitives in high-performance computing, databases, and AI accelerators. Recent literature systematically investigates their algorithmic structure, asymptotic optimality, hardware mapping, and practical performance on modern CPUs, GPUs, accelerators, and distributed systems (Zhang et al., 2023, Särkkä et al., 13 Nov 2025, Pibiri et al., 2020, Wróblewski et al., 21 May 2025, Träff, 7 Jul 2025, Harrison et al., 2024). The following sections proceed from sequential and static structures, through shared-memory and SIMD methods, to GPU and message-passing/distributed environments.

1. Formal Definition and Theoretical Foundations

Let $x[0\ldots n-1]$ be an array and $\oplus$ an associative operation. The all-prefix-sum (“scan”) problem is to compute:

$y[i] = x[0] \oplus x[1] \oplus \ldots \oplus x[i], \quad 0 \le i < n$

for inclusive scan, or $y[i] = x[0] \oplus \ldots \oplus x[i-1]$ for exclusive scan.
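The two definitions can be made concrete with a minimal sequential sketch. The function names are illustrative, and the explicit `identity` argument is an assumption: the definition above leaves $y[0]$ of the exclusive scan as an empty aggregate, which in practice is usually taken to be the identity element of $\oplus$.

```python
from operator import add
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def inclusive_scan(xs: Sequence[T], op: Callable[[T, T], T]) -> List[T]:
    """y[i] = xs[0] op xs[1] op ... op xs[i]."""
    out: List[T] = []
    for x in xs:
        # The first element is copied; every later one extends the previous prefix.
        out.append(x if not out else op(out[-1], x))
    return out


def exclusive_scan(xs: Sequence[T], op: Callable[[T, T], T], identity: T) -> List[T]:
    """y[i] = xs[0] op ... op xs[i-1], with y[0] = identity (an assumption)."""
    out: List[T] = []
    acc = identity
    for x in xs:
        out.append(acc)      # emit the aggregate of everything before x
        acc = op(acc, x)     # then fold x into the running aggregate
    return out


print(inclusive_scan([3, 1, 4, 1, 5], add))     # [3, 4, 8, 9, 14]
print(exclusive_scan([3, 1, 4, 1, 5], add, 0))  # [0, 3, 4, 8, 9]
```

Only associativity of `op` is assumed; operands are always combined in left-to-right order, so the sketch also works for non-commutative operators such as string concatenation.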

The information-theoretic lower bound for parallel prefix sum on $p$ processors is $\lceil \log_2 p \rceil$ communication rounds in the one-ported message-passing model (Träff, 7 Jul 2025).
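A short counting argument indicates where this bound comes from. It is given here as a sketch for the inclusive case with one input per processor, not as the cited paper's proof, and the symbol $I_t$ is introduced only for this illustration. Let $I_t$ be the largest number of distinct inputs that any single processor's state can depend on after $t$ rounds. In the one-ported model a processor receives at most one message per round, so its knowledge can at most double:

$I_0 = 1, \qquad I_{t+1} \le 2\, I_t \quad\Longrightarrow\quad I_t \le 2^t$

The last processor must output an aggregate that depends on all $p$ inputs, which requires $2^t \ge p$ and hence $t \ge \lceil \log_2 p \rceil$ rounds.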

Prefix-sum is a key primitive for:

Temporal and spatial parallelization in signal processing, dynamic programming, Kalman filters, and smoothers (Särkkä et al., 13 Nov 2025).
Database primitives such as sorting, splitting, compaction, filtering, and top-$k$ sampling (Zhang et al., 2023, Wróblewski et al., 21 May 2025).
The design of data structures such as Fenwick trees, segment trees, and their high-branching variants (Pibiri et al., 2020, Harrison et al., 2024).
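As an illustration of the filter/compact use case listed above, the following minimal sketch uses an exclusive scan over 0/1 flags to compute, for every kept element, its position in the output. The function name `compact` and its signature are hypothetical; the scatter loop is written sequentially here, but each iteration is independent once the positions are known, which is what makes the pattern attractive for parallel execution.

```python
from itertools import accumulate
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def compact(xs: Sequence[T], keep: Callable[[T], bool]) -> List[T]:
    """Keep the elements satisfying `keep`, preserving their order."""
    if not xs:
        return []
    flags = [1 if keep(x) else 0 for x in xs]
    inclusive = list(accumulate(flags))   # inclusive scan of the flags
    positions = [0] + inclusive[:-1]      # shifted by one: exclusive scan
    total = inclusive[-1]                 # number of kept elements
    out: List[T] = [xs[0]] * total        # placeholder values, all overwritten below
    for i, x in enumerate(xs):
        if flags[i]:
            out[positions[i]] = x         # independent writes to distinct slots
    return out


print(compact([5, 2, 8, 1, 9, 3], lambda v: v > 4))  # [5, 8, 9]
```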

2. Sequential and Data Structure-Based Solutions

Classic data structures supporting dynamic prefix sums with updates include:

Structure	Space	Query/Update Time	Notable Features
Fenwick Tree	$\Theta(n)$	$O(\log_2 n)$	Minimal space, bit-level ops
Sierpinski Tree	$\Theta(n)$	$O(\log_3 n)$	Ternary branching, tight to the lower bound for Fenwick-type structures, consistent with the quantum lower bound (Harrison et al., 2024)
$b$-ary Segment Tree	$\Theta(n)$	$O(\log_b n)$	Highly vectorizable, optimal for wider SIMD (Pibiri et al., 2020)

Segment trees and Fenwick trees are practical for sustained queries and updates. The $b$-ary segment tree, for appropriate $b$, is empirically the fastest structure for all-prefix-sum on CPUs with advanced SIMD and deep cache hierarchies. The Sierpinski tree achieves the theoretically optimum logarithmic base for Fenwick-type structures, with $O(\log_3 n)$ query and update (Harrison et al., 2024).
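For reference, a textbook Fenwick tree over integer addition can be sketched as follows. This is the classic binary variant, not the Sierpinski or $b$-ary structures from the cited papers; the class and method names are illustrative.

```python
from typing import List


class FenwickTree:
    """Binary indexed tree supporting point updates and prefix-sum queries."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.tree: List[int] = [0] * (n + 1)  # 1-based internal indexing

    def update(self, i: int, delta: int) -> None:
        """Add `delta` to element i (0-based)."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i   # jump to the next node whose range covers i

    def prefix_sum(self, i: int) -> int:
        """Return x[0] + ... + x[i] (0-based, inclusive)."""
        i += 1
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i   # strip the lowest set bit
        return total


ft = FenwickTree(5)
for idx, value in enumerate([3, 1, 4, 1, 5]):
    ft.update(idx, value)
print([ft.prefix_sum(i) for i in range(5)])  # [3, 4, 8, 9, 14]
```

Both loops visit at most one node per bit of the index, which gives the $O(\log_2 n)$ bound in the table while storing only $n + 1$ integers.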

3. Parallel and SIMD Shared-Memory Prefix Sum Algorithms

Shared-memory and SIMD scan algorithms operate in a data-parallel fashion, optimizing for in-core parallelism and cache locality. The main algorithms and their characteristics are:

Algorithm	Work Complexity	Span	Memory Access Pattern	Hardware Context
Horizontal (In-Register) SIMD	[…]	[…]	Contiguous, single-pass	CPUs with AVX-512, best per-core throughput (Zhang et al., 2023)
Vertical (Lane-Parallel) SIMD	[…]	[…]	Gather/scatter, two passes	CPUs with strong gather units
Tree/Blelloch SIMD	$O(n)$ gather/scatter	$O(\log n)$	Poor locality, strided	Theoretically span-optimal but high memory traffic (Zhang et al., 2023)
Multithreaded Two-Pass + Cache Partition	[…]	[…]	Partitioned, L2-confined	Multicore CPUs, bandwidth-limited (Zhang et al., 2023)

The horizontal SIMD method processes blocks in register using shift+add trees (Hillis–Steele style), best for small, per-core workloads. Vertical SIMD and balanced-tree variants are suited for architectures with efficient scatter/gather but can be bottlenecked by memory bandwidth. Cache-partitioned two-pass scans minimize RAM traffic by partitioning data into cache-sized tiles, essential at scale.
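The tiled two-pass idea can be sketched in plain Python. This is one common variant (local scan per tile, then an offset pass), written sequentially for clarity and not taken from the cited paper; the function name and the `tile` parameter are illustrative. In a real implementation the tiles would be distributed over threads and `tile` would be chosen so that a tile fits in cache.

```python
from typing import List, Sequence


def two_pass_scan(xs: Sequence[int], tile: int) -> List[int]:
    """Inclusive prefix sum computed tile by tile in two passes."""
    if tile <= 0:
        raise ValueError("tile must be positive")
    out = list(xs)
    n = len(out)
    starts = range(0, n, tile)

    # Pass 1: each tile is scanned independently (parallel across tiles in practice).
    tile_totals: List[int] = []
    for s in starts:
        e = min(s + tile, n)
        for i in range(s + 1, e):
            out[i] += out[i - 1]
        tile_totals.append(out[e - 1])

    # Exclusive scan over the per-tile totals gives each tile's starting offset.
    offsets: List[int] = []
    running = 0
    for t in tile_totals:
        offsets.append(running)
        running += t

    # Pass 2: every tile adds its offset, again independently of other tiles.
    for k, s in enumerate(starts):
        e = min(s + tile, n)
        for i in range(s, e):
            out[i] += offsets[k]
    return out


print(two_pass_scan(list(range(1, 11)), 4))
# [1, 3, 6, 10, 15, 21, 28, 36, 45, 55]
```

The only sequential dependency between tiles is the short scan over `tile_totals`, whose length is the number of tiles rather than the number of elements.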

4. GPU and Accelerator-Based Parallel Scan Algorithms

On large-scale GPUs and specialized accelerators, all-prefix-sum methods exploit massive parallelism and often leverage unique hardware units:

Hillis–Steele: Baseline method, $O(n \log n)$ work, $O(\log n)$ depth, competitive only for small $n$ due to high per-step overhead (Särkkä et al., 13 Nov 2025).
Blelloch Up-sweep/Down-sweep: Work-optimal $O(n)$, $O(\log n)$ depth, widely used in frameworks (JAX, TensorFlow), requires double-buffering (Särkkä et al., 13 Nov 2025).
Ladner–Fischer (In-place): Work-optimal and memory-efficient, best observed single-GPU performance, no extra buffers needed (Särkkä et al., 13 Nov 2025).
Sengupta Hybrid: Block-size tunable, combines tree-reduce and intra-block scans, facilitates occupancy tuning on GPUs, default for many block-based frameworks (Särkkä et al., 13 Nov 2025).
Matrix-Engine Scan (AI accelerators): Expresses the scan as matrix multiplications (tile as […]), e.g., ScanU and ScanUL1, using cube/tensor units to accelerate scan dramatically versus vector-only methods. Up to […] faster for large $n$ (Wróblewski et al., 21 May 2025).
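The first two entries in the list above can be contrasted with a small sequential simulation over integer addition. Each `while` iteration corresponds to one parallel step, and the inner loop or comprehension stands in for work that a GPU would perform across threads. The function names are illustrative, and zero-padding to a power of two in the second function is a simplification chosen for this sketch.

```python
from typing import List, Sequence


def hillis_steele_scan(xs: Sequence[int]) -> List[int]:
    """Inclusive scan in ceil(log2 n) steps, using O(n log n) additions."""
    cur = list(xs)
    n = len(cur)
    d = 1
    while d < n:
        # Every element i >= d adds the value d positions to its left.
        # A fresh list is built so that all reads see the previous step.
        cur = [cur[i] + cur[i - d] if i >= d else cur[i] for i in range(n)]
        d *= 2
    return cur


def blelloch_scan(xs: Sequence[int]) -> List[int]:
    """Exclusive scan via up-sweep and down-sweep, using O(n) additions."""
    n = len(xs)
    if n == 0:
        return []
    size = 1
    while size < n:
        size *= 2
    a = list(xs) + [0] * (size - n)   # pad with the additive identity

    # Up-sweep: build partial sums at the internal nodes of a balanced tree.
    d = 1
    while d < size:
        for i in range(2 * d - 1, size, 2 * d):
            a[i] += a[i - d]
        d *= 2

    # Down-sweep: push prefixes from the root back to the leaves.
    a[size - 1] = 0
    d = size // 2
    while d >= 1:
        for i in range(2 * d - 1, size, 2 * d):
            left = a[i - d]
            a[i - d] = a[i]
            a[i] += left
        d //= 2
    return a[:n]


print(hillis_steele_scan([3, 1, 4, 1, 5]))  # [3, 4, 8, 9, 14]
print(blelloch_scan([3, 1, 4, 1, 5]))       # [0, 3, 4, 8, 9]
```

Note that the first function returns an inclusive scan and the second an exclusive one; the inclusive result can be recovered from the exclusive one by adding the input elementwise.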

On multi-GPU systems, two-filter smoothers (parallel-in-time methods for Kalman smoothers) demonstrate that concurrent forward and backward scans can fully utilize hardware, outperforming standard methods by up to […] (Särkkä et al., 13 Nov 2025).

5. Distributed and Message-Passing (MPI) Prefix Sum Algorithms

Distributed prefix sum, especially via MPI, must minimize communication rounds and processor-local reductions. The primary algorithms are:

Class	Rounds	Local $\oplus$ Ops	Remarks
Inclusive Doubling	$\lceil \log_2 p \rceil$	[…]	Optimal for inclusive scan (Träff, 7 Jul 2025)
Shift-Based Exscan	[…]	[…]	Simple but sub-optimal round count
Two-$\oplus$ Doubling	[…]	[…]	Short rounds, double local ops
123-Doubling Exscan (new)	[…]	[…]	Achieves (almost) theoretical round minimum, fewest $\oplus$ applications (Träff, 7 Jul 2025)

Empirical MPI experiments show that the 123-doubling algorithm delivers a 25–30% performance improvement over standard MPI_Exscan for small vectors and expensive reductions, attaining nearly the lower bound in practice. For large vectors, pipelined tree scans with more rounds and smaller messages become necessary for bandwidth-limited settings.
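The communication pattern of a doubling inclusive scan can be simulated in a single process. The sketch below is not MPI code and not the 123-doubling algorithm; it models $p$ ranks as list entries, with the name `doubling_inclusive_scan` and the returned round counter being illustrative choices. In each round, rank $r$ sends its current partial result to rank $r + d$ and receives from rank $r - d$, so every rank takes part in at most one send and one receive per round.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def doubling_inclusive_scan(
    values: Sequence[T], op: Callable[[T, T], T]
) -> Tuple[List[T], int]:
    """Simulate an inclusive scan over p ranks; return (results, rounds)."""
    p = len(values)
    result = list(values)
    rounds = 0
    dist = 1
    while dist < p:
        # Snapshot: every message in this round carries the pre-round value.
        sent = list(result)
        for r in range(dist, p):
            # The received partial covers ranks to the left of r's own partial,
            # so it is the left operand (this matters for non-commutative op).
            result[r] = op(sent[r - dist], result[r])
        dist *= 2
        rounds += 1
    return result, rounds


res, rounds = doubling_inclusive_scan(["a", "b", "c", "d", "e"], lambda x, y: x + y)
print(res)     # ['a', 'ab', 'abc', 'abcd', 'abcde']
print(rounds)  # 3
```

For $p = 5$ the simulation uses 3 rounds, matching $\lceil \log_2 5 \rceil$, and each rank applies the operator at most once per round.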

6. Cost Analysis, Practical Trade-offs and Tuning

Asymptotics and practical performance diverge due to constant-factor effects, hardware, and workload patterns:

Memory hierarchy: Cache-partitioned, vectorized, and highly branched methods (e.g., $b$-ary segment trees with large $b$) excel as $n$ grows and fit SIMD widths (Pibiri et al., 2020, Zhang et al., 2023).
SIMD and vectorization: SIMD-enhanced trees reduce operational latency proportionally to vector width; truncated and hybrid structures minimize cache conflicts and branch mispredictions.
Communication rounds: Lower bound is fundamental in message-passing but computation cost dominates for large reduction operators (Träff, 7 Jul 2025, Särkkä et al., 13 Nov 2025).
Matrix-based scans: On accelerators (Ascend, TPU, NVIDIA Tensor Cores) blockwise mat-muls amortize per-element scan cost by streamlining memory fetch and operator throughput (Wróblewski et al., 21 May 2025).
Data structure selection: For read-heavy workloads, $b$-ary trees with large $b$ dominate; for dynamic, memory-tight applications, Fenwick and Sierpinski trees are preferred.
Quantum lower bound: The Sierpinski tree achieves the tight theoretical bound for Fenwick-type structures, $O(\log_3 n)$ update/query (Harrison et al., 2024).
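The matrix-based scans mentioned in the list above rest on a simple identity: multiplying a row vector by an upper-triangular all-ones matrix yields its inclusive prefix sums. The following pure-Python sketch illustrates that identity on tiles of width `s`. It is not the ScanU or ScanUL1 algorithm from the cited paper; the helper names, the parameter `s`, and the sequential carry loop are assumptions made for illustration.

```python
from typing import List, Sequence


def upper_triangular_ones(s: int) -> List[List[int]]:
    """U[i][j] = 1 if i <= j else 0."""
    return [[1 if i <= j else 0 for j in range(s)] for i in range(s)]


def matmul(a: List[List[int]], b: List[List[int]]) -> List[List[int]]:
    """Plain dense matrix product; stands in for a hardware matrix engine."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def matrix_tile_scan(xs: Sequence[int], s: int) -> List[int]:
    """Inclusive prefix sum using one matmul for all row-local scans."""
    if s <= 0:
        raise ValueError("s must be positive")
    n = len(xs)
    padded = list(xs) + [0] * (-n % s)               # pad to a multiple of s
    rows = [padded[i:i + s] for i in range(0, len(padded), s)]
    local = matmul(rows, upper_triangular_ones(s))   # (row @ U)[j] = sum of row[0..j]
    out: List[int] = []
    carry = 0
    for row in local:
        out.extend(v + carry for v in row)           # add the total of earlier rows
        carry += row[-1]
    return out[:n]


print(matrix_tile_scan(list(range(1, 11)), 4))
# [1, 3, 6, 10, 15, 21, 28, 36, 45, 55]
```

The matmul performs more arithmetic than a direct scan, which is why the approach pays off only where a dedicated matrix unit makes those extra operations cheap.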

Practical guidance converges on matching algorithm structure to architectural characteristics and input size. For small arrays or short scans, simpler algorithms with minimal overhead are competitive. For large-scale, memory-bound, or bandwidth-saturated scenarios, partitioned, vectorized, and accelerator-optimized methods yield the highest sustained throughput.

7. Extensions, Optimality, and Future Directions

Recent work establishes near-optimality in both asymptotic and practical senses:

The Sierpinski tree achieves the optimal “weight” for dynamic scan structures per the quantum Pauli-weight lower bound (Harrison et al., 2024).
On AI accelerators, matrix-based scan methods generalize to other platforms with tensor-matrix units (Wróblewski et al., 21 May 2025).
For distributed-memory and heterogeneous systems, hierarchical or cross-chip scan methods are necessary for scaling to billions of elements.

Open research directions include:

Automated parameter tuning for $b$ in $b$-ary trees, tile/block size (e.g., the tile size in matrix-scan), and optimal cache partition thresholds (Pibiri et al., 2020, Zhang et al., 2023, Wróblewski et al., 21 May 2025).
Asynchronous and pipelined scan algorithms to mitigate global barriers and idle time in accelerator- and MPI-based settings (Wróblewski et al., 21 May 2025, Träff, 7 Jul 2025).
Extension of scan primitives to non-commutative operators, segmented and hierarchical scans, and quantum-compatible data structures.
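As a small illustration of an associative but non-commutative operator, the sketch below evaluates the scalar recurrence $x_t = a_t x_{t-1} + b_t$ as a scan over affine maps. Each pair `(a, b)` represents the map $x \mapsto a x + b$, and the operator is function composition. The names `compose` and `linear_recurrence` are hypothetical, and the scalar setting is a deliberate simplification; parallel-in-time filtering methods use richer elements than these pairs.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple

Affine = Tuple[float, float]  # (a, b) represents the map x -> a * x + b


def compose(f: Affine, g: Affine) -> Affine:
    """Apply f first, then g: x -> g(f(x)). Associative, not commutative."""
    a1, b1 = f
    a2, b2 = g
    return (a2 * a1, a2 * b1 + b2)


def linear_recurrence(steps: Sequence[Affine], x0: float) -> List[float]:
    """Return [x_1, ..., x_T] for x_t = a_t * x_{t-1} + b_t."""
    # The inclusive scan yields, for every t, the composed map taking x0 to x_t.
    # accumulate is sequential; any parallel scan could replace it because
    # compose is associative.
    prefixes = accumulate(steps, compose)
    return [a * x0 + b for a, b in prefixes]


print(linear_recurrence([(2, 1), (3, 0), (1, 5)], 1))  # [3, 9, 14]
print(compose((2, 1), (3, 0)), compose((3, 0), (2, 1)))  # (6, 3) (6, 1)
```

The second line of output shows that swapping the operands changes the result, so a scan implementation used with this operator must preserve operand order.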

All-prefix-sum (scan) remains an intensively studied and rapidly evolving primitive, with ongoing advances driven by both algorithmic insight and architectural innovation.