Parallel kd-trees (Pkd-tree)
- Parallel kd-trees (Pkd-tree) are multidimensional data structures designed for efficient parallel construction, dynamic batch updates, and spatial queries on modern architectures.
- They employ advanced methods like multi-level sampling, cache-aware layouts, and GPU-optimized builds to achieve O(n log n) work and O(log² n) span in construction and updates.
- Empirical benchmarks demonstrate significant speedups—up to 35–59× on multicore systems—with reduced cache complexity and memory footprints compared to sequential kd-trees.
A parallel kd-tree (Pkd-tree) is a data structure and algorithmic framework for multidimensional spatial data that enables efficient construction, batch-dynamic updates, and search queries (including k-nearest neighbor, range, and counting) on multicore, manycore, and distributed architectures. Compared to sequential kd-trees, Pkd-trees achieve strong theoretical and empirical bounds on work, parallel span, and cache complexity, scaling to billions of points and exploiting both thread- and data-parallelism (Men et al., 2024). This article surveys core methods and advancements in Pkd-tree algorithms, including batch-dynamic balancing, cache-aware and cache-oblivious layouts, reconstruction-based updates, and hybridization with Morton/Z-ordering. Principal contributions span: single-tree fully-parallel architectures, multi-tree log-structured batch-dynamic models, GPU-optimized left-balanced level-order builds, and distributed memory scalable constructions.
1. Theoretical Foundations, Notation, and Analytical Models
A kd-tree on $n$ points in $d$ dimensions is a binary tree recursively splitting points along axis-aligned hyperplanes: each interior node stores a splitter that divides its points into left ($L$) and right ($R$) subsets along a chosen splitting dimension. A Pkd-tree generalizes this structure for parallel execution and memory hierarchies. Key analytical metrics:
- Work $W$: Total number of operations executed.
- Span $S$: Length of the parallel critical path (longest dependency chain).
- Cache complexity $Q$: Number of cache lines transferred, under an ideal cache of size $M$ with line size $B$.
- $\alpha$-weight balancing: For an internal node with $n$ points, each child contains at most $\alpha n$ points, for a fixed $\alpha \in [1/2, 1)$.
In all state-of-the-art Pkd-tree methods, the goal is to achieve $O(n \log n)$ work and $O(\log^2 n)$ span for construction and batch updates, with sorting-like cache complexity $O((n/B)\log_{M/B} n)$ in single- or multi-threaded settings (Men et al., 2024, Yesantharao et al., 2021).
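As a concrete illustration of the weight-balance invariant, the following sketch (a hypothetical helper, not from any cited implementation) checks whether a node's two subtree sizes satisfy the $\alpha$-bound:

```python
def is_alpha_balanced(left_size: int, right_size: int, alpha: float = 0.7) -> bool:
    """True iff neither child holds more than an alpha fraction of the points.

    alpha = 0.7 is an illustrative choice; implementations tune it to trade
    rebuild frequency against query depth.
    """
    total = left_size + right_size
    return total == 0 or max(left_size, right_size) <= alpha * total
```

A node failing this predicate is exactly the trigger for the reconstruction-based rebalancing discussed in Section 3.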
2. Parallel Construction Techniques
Multi-Level Sampling and Sieve-Based Build
Pkd-tree construction proceeds by building a top-$\lambda$-level "tree skeleton" via sampling, partitioning the input points into $2^{\lambda}$ buckets in a single parallel scan and prefix-sum pass, then recursively building subtrees in parallel. This approach achieves $O(n \log n)$ parallel work, $O(\log^2 n)$ span, and optimal cache complexity (Men et al., 2024). Under this regime:
- The top skeleton is constructed from a small random sample of the points, recursively splitting on the dimension of maximal spread.
- Filtering (sieving) scatters all points into contiguous bucket slices via a parallel histogram and prefix sum (column-major order).
- Base-case leaves switch to cache-oblivious or cache-aware sequential builds at threshold size.
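A sequential Python sketch can make the two phases concrete. All names are illustrative; the real algorithm builds the skeleton and routes points in parallel, replacing the routing loop with a histogram and prefix sum:

```python
import random

def max_spread_dim(pts):
    # dimension with the largest coordinate spread in `pts`
    d = len(pts[0])
    return max(range(d), key=lambda i: max(p[i] for p in pts) - min(p[i] for p in pts))

def build_skeleton(sample, levels):
    # (dim, split, left_child, right_child) tree of depth `levels`,
    # chosen from the sample only (the "tree skeleton")
    if levels == 0:
        return None
    dim = max_spread_dim(sample)
    split = sorted(p[dim] for p in sample)[len(sample) // 2]
    lo = [p for p in sample if p[dim] < split] or sample
    hi = [p for p in sample if p[dim] >= split] or sample
    return (dim, split, build_skeleton(lo, levels - 1), build_skeleton(hi, levels - 1))

def sieve(points, skeleton, levels):
    # route every point to one of 2**levels contiguous buckets; the
    # parallel version replaces this loop with a histogram + prefix sum
    buckets = [[] for _ in range(2 ** levels)]
    for p in points:
        node, idx = skeleton, 0
        for _ in range(levels):
            dim, split, left, right = node
            if p[dim] >= split:
                idx, node = idx * 2 + 1, right
            else:
                idx, node = idx * 2, left
        buckets[idx].append(p)
    return buckets
```

Each bucket then becomes the input of an independent recursive subtree build, which is where the bulk of the parallelism comes from.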
In GPU environments, left-balanced complete kd-trees can be constructed in-place using $O(\log n)$ iterations, each performing a parallel sort (by a per-point "tag" and split coordinate) followed by an update kernel, with minimal per-point auxiliary storage. This exploits coalesced memory access in the update kernels and achieves nearly ideal scaling (Wald, 2022).
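The left-balanced layout can be sketched on the CPU. The key ingredient is computing, for a subtree of $n$ nodes, how many nodes its left child owns; that count determines which element of the sorted point range becomes the subtree root. This is a sequential stand-in (illustrative names) for the sort-based GPU kernels, not Wald's actual code:

```python
def left_subtree_size(n: int) -> int:
    """Size of the left subtree of a left-balanced complete tree on n nodes."""
    if n <= 1:
        return 0
    h = n.bit_length() - 1            # depth of the deepest level
    last = n - (2 ** h - 1)           # nodes on the deepest level
    return (2 ** (h - 1) - 1) + min(last, 2 ** (h - 1))

def build_left_balanced(points, tree, node=0, dim=0):
    """Fill `tree` (a list with len(points) slots) in level order:
    children of slot i live at slots 2i+1 and 2i+2."""
    if not points:
        return
    points = sorted(points, key=lambda p: p[dim])
    l = left_subtree_size(len(points))
    tree[node] = points[l]            # the element with exactly l smaller points
    nd = (dim + 1) % len(points[0])
    build_left_balanced(points[:l], tree, 2 * node + 1, nd)
    build_left_balanced(points[l + 1:], tree, 2 * node + 2, nd)
```

Because the tree is left-balanced, slots $0..n-1$ are filled exactly, so no child pointers are stored; this is what makes the in-place, array-only GPU representation possible.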
Presorted and Morton/Z-Order Approaches
Alternative Pkd-tree builds presort index arrays in each dimension and recursively partition without further per-split sorts, achieving $O(kn \log n)$ work and efficient parallelization for small $k$ ("presort Pkd-tree") (Brown, 2014). The Morton/Z-order "zd-tree" hybrid maps points to a layout that preserves spatial locality, allowing recursive parallel splits driven by Morton codes with work-efficient parallel construction (Dobson et al., 2021). This hybrid is especially cache-efficient in low dimensions.
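A minimal sketch of 2-D Morton (Z-order) encoding, the bit-interleaving primitive underlying the zd-tree layout; the function name and bit width are illustrative:

```python
def morton2d(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two `bits`-bit integer coordinates into a
    single Morton code; spatially nearby points get numerically nearby codes."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x occupies even bit positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y occupies odd bit positions
    return code
```

Sorting points by their Morton code yields the locality-preserving layout; recursive splits then correspond to fixing successive high-order bits of the code.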
Distributed-Memory Construction
For distributed settings, scalable Pkd-tree construction is achieved using Map–Reduce passes to collect truncated orthogonal moment statistics, which enable approximate median splits in a single global communication round, with strong balancing guarantees for high-data-volume, high-node-count regimes (Chakravorty et al., 2022).
3. Batch-Dynamic Updates and Weight-Balanced Maintenance
Batch updates (bulk insert and delete) in Pkd-trees are managed by reconstruction-based balancing. For each batch insertion or deletion, affected subtrees are identified via sieving, and only those violating the specified $\alpha$-weight balance are reconstructed in parallel, keeping amortized work low and span polylogarithmic (Men et al., 2024). The key steps:
- Sieve incoming batch to affected regions/buckets in the tree.
- Traverse skeleton top-down. If weight-balance violated at any node, flatten and rebuild that subtree in parallel; otherwise, recurse both subtrees.
- In expectation, only $O(1)$ rebuilds occur per root-to-leaf path.
- Deletions sieve to identify points present in the region and remove accordingly, triggering local rebuilds as needed.
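The sieve-and-rebuild policy above can be mimicked sequentially. This sketch (illustrative names, fixed dimension `D = 2`, no parallelism) rebuilds a subtree exactly when an incoming batch would break its weight balance:

```python
D = 2          # point dimension, fixed for this sketch
LEAF = 4       # leaf size threshold (illustrative)
ALPHA = 0.7    # weight-balance parameter (illustrative)

def build(pts, dim=0):
    # plain median-split kd-tree; leaves hold point lists
    if len(pts) <= LEAF:
        return {"pts": pts}
    pts = sorted(pts, key=lambda p: p[dim])
    m = len(pts) // 2
    nd = (dim + 1) % D
    return {"dim": dim, "split": pts[m][dim],
            "left": build(pts[:m], nd), "right": build(pts[m:], nd)}

def flatten(n):
    return list(n["pts"]) if "pts" in n else flatten(n["left"]) + flatten(n["right"])

def size(n):
    return len(n["pts"]) if "pts" in n else size(n["left"]) + size(n["right"])

def insert_batch(n, batch, dim=0):
    # sieve the batch down; rebuild any subtree whose alpha-balance breaks
    if "pts" in n:
        return build(n["pts"] + batch, dim)
    lo = [p for p in batch if p[n["dim"]] < n["split"]]
    hi = [p for p in batch if p[n["dim"]] >= n["split"]]
    nl, nr = size(n["left"]) + len(lo), size(n["right"]) + len(hi)
    if max(nl, nr) > ALPHA * (nl + nr):
        return build(flatten(n) + batch, dim)   # flatten-and-rebuild
    nd = (n["dim"] + 1) % D
    if lo:
        n["left"] = insert_batch(n["left"], lo, nd)
    if hi:
        n["right"] = insert_batch(n["right"], hi, nd)
    return n
```

The parallel algorithm performs the same case analysis but rebuilds violating subtrees with the parallel construction routine and recurses into both children concurrently.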
The BDL-tree ("Batch-Dynamic Log-structured k-d tree") models batch updates via a collection of static kd-trees with exponentially increasing capacities and a small buffer. Binary "carry" arithmetic determines which trees to restructure, rebuilding only the affected trees in parallel (Yesantharao et al., 2021, Wang et al., 2022). Empirically, BDL-trees achieve substantial self-relative parallel speedup and perform millions of insertions/deletions per second on machines with 36+ cores.
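The binary-counter organization can be sketched with sorted lists standing in for static kd-trees; `BUF`, the class name, and the methods are illustrative, and the "rebuild" is just concatenation here:

```python
BUF = 4  # buffer capacity (illustrative)

class BDLSketch:
    """Log-structured collection: a buffer plus static 'trees' where level i
    holds BUF * 2**i items, merged with binary-counter carries on overflow."""

    def __init__(self):
        self.buffer = []
        self.trees = []  # trees[i] is None or a list of BUF * 2**i items

    def insert(self, batch):
        self.buffer.extend(batch)
        while len(self.buffer) >= BUF:
            carry, self.buffer = self.buffer[:BUF], self.buffer[BUF:]
            i = 0
            # binary-counter carry: merge equal-size levels upward
            while i < len(self.trees) and self.trees[i] is not None:
                carry += self.trees[i]   # real BDL-trees rebuild a kd-tree here
                self.trees[i] = None
                i += 1
            if i == len(self.trees):
                self.trees.append(None)
            self.trees[i] = carry
```

Queries consult the buffer plus every occupied level; since there are $O(\log n)$ levels, each query touches logarithmically many static trees.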
4. Query Algorithms: k-Nearest Neighbor, Range, and Counting
Standard search queries (including -NN, range, and range-count) map naturally onto the Pkd-tree, leveraging its balanced depth and cache-aware layout:
- k-Nearest Neighbor (k-NN) Search: DFS with branch-and-bound pruning on split hyperplanes, maintaining a bounded max-heap of candidates. Per-query complexity remains polylogarithmic in expectation on well-distributed inputs, and batched queries are evaluated in parallel with thread-local buffers (Men et al., 2024). In the zd-tree variant, best-first search in Morton order provides further cache locality (Dobson et al., 2021).
- Range and Range-Count Queries: Orthogonal queries descend the tree, recursively exploring subtrees whose geometric region intersects the query box. If a subtree's region is fully contained, all its points can be reported or counted without further recursion. Per-query cost is $O(n^{1-1/d} + k)$ for reporting $k$ output points and $O(n^{1-1/d})$ for counting (Men et al., 2024).
- Monte Carlo Stratified Sampling: Implicit kd-tree stratification enables generation of nearly low-discrepancy sample sets, with each sample generated independently in logarithmic time in parallel, supporting high-dimensional graphics applications (Keros et al., 2020).
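The branch-and-bound k-NN routine above can be sketched over a simple nested-tuple kd-tree (all names illustrative; a real Pkd-tree would use its cache-aware layout and evaluate query batches in parallel):

```python
import heapq

def build(pts, dim=0, leaf=4):
    # nested-tuple kd-tree: leaves are point lists,
    # interior nodes are (dim, split, left, right)
    if len(pts) <= leaf:
        return pts
    pts = sorted(pts, key=lambda p: p[dim])
    m = len(pts) // 2
    nd = (dim + 1) % len(pts[0])
    return (dim, pts[m][dim], build(pts[:m], nd, leaf), build(pts[m:], nd, leaf))

def knn(tree, q, k):
    heap = []  # bounded max-heap of (-dist2, point)

    def dist2(p):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def visit(node):
        if isinstance(node, list):           # leaf: scan candidates
            for p in node:
                d = dist2(p)
                if len(heap) < k:
                    heapq.heappush(heap, (-d, p))
                elif d < -heap[0][0]:
                    heapq.heapreplace(heap, (-d, p))
            return
        dim, split, left, right = node
        near, far = (left, right) if q[dim] < split else (right, left)
        visit(near)
        gap = q[dim] - split
        # prune the far side if the splitting plane is farther away
        # than the current k-th best distance
        if len(heap) < k or gap * gap < -heap[0][0]:
            visit(far)

    visit(tree)
    return sorted((-d, p) for d, p in heap)  # ascending by squared distance
```

The pruning test is the branch-and-bound step: a subtree can only improve the answer if its side of the splitting hyperplane is closer than the current k-th best candidate.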
5. Hardware-Specific and Hybrid Parallelism
GPU Architectures
GPU-optimized algorithms exploit massive SIMD parallelism and global memory bandwidth:
- Almost-in-place level-order construction, where each thread re-tags points based on current subtree membership and split coordinate, using efficient global key-value radix sort and per-point integer tags (Wald, 2022).
- Parallel traversals for force computations and neighbor searches, where per-particle tree-walks are performed in parallel with no inter-thread dependencies. Localized Morton ordering maximizes cache-line reuse. Achieved speedups of 2.4–2.5× over brute-force methods on 800K+ particles (Nakasato, 2011).
Multi-GPU and Chunked Processing
Buffer k-d trees distribute queries and/or reference data in host-managed buffers, employing parallel CPU traversals to map queries to leaves and then batching brute-force neighbor searches on each GPU device. Chunking and pipelining host-device memory transfers achieve near-linear multi-GPU speedup with minimal overhead: slowdowns under 5% for 10+ million points (Gieseke et al., 2015).
6. Experimental Evaluation and Empirical Benchmarks
Recent implementations of Pkd-tree and BDL-tree achieve:
- Construction times: 3.65 s (Pkd) vs. 45 s (single-tree ParGeo Log-tree) vs. 1079 s (CGAL) on the same large 3D dataset (Men et al., 2024).
- Batch insert (1%): 0.107 s (Pkd) vs. 2.66 s (Log-tree), 40.3 s (BDL-tree), 1815 s (CGAL).
- Batch delete (1%): 0.134 s (Pkd), 0.485 s (Log-tree), 39.3 s (BDL-tree), 41.3 s (CGAL).
- k-NN search (10-NN): 0.822 s (Pkd), 4.48 s (Log-tree), 1.02 s (BDL-tree), 2.30 s (CGAL).
- Range-report and range-count operations show correspondingly large speedups for Pkd over the competing implementations.
- Parallel speedup: up to $35$–$59\times$ on $96$ cores for build/insert/delete, with parallel efficiency maintained at large scale.
Cache-aware design reduces L2 cache misses and shrinks the overall memory footprint (10–12 GB for Pkd, severalfold less than alternatives).
The balancing parameter $\alpha$ modulates the update/query trade-off: within its admissible range, update cost remains within a constant factor of a full rebuild, while query cost degrades by only a constant factor.
7. Limitations, Applications, and Future Directions
Current Pkd-tree implementations are optimized for in-memory settings; extensions to out-of-core, distributed-memory (MPI/MPC), or persistent/NVM environments remain open. GPU adaptation of the sieve and parallel recursion for extremely large data sets is suggested.
Applications include:
- Large-scale spatial and similarity search in scientific data analysis (e.g., astrophysics, plasma, and particle physics) (Patwary et al., 2016).
- Computational geometry (ParGeo), high-dimensional sampling (Monte Carlo renderers), real-time analytics, and streaming scenarios.
- Distributed frameworks for decision trees, clustering, and locality-sensitive structures in data science (Chakravorty et al., 2022).
Possible future work includes support for dynamic or variable dimensions, approximate queries, density and clustering extensions, succinct/space-optimized layouts, and multi-version concurrency.
Key references:
- "Parallel kd-tree with Batch Updates" (Men et al., 2024)
- "Parallel Batch-Dynamic k-d Trees" (Yesantharao et al., 2021), "ParGeo" (Wang et al., 2022)
- "GPU-friendly, Parallel, and (Almost-)In-Place Construction of Left-Balanced k-d Trees" (Wald, 2022)
- "Parallel Nearest Neighbors in Low Dimensions with Batch Updates" (Dobson et al., 2021)
- "Bigger Buffer k-d Trees on Multi-Many-Core Systems" (Gieseke et al., 2015)
- "Scalable k-d trees for distributed data" (Chakravorty et al., 2022)
- "Implementation of a Parallel Tree Method on a GPU" (Nakasato, 2011)
- "Building a Balanced k-d Tree in O(kn log n) Time" (Brown, 2014)
- "Jittering Samples using a kd-Tree Stratification" (Keros et al., 2020)