
DET-LSH (DE-Tree): Dynamic Encoding Trees for ANN

Updated 17 February 2026
  • DET-LSH is a family of LSH schemes that uses dynamic encoding trees (DE-Trees) to achieve efficient approximate nearest neighbor search in high-dimensional spaces.
  • It constructs adaptive, balanced tree indexes by partitioning multi-dimensional projections into symbolic codes, resulting in reduced indexing cost and improved query performance.
  • Empirical evaluations show up to 32× indexing speedup and over 90% recall across large datasets, demonstrating significant advantages over traditional LSH methods.

DET-LSH (Dynamic Encoding Tree Locality-Sensitive Hashing, "DE-Tree")

Dynamic Encoding Tree Locality-Sensitive Hashing (DET-LSH) refers to a family of Locality-Sensitive Hashing (LSH) schemes utilizing adaptive, encoding-driven tree structures—termed Dynamic Encoding Trees (DE-Trees)—for efficient approximate nearest neighbor (ANN) search in high-dimensional spaces. These approaches are characterized by hierarchical, dynamically balanced tree indexes which encode multi-dimensional projections into symbolic representations. DET-LSH and related DE-Tree methods address the inefficiencies of traditional LSH by reducing indexing cost and supporting efficient, high-recall queries, with rigorous theoretical and empirical validation on massive datasets (Wei et al., 2024, Davoodi et al., 2019).

1. Background and Motivation

DET-LSH emerges within the context of LSH, which provides sublinear-time $(r,c)$-ANN search in high-dimensional metric spaces via randomized hash families that map similar points to the same buckets with higher probability than dissimilar points. Traditional LSH variants, oriented toward metric adaptivity (e.g., p-stable LSH for $\ell_2$), generally employ simple hash tables or splitting trees as index structures, but do not optimize the efficiency or adaptivity of the indexing mechanism itself.

In contrast, DET-LSH and the DE-Tree family generalize LSH to:

  • Index both continuous real-valued data and discrete probability distributions.
  • Optimize both the partitioning of the index space and the query mechanism for theoretical and empirical gains in speed and recall.
  • Incorporate adaptive partitioning (dynamic encoding) to align tree structure with data distribution or joint query/data model (Wei et al., 2024, Davoodi et al., 2019).

2. Dynamic Encoding Tree (DE-Tree) Structure and Construction

A DE-Tree is a balanced, adaptive hierarchical index whose nodes correspond to multi-dimensional symbolic code ranges. Key construction steps and features:

Dynamic iSAX-based Encoding:

Each dimension is projected (e.g., via p-stable hash functions for $\ell_2$), then independently divided into $N_r$ equiprobable regions by computing approximate quantiles via sampling and recursive QuickSelect. This yields a $K$-dimensional symbolic code (iSAX code) with one symbol per projection per data point. Encoding a new point requires a binary search among the breakpoints of each axis.
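As a concrete illustration, the sampling-based breakpoint fitting and binary-search encoding of a single projection axis can be sketched as follows (function names, and the use of `np.quantile` in place of recursive QuickSelect, are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def fit_breakpoints(projections, n_regions=256, sample_size=10_000, seed=0):
    """Estimate equiprobable region boundaries for one projection axis by
    approximating the (i / n_regions)-quantiles on a random sample."""
    rng = np.random.default_rng(seed)
    sample = rng.choice(projections, size=min(sample_size, len(projections)),
                        replace=False)
    qs = np.arange(1, n_regions) / n_regions
    return np.quantile(sample, qs)  # (n_regions - 1) sorted breakpoints

def encode(projections, breakpoints):
    """Map each projected value to its region index via binary search."""
    return np.searchsorted(breakpoints, projections)  # symbols in [0, n_regions)
```

With $N_r$ regions the breakpoints array has $N_r - 1$ entries, and each lookup costs $O(\log N_r)$, matching the indexing cost stated below.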

Tree Construction:

  • The root has $2^K$ children corresponding to the 1-bit iSAX expansions. Nodes store $K$ symbolic code-ranges defining axis-aligned regions.
  • Insertion descends via code-ranges; a leaf is split when its size exceeds max_size, refining the symbolic code in the dimension that yields the most balanced split.
  • Splitting increases the codeword length by 1 in a single dimension; depth is bounded by $K\log_2 N_r$. Indexing cost is $O(n(d + \log N_r))$ with $O(n)$ space for $n$ points in $\mathbb{R}^d$ (Wei et al., 2024).
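The most-balanced-split rule above can be sketched in a few lines (a simplified helper under assumed representations: `symbols` holds each point's iSAX region indices, and `lo`/`hi` hold the leaf's per-dimension code ranges; this is not the authors' implementation):

```python
import numpy as np

def most_balanced_split(symbols, lo, hi):
    """Pick the dimension whose range-midpoint split is most balanced.

    symbols : (n, K) array of iSAX region indices for the points in a leaf.
    lo, hi  : length-K sequences giving each dimension's current code range
              [lo_k, hi_k); halving a range corresponds to extending that
              dimension's codeword by one bit.
    Returns (dimension, midpoint) minimizing |n_left - n_right|.
    """
    n = symbols.shape[0]
    best_dim, best_mid, best_imbalance = -1, -1, n + 1
    for k in range(symbols.shape[1]):
        if hi[k] - lo[k] < 2:  # codeword in dim k cannot be refined further
            continue
        mid = (lo[k] + hi[k]) // 2
        n_left = int(np.sum(symbols[:, k] < mid))
        imbalance = abs(n_left - (n - n_left))
        if imbalance < best_imbalance:
            best_dim, best_mid, best_imbalance = k, mid, imbalance
    return best_dim, best_mid
```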

Distributed Forests (ForestDSH):

For tasks involving discrete alphabets $A^S$ (e.g., query and database matching under a joint distribution $P$), DE-Trees are generalized to forests of decision trees ("ForestDSH"). Node acceptance, splitting, or pruning is determined by likelihood-based thresholds involving $P$ and the marginal distributions, guaranteeing that collision probabilities align with true query-data associations (Davoodi et al., 2019).
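A schematic of the likelihood-based node decision might look as follows; the real ForestDSH rule derives its thresholds from $P$ and the tuned cost exponent, so the fixed ratios below are purely illustrative:

```python
def classify_node(p_joint, p_marg_q, p_marg_d,
                  accept_ratio=2.0, prune_ratio=0.5):
    """Schematic node decision for a distribution-sensitive tree.

    p_joint    : probability of the node's (query, data) prefix pair under
                 the joint model P.
    p_marg_q/d : probabilities of the prefixes under the marginals.
    A node whose joint probability is high relative to the product of
    marginals is accepted as a bucket; one where it is low is pruned;
    otherwise the tree keeps splitting. Thresholds are illustrative only.
    """
    ratio = p_joint / (p_marg_q * p_marg_d)
    if ratio >= accept_ratio:
        return "accept"
    if ratio <= prune_ratio:
        return "prune"
    return "split"
```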

3. DET-LSH Query Algorithms and Theoretical Guarantees

DET-LSH integrates the DE-Tree index with LSH query strategies to optimize search accuracy and cost:

  • Projections and Encoding: For each query $q$, $L$ sets of $K$ projections are computed and encoded using the stored breakpoints.
  • Range Queries: Each DE-Tree is traversed, visiting all $2^K$ root children. Efficient lower and upper bounds on projected $\ell_2$ distances are used for subtree pruning and leaf candidate extraction.
  • Candidate Verification: Candidates collected across all trees are merged; early stopping occurs once the candidate set exceeds $\beta n$, with final selection via exact $\ell_2$ verification.
  • Correctness ($(r,c)$-ANN): Given suitable choices for the projection contraction factor $\epsilon$, the number of tables $L$, and the candidate threshold $\beta$, DET-LSH guarantees $(r,c)$-ANN with probability at least $1/2 - 1/e$. The probability that a true neighbor is found in no tree is $e^{-1}$ for typical parameter settings, and empirical recall exceeds 90% (Wei et al., 2024). Query time is $O(n(\beta d + \log N_r))$.
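Putting the steps above together, a highly simplified query loop might look like this (the per-table structures `trees`, `projectors`, and `breakpoints`, and the `range_candidates` method, are hypothetical stand-ins for the paper's index interface):

```python
import numpy as np

def det_lsh_query(q, trees, projectors, breakpoints, data, beta, k=1):
    """Sketch of a DET-LSH query: encode q per table, gather candidates
    from each DE-Tree, stop early at beta * n, then verify exactly."""
    n = len(data)
    budget = int(beta * n)             # early-stopping threshold beta * n
    candidates = set()
    for tree, A, bps in zip(trees, projectors, breakpoints):
        # encode the query's projections with this table's breakpoints
        code = np.array([np.searchsorted(b, p) for b, p in zip(bps, A @ q)])
        candidates.update(tree.range_candidates(code))
        if len(candidates) >= budget:  # early stopping
            break
    # exact l2 verification over the merged candidate set
    cand = np.fromiter(candidates, dtype=int)
    dists = np.linalg.norm(data[cand] - q, axis=1)
    return cand[np.argsort(dists)[:k]]
```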

4. Algorithmic Complexity and Collision Analysis

The DET-LSH index achieves $O(n)$ space and $O(n(d + \log N_r))$ build time. Query complexity is dominated in practice by the number of candidates examined and the depth of the DE-Trees. Analytical bounds and empirical studies confirm:

  • Indexing throughput up to 200k inserts/sec (vs. 1k/sec for HNSW).
  • Query speedup up to $2\times$ versus leading LSH baselines at equivalent recall.
  • The multi-table DE-Tree structure significantly reduces the probability of missing exact nearest neighbors while keeping the false-positive rate (random collisions) controllable via $\beta$ and $L$ (Wei et al., 2024, Davoodi et al., 2019).

Collision probabilities in each tree are rigorously analyzed via LSH theory and, for distributional search contexts, via the ForestDSH exponent $\lambda^*(P)$, with total computational cost $O(N^{\lambda^*})$ (Davoodi et al., 2019). True-positive and false-positive rates across bands or tables are explicitly controlled.
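The table-amplification arithmetic behind these rates is easy to verify numerically; the helper below assumes independent tables, which is the standard LSH analysis setting:

```python
import math

def miss_probability(p_hit, n_tables):
    """Probability that a true neighbor is retrieved by none of the
    n_tables independent tables, each hitting it with probability p_hit."""
    return (1.0 - p_hit) ** n_tables

def false_positive_rate(p_far, n_tables):
    """Probability that a far point enters the candidate set via at least
    one table (bounded above by n_tables * p_far)."""
    return 1.0 - (1.0 - p_far) ** n_tables

# With L = 4 tables, a per-table hit probability of 1 - e^{-1/4} (about 0.221)
# reproduces the overall e^{-1} miss rate quoted above:
assert abs(miss_probability(1 - math.exp(-0.25), 4) - math.exp(-1)) < 1e-12
```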

5. Empirical Performance and Comparative Study

DET-LSH and its DE-Tree variants have been benchmarked across real and synthetic datasets:

Dataset      Indexing Speedup    Query Speedup   Recall
Sift100M     —                   —               0.98
Tiny80M      —                   1.8×            0.91
SPACEV500M   32× (vs. DB-LSH)    2.3×            0.96
  • DET-LSH achieves 3× smaller index size and recall >90% on large-scale benchmarks.
  • Update throughput is two orders of magnitude higher than HNSW.
  • Parameter studies confirm an optimal max_size in $[256, 512]$, and that $L=4$, $K=16$, $N_r=256$, $\beta \approx 0.1$ provide robust trade-offs (Wei et al., 2024).

In discrete settings, ForestDSH achieves an $8\times$ speedup over brute force and $7\times$ over classical LSH in mass-spectrometry search while maintaining equal search quality (Davoodi et al., 2019). For data-dependent similarity queries (such as merge-tree or structured-descriptor comparison), DET-LSH-style range search with DE-Trees also provides a significant advantage for efficient retrieval (Lyu et al., 2024).

6. Practical Recommendations, Limitations, and Future Directions

Best practices for DET-LSH include fixing $K=16$, $L=4$, $N_r=256$, and $c=1.5$ for $\ell_2$ tasks; increasing $L$ or relaxing $\beta$ for higher recall; and setting max_size appropriately to minimize leaf scans. Encoding and tree construction are parallelizable across tables and axes.

Limitations include:

  • Theoretical recall bound $1/2-1/e$ is modest; higher recall requires more tables or data-dependent projection schemes (e.g., PCA+LSH).
  • Online adaptation to non-stationary data is not yet supported; there is currently no mechanism for dynamic rebalancing.
  • SIMD or GPU optimization of encoding (especially sampling and QuickSelect) remains an open hardware acceleration challenge.
  • Generalization to alternative distance metrics (cosine, Jaccard) depends on the feasibility of appropriate symbolic encoding schemes (Wei et al., 2024).

DE-Trees and DET-LSH generalize naturally to discrete and structured data beyond Euclidean $\ell_2$:

  • ForestDSH (Distribution-Sensitive Hashing): Implements forests of DE-Trees parameterized on joint/marginal distributions, optimizing true collision rates and the cost exponent $\lambda^*$ for rigorous theoretical efficiency and robustness to noise in the learned distribution (Davoodi et al., 2019).
  • Merge Tree Analysis in Topological Data: DET-LSH variants enable efficient similarity search and clustering for combinatorial tree structures (e.g., merge trees, persistence diagrams) using banded LSH signatures derived from bottom-up min-hash or subpath statistics (Lyu et al., 2024). This achieves a $10$–$30\times$ speedup over edit- or interleaving-distance computation, with empirical precision and recall comparable to exact algorithms.
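The banded min-hash signatures mentioned for merge trees follow the standard LSH banding pattern; the sketch below assumes a precomputed set of hashable subpath features per tree (feature extraction itself is out of scope here):

```python
import hashlib

def minhash_signature(features, n_hashes=64):
    """Min-hash signature over a set of hashable features (e.g. subpath
    statistics extracted from a merge tree)."""
    sig = []
    for i in range(n_hashes):
        # each i seeds an independent hash; keep the minimum over the set
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{i}|{x}".encode(), digest_size=8).digest(),
                "big")
            for x in features
        ))
    return sig

def band_keys(sig, bands=16):
    """Split the signature into bands; two trees become candidates if any
    band key matches, amplifying Jaccard similarity as in banded LSH."""
    rows = len(sig) // bands
    return [tuple(sig[b * rows:(b + 1) * rows]) for b in range(bands)]
```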

Overall, DET-LSH advances the state-of-the-art in high-dimensional ANN and structured similarity search by combining interpretability, rigorous guarantees, and massive empirical scale through the adaptive, encoding-based DE-Tree index framework.
