
LEANN: Low-Storage Vector Index

Updated 16 January 2026
  • LEANN is a low-storage vector index that prunes HNSW-style graphs to reduce storage below 5% of raw data while maintaining over 90% top-3 recall.
  • It eliminates persistent full-precision embeddings by using on-the-fly recomputation and dynamic batching, optimizing both memory and compute resources.
  • Empirical evaluations show LEANN achieves sub-2 second query latency and competitive recall, making it ideal for embedding-based search on personal devices.

LEANN (Low-Storage Efficient Approximate Nearest Neighbor) is a vector index designed for embedding-based search in resource-constrained environments, such as personal devices. It addresses the prohibitive storage requirements of standard ANN (Approximate Nearest Neighbor) indices by employing an aggressively pruned proximity graph and on-the-fly embedding recomputation, enabling high-recall and low-latency search with storage overhead below 5% of the raw data size (Wang et al., 9 Jun 2025).

1. Motivation and Background

Embedding-based search, integral to applications like recommendation and retrieval-augmented generation (RAG), utilizes high-dimensional vectors to encode semantic similarity. Standard ANN indices, including graph-based HNSW and cluster-based Faiss IVF, incur substantial storage overhead due to both full-precision embeddings and index metadata (neighbor lists, cluster assignments). Typical overheads range between 150–700% of the raw document size—for instance, 100 GB of data may require 150–700 GB in ANN index storage—which is manageable in datacenter-scale infrastructure but impractical for personal devices where SSD/flash storage is at a premium. The design objective for LEANN is to reduce index storage below 5% of raw data while sustaining greater than 90% top-3 recall and sub-2 s query latency on real-world QA benchmarks (Wang et al., 9 Jun 2025).

2. Architecture and Data Structures

LEANN’s architecture is based on a pruned, HNSW-style proximity graph and leverages on-the-fly recomputation to eliminate full-precision embedding storage post-construction.

  • Node Representation: Each node corresponds to a document chunk (e.g., 256 tokens), and is identified by a unique integer ID.
  • Edges and Storage: Graph edges are stored in a compressed sparse row (CSR) format, with each neighbor link requiring 4 bytes. LEANN applies selective degree pruning, preserving a small fraction (a%) of high-degree “hub” nodes at maximum degree M and reducing all other nodes to a much smaller degree m ≪ M.
  • Storage Complexity: For n chunks with node degrees D_i, the graph’s on-disk size is S_graph = Σ_{i=1}^{n} D_i × 4 bytes, with S_graph / S_raw < 5% in practice. As an example, for 76 GB of raw data, the pruned LEANN graph occupies approximately 3 GB.
  • Time Complexity: Index construction is dominated by the O(n log n) cost of building the initial HNSW graph. Query time involves best-first traversal of ef nodes. Each traversal step can trigger recomputation of up to Σ_{i=1}^{ef} D_i embeddings. An embedding server processing B chunks/s yields expected query latency T_query ≈ (Σ_{i=1}^{ef} D_i) / B + T_overhead.
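The storage and latency models above can be sketched numerically. The parameter values below (hub fraction, degrees, server throughput, overhead) are illustrative assumptions, not the paper’s exact settings:

```python
# Back-of-envelope storage and latency model for a pruned LEANN-style graph.

def graph_bytes(n: int, hub_frac: float, M: int, m: int) -> int:
    """On-disk CSR edge storage: 4 bytes per neighbor link.
    hub_frac of the n nodes keep degree M; the rest are pruned to degree m."""
    hubs = int(n * hub_frac)
    return (hubs * M + (n - hubs) * m) * 4

def query_latency(ef: int, avg_degree: float, throughput: float,
                  overhead: float = 0.1) -> float:
    """Expected seconds per query: recompute roughly ef * avg_degree
    embeddings at `throughput` chunks/s, plus a fixed overhead term."""
    return (ef * avg_degree) / throughput + overhead

# 60M chunks, as in the paper's 76 GB corpus; other values are assumed.
size = graph_bytes(60_000_000, hub_frac=0.02, M=30, m=8)
print(f"graph size ≈ {size / 1e9:.1f} GB")  # a few GB, well under 5% of 76 GB
```

With these assumed degrees the graph lands around 2 GB, the same order as the ≈3 GB figure reported for the 76 GB corpus.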

The index construction and query routines are formally described in two core algorithms:

  • Algorithm 1: Preserves high-degree nodes during graph pruning by maintaining the top a% of hubs at degree M, while all other nodes are pruned to a lower degree m via best-first neighbor selection and bidirectional edge creation, subject to incoming-edge limits.
  • Algorithm 2: Implements a two-level search strategy with dynamic batching. An initial pool of candidate nodes is expanded by pushing approximate-distance-scored neighbors into a queue, from which the top a% are selected for exact distance computation in batches of size B (typically 64–128, tuned for hardware throughput), balancing I/O and compute demands.
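As a rough sketch, the hub-preserving pruning of Algorithm 1 might look like the following. The graph representation and names are assumptions, and the bidirectional edge creation with incoming-edge limits is omitted for brevity:

```python
# Hedged sketch of hub-preserving degree pruning (in the spirit of Algorithm 1).
# `graph` maps node id -> list of (neighbor_id, distance) pairs.

def prune_graph(graph, hub_fraction=0.02, M=30, m=8):
    # Rank nodes by out-degree; the top hub_fraction keep up to M neighbors.
    by_degree = sorted(graph, key=lambda v: len(graph[v]), reverse=True)
    n_hubs = max(1, int(len(by_degree) * hub_fraction))
    hubs = set(by_degree[:n_hubs])

    pruned = {}
    for v, nbrs in graph.items():
        cap = M if v in hubs else m
        # Best-first selection: keep only the `cap` closest neighbors.
        pruned[v] = [u for u, _ in sorted(nbrs, key=lambda e: e[1])[:cap]]
    return pruned
```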

3. On-the-Fly Recomputation and Batching

A defining feature of LEANN is its complete elimination of persistent embedding vectors for all but a minuscule, optionally cached, fraction of hub nodes. After the offline index build, embeddings are discarded; at query time, the index uses lightweight PQ-compressed data (ca. 2 GB for screening) and dynamically batches the recomputation workload for only the top a% of candidates, leveraging device GPUs to maximize throughput.

This approach introduces a storage/computation trade-off governed by the parameters ef, a, B, M, and m. Increasing ef or the embedding dimensionality raises recomputation cost, but large batch sizes and lightweight embedding models can offset the added latency. The system design allows practitioners to tune these hyperparameters to meet specific accuracy and latency requirements while enforcing strict storage budgets. For even more aggressive memory management, caching a small percentage of embeddings for high-degree nodes on local flash/NVMe enables further improvement at the cost of increased SSD I/O, which can become a bottleneck at high hit rates (Wang et al., 9 Jun 2025).
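A minimal sketch of the two-level search with batched re-ranking, in the spirit of Algorithm 2: `pq_distance`, `embed_batch`, and `exact_distance` are placeholders for the PQ screening stage, the embedding server, and the exact scorer, and all names here are illustrative:

```python
import heapq

def two_level_search(query, entry, graph, pq_distance, embed_batch,
                     exact_distance, ef=50, rerank_frac=0.2, batch_size=64, k=3):
    visited = {entry}
    # Min-heap of (approximate_distance, node) candidates.
    frontier = [(pq_distance(query, entry), entry)]
    pool = []
    while frontier and len(pool) < ef:
        d, v = heapq.heappop(frontier)
        pool.append((d, v))
        for u in graph[v]:
            if u not in visited:
                visited.add(u)
                heapq.heappush(frontier, (pq_distance(query, u), u))
    # Re-rank only the top rerank_frac of the pool with exact embeddings,
    # recomputed on the fly in batches of batch_size.
    pool.sort()
    top = [v for _, v in pool[:max(1, int(len(pool) * rerank_frac))]]
    exact = []
    for i in range(0, len(top), batch_size):
        batch = top[i:i + batch_size]
        embs = embed_batch(batch)  # batched recomputation on GPU in practice
        exact += [(exact_distance(query, e), v) for v, e in zip(batch, embs)]
    return [v for _, v in sorted(exact)[:k]]
```

The key point the sketch illustrates is that exact embeddings are only ever materialized for the re-ranked slice, in hardware-sized batches.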

4. Empirical Performance

Evaluation on a 76 GB corpus (60 million chunks) demonstrates substantial gains in storage efficiency without material loss in search quality. The following table summarizes ANN methods on an NVIDIA A10 platform:

| Method | Size (% of 76 GB) | Recall@3 | Latency (s) |
|---|---|---|---|
| BM25 | — | 0.65 | 1.2 |
| HNSW (in-mem) | 225% (171 GB) | 0.90 | 0.04 |
| DiskANN | 250% (190 GB) | 0.90 | 0.05 |
| IVF-Disk | 225% (171 GB) | 0.90 | 0.30 |
| Edge-RAG (IVF rc) | <0.1% | 0.90 | 40.0 |
| LEANN | 3.8% | 0.90 | 1.8 |

LEANN consistently remained below 5% storage overhead and achieved over 90% recall@3 with query latency under 2 seconds across four QA benchmarks (NQ, TriviaQA, HotpotQA, GPQA). On Mac hardware, other ANN indices were out of memory or exhibited substantially higher resource requirements. Downstream RAG accuracy with Llama-3.2-1B showed LEANN nearly identical to full ANN indices (EM and F1 within 1%), significantly outperforming BM25/PQ approaches due to superior recall (Wang et al., 9 Jun 2025).

5. Trade-offs, Limitations, and Extensions

LEANN’s core trade-off is between recomputation overhead and storage minimization. For extremely high-dimensional embeddings or very large ef, latency may increase disproportionately; lighter embedding models (e.g., GTE-small) can halve query latency at a minor (<2%) accuracy cost. LEANN does not yet support online incremental updates; any dynamic data addition necessitates offline rebuilding of the full pruned HNSW graph, involving peak storage roughly equal to the sum of raw data and embeddings. Caching high-degree node embeddings provides accelerated access but is bounded by I/O limits at high cache utilization.

Potential extensions include multi-level (hierarchical) pruning across HNSW levels, using learned edge importance metrics beyond degree-based heuristics, and deploying hybrid quantization with selective recomputation for intermediate-degree nodes (Wang et al., 9 Jun 2025).

6. Deployment Guidance and Practical Considerations

Practical deployment of LEANN involves several steps:

  • Offline Construction: Build the underlying HNSW using standard settings (M = 30, ef_build = 128), followed by pruning to enforce storage constraints.
  • On-Device Operation: Persist the CSR graph on flash or SSD, ensuring the embedding server (e.g., TensorRT-optimized) and PQ tables (approx. 2 GB) fit entirely in DRAM. The graph itself (≈3 GB) can remain on secondary storage if access time is under 10 ms.
  • Tuning: Select ef, the re-ranking ratio a, and batch size B through calibration to reach the desired latency and recall. Optionally, cache a subset (p%) of high-degree node embeddings for additional latency reduction.
  • Requirements: The model supports deployment on commodity CPUs and device-level GPUs, making it suitable for smartphones, laptops, or small form-factor workstations.
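The calibration step can be framed as a small feasibility sweep. Here `evaluate` stands in for a real harness that measures recall@3 and latency on a held-out query set; all names and candidate values are hypothetical:

```python
# Illustrative tuning sweep over (ef, re-ranking ratio a, batch size B).

def pick_config(candidates, evaluate, latency_budget=2.0, min_recall=0.90):
    """Return the lowest-latency config meeting both targets, else None.
    `evaluate(cfg)` must return (recall_at_3, latency_seconds)."""
    feasible = []
    for i, cfg in enumerate(candidates):
        recall, latency = evaluate(cfg)
        if recall >= min_recall and latency <= latency_budget:
            feasible.append((latency, i, cfg))  # index breaks latency ties
    return min(feasible)[2] if feasible else None

candidates = [dict(ef=ef, a=a, B=B)
              for ef in (32, 64, 128) for a in (0.1, 0.2) for B in (64, 128)]
```

This mirrors the stated deployment targets: >90% recall@3 within a sub-2 s latency budget, at minimum cost.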

A plausible implication is that, by trading minimal on-the-fly compute for aggressive storage reduction, LEANN enables practical, privacy-preserving embedding search and RAG on devices where conventional ANN structures are infeasible, without sacrificing recall or latency relative to full-index baselines (Wang et al., 9 Jun 2025).
