LEANN: Low-Storage Vector Index
- LEANN is a low-storage vector index that prunes HNSW-style graphs to reduce storage below 5% of raw data while maintaining over 90% top-3 recall.
- It eliminates persistent full-precision embeddings by using on-the-fly recomputation and dynamic batching, optimizing both memory and compute resources.
- Empirical evaluations show LEANN achieves sub-2 second query latency and competitive recall, making it ideal for embedding-based search on personal devices.
LEANN (Low-Storage Efficient Approximate Nearest Neighbor) is a vector index designed for embedding-based search in resource-constrained environments, such as personal devices. It addresses the prohibitive storage requirements of standard ANN (Approximate Nearest Neighbor) indices by employing an aggressively pruned proximity graph and on-the-fly embedding recomputation, enabling high-recall and low-latency search with storage overhead below 5% of the raw data size (Wang et al., 9 Jun 2025).
1. Motivation and Background
Embedding-based search, integral to applications like recommendation and retrieval-augmented generation (RAG), uses high-dimensional vectors to encode semantic similarity. Standard ANN indices, including graph-based HNSW and cluster-based Faiss IVF, incur substantial storage overhead from both full-precision embeddings and index metadata (neighbor lists, cluster assignments). Typical overheads range from 150% to 700% of the raw document size; for instance, 100 GB of data may require 150–700 GB of ANN index storage. This is manageable in datacenter-scale infrastructure but impractical for personal devices, where SSD/flash storage is at a premium. LEANN's design objective is to reduce index storage below 5% of raw data while sustaining greater than 90% top-3 recall and sub-2 s query latency on real-world QA benchmarks (Wang et al., 9 Jun 2025).
2. Architecture and Data Structures
LEANN’s architecture is based on a pruned, HNSW-style proximity graph and leverages on-the-fly recomputation to eliminate full-precision embedding storage post-construction.
- Node Representation: Each node corresponds to a document chunk (e.g., 256 tokens), and is identified by a unique integer ID.
- Edges and Storage: Graph edges are stored in compressed sparse row (CSR) format, with each neighbor link requiring 4 bytes. LEANN applies selective degree pruning, preserving a small fraction of high-degree "hub" nodes at the original maximum degree M while reducing all other nodes to a much smaller degree m (with m ≪ M).
- Storage Complexity: For n chunks with average out-degree d̄, the graph's on-disk size is approximately 4 · n · d̄ bytes, with d̄ small (on the order of 10–12) in practice. As an example, for 76 GB of raw data (roughly 60 million chunks), the pruned LEANN graph occupies approximately 3 GB.
- Time Complexity: Index construction is dominated by the cost of building the initial HNSW graph. Query time involves best-first traversal over the pruned graph; each traversal step can trigger recomputation of up to m (or M, at hub nodes) neighbor embeddings. If a query recomputes R embeddings in total and the embedding server processes T chunks/s, the expected query latency is approximately R/T plus graph-traversal overhead.
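As a back-of-envelope check of the storage figures above, the CSR size of a pruned graph can be sketched as follows. The hub fraction and degree values here are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def csr_graph_bytes(degrees):
    """On-disk size of a CSR adjacency list with 4-byte neighbor IDs."""
    n = len(degrees)
    neighbor_bytes = 4 * int(degrees.sum(dtype=np.int64))  # one int32 per edge
    offset_bytes = 8 * (n + 1)                             # int64 row offsets
    return neighbor_bytes + offset_bytes

# Illustrative parameters: 60M chunks, 2% hubs kept at degree 64, all other
# nodes pruned to degree 10 (hypothetical values, not the paper's settings).
n, hub_frac, M, m = 60_000_000, 0.02, 64, 10
degrees = np.full(n, m, dtype=np.int8)
degrees[: int(hub_frac * n)] = M
print(f"pruned graph: {csr_graph_bytes(degrees) / 1e9:.2f} GB")
# → pruned graph: 3.14 GB
```

With these assumed degrees, the estimate lands close to the ~3 GB graph reported for the 76 GB corpus, which is what makes the sub-5% storage budget plausible.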
The index construction and query routines are formally described in two core algorithms:
- Algorithm 1: Preserves high-degree nodes during graph pruning by maintaining a small top fraction of hubs at degree M, while all other nodes are pruned to the lower degree m via best-first neighbor selection and bidirectional edge creation, subject to incoming-edge limits.
- Algorithm 2: Implements a two-level search strategy with dynamic batching. An initial pool of candidate nodes is expanded by pushing neighbors, scored with approximate (PQ) distances, into a queue; the top-ranked candidates are then selected for exact distance computation in batches of size B (typically 64–128, tuned for hardware throughput), balancing I/O and compute demands.
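The hub-preserving pruning step can be sketched in a few lines. This is a simplified illustration in the spirit of Algorithm 1: it keeps only best-first neighbor selection and omits bidirectional edge creation and incoming-edge limits, and the names `dist`, `hub_frac`, `M`, and `m` are assumptions, not the paper's API:

```python
import heapq

def prune_graph(adj, dist, hub_frac=0.02, M=64, m=10):
    """adj: {node: [neighbor, ...]}; dist(u, v) -> float.
    Keep the top hub_frac of nodes (ranked by degree) at up to M neighbors;
    prune every other node to its m closest neighbors."""
    by_degree = sorted(adj, key=lambda u: len(adj[u]), reverse=True)
    hubs = set(by_degree[: max(1, int(hub_frac * len(adj)))])
    return {
        u: heapq.nsmallest(M if u in hubs else m, nbrs, key=lambda v: dist(u, v))
        for u, nbrs in adj.items()
    }
```

Because only the retained adjacency lists are persisted, the storage saving follows directly from the degree cap chosen for non-hub nodes.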
3. On-the-Fly Recomputation and Batching
A defining feature of LEANN is its complete elimination of persistent embedding vectors for all but a minuscule, optionally cached fraction of hub nodes. After the offline index build, embeddings are discarded; at query time, the index uses lightweight PQ-compressed data (approximately 2 GB for candidate screening) and dynamically batches the recomputation workload for only the top-ranked fraction of candidates, leveraging device GPUs to maximize throughput.
This approach introduces a storage/computation trade-off governed by the hub fraction, the degrees M and m, the re-ranking fraction, and the batch size B. Increasing the candidate set or the embedding dimensionality raises recomputation cost, but large batch sizes and lightweight embedding models can offset the increased latency. The system design allows practitioners to tune these hyperparameters to meet specific accuracy and latency requirements while enforcing strict storage budgets. For even more aggressive memory management, caching a small percentage of embeddings for high-degree nodes on local flash/NVMe enables further improvement at the cost of increased SSD I/O, which can become a bottleneck at high hit rates (Wang et al., 9 Jun 2025).
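The query path can be sketched as PQ screening followed by batched exact re-ranking. Here `approx_dist` (a PQ lookup), `embed_batch` (the on-the-fly embedding server), and `exact_dist` are stand-ins for components the system provides, and all parameter names are assumptions for illustration:

```python
import heapq

def search(graph, entry, approx_dist, embed_batch, exact_dist,
           query, k=3, beam=100, rerank_frac=0.1, batch_size=64):
    # Level 1: best-first expansion using cheap approximate (PQ) distances.
    visited = {entry}
    frontier = [(approx_dist(query, entry), entry)]  # min-heap by PQ distance
    candidates = []
    while frontier and len(candidates) < beam:
        d, u = heapq.heappop(frontier)
        candidates.append((d, u))
        for v in graph[u]:
            if v not in visited:
                visited.add(v)
                heapq.heappush(frontier, (approx_dist(query, v), v))
    # Level 2: recompute embeddings only for the top fraction of candidates,
    # in hardware-sized batches, and re-rank with exact distances.
    top = [u for _, u in sorted(candidates)[: max(k, int(rerank_frac * beam))]]
    scored = []
    for i in range(0, len(top), batch_size):
        batch = top[i : i + batch_size]
        embs = embed_batch(batch)                    # on-the-fly recomputation
        scored += [(exact_dist(query, e), u) for u, e in zip(batch, embs)]
    return [u for _, u in sorted(scored)[:k]]
```

The batch size is the main knob for trading per-query latency against embedding-server utilization.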
4. Empirical Performance
Evaluation on a 76 GB corpus (60 million chunks) demonstrates substantial gains in storage efficiency without material loss in search quality. The following table summarizes ANN methods on an NVIDIA A10 platform:
| Method | Size (% of 76 GB) | Recall@3 | Latency (s) |
|---|---|---|---|
| BM25 | – | 0.65 | 1.2 |
| HNSW (in-mem) | 225% (171 GB) | 0.90 | 0.04 |
| DiskANN | 250% (190 GB) | 0.90 | 0.05 |
| IVF-Disk | 225% (171 GB) | 0.90 | 0.30 |
| Edge-RAG (IVF rc) | <0.1% | 0.90 | 40.0 |
| LEANN | 3.8% | 0.90 | 1.8 |
LEANN consistently remained below 5% storage overhead and achieved over 90% recall@3 with query latency under 2 seconds across four QA benchmarks (NQ, TriviaQA, HotpotQA, GPQA). On Mac hardware, other ANN indices ran out of memory or exhibited substantially higher resource requirements. Downstream RAG accuracy with Llama-3.2-1B showed LEANN nearly identical to full ANN indices (EM and F1 within 1%), significantly outperforming BM25/PQ approaches due to superior recall (Wang et al., 9 Jun 2025).
5. Trade-offs, Limitations, and Extensions
LEANN’s core trade-off is between recomputation overhead and storage minimization. For extremely high-dimensional embeddings or very large candidate sets, latency may increase disproportionately; lighter embedding models (e.g., GTE-small) can halve query latency at a minor (<2%) accuracy cost. LEANN does not yet support online incremental updates; any dynamic data addition necessitates offline rebuilding of the full pruned HNSW graph, with peak storage roughly equal to the sum of the raw data and the embeddings. Caching high-degree node embeddings provides accelerated access but is bounded by I/O limits at high cache utilization.
Potential extensions include multi-level (hierarchical) pruning across HNSW levels, using learned edge importance metrics beyond degree-based heuristics, and deploying hybrid quantization with selective recomputation for intermediate-degree nodes (Wang et al., 9 Jun 2025).
6. Deployment Guidance and Practical Considerations
Practical deployment of LEANN involves several steps:
- Offline Construction: Build the underlying HNSW graph using standard settings (maximum degree M and construction beam width efConstruction), then prune it to enforce the storage constraint.
- On-Device Operation: Persist the CSR graph on flash or SSD, ensuring the embedding server (e.g., TensorRT-optimized) and PQ tables (approx. 2 GB) fit entirely in DRAM. The graph itself (≈3 GB) can remain on secondary storage if access time is under 10 ms.
- Tuning: Select the search beam width, the re-ranking ratio, and the batch size B through calibration to reach the desired latency and recall. Optionally, cache a small subset of high-degree node embeddings for additional latency reduction.
- Requirements: LEANN supports deployment on commodity CPUs and device-level GPUs, making it suitable for smartphones, laptops, and small form-factor workstations.
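The calibration step above can be supported by a simple cost model. The formulas below are simplifying assumptions (latency dominated by recomputation throughput plus a fixed traversal overhead), not the system's measured cost model, and all parameter values are hypothetical:

```python
def estimate(n_chunks, avg_degree, recomputed_per_query,
             throughput_chunks_per_s, traversal_overhead_s=0.1):
    """Rough index size (GB) and per-query latency (s) for given settings."""
    graph_gb = 4 * n_chunks * avg_degree / 1e9  # 4-byte CSR neighbor links
    latency_s = (recomputed_per_query / throughput_chunks_per_s
                 + traversal_overhead_s)
    return graph_gb, latency_s

# Hypothetical device profile: 60M chunks, average degree 12, 3k embeddings
# recomputed per query, 2k chunks/s embedding-server throughput.
gb, lat = estimate(60_000_000, 12, 3_000, 2_000)
print(f"graph = {gb:.1f} GB, latency = {lat:.1f} s")
# → graph = 2.9 GB, latency = 1.6 s
```

A helper like this makes it easy to check whether a chosen beam width and re-ranking ratio stay within the sub-2 s latency and sub-5% storage budgets before deploying.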
A plausible implication is that, by trading minimal on-the-fly compute for aggressive storage reduction, LEANN enables practical, privacy-preserving embedding search and RAG on devices where conventional ANN structures are infeasible, without sacrificing recall or latency relative to full-index baselines (Wang et al., 9 Jun 2025).