LEANN: A Low-Storage Vector Index

Published 9 Jun 2025 in cs.DB and cs.LG | (2506.08276v1)

Abstract: Embedding-based search is widely used in applications such as recommendation and retrieval-augmented generation (RAG). Recently, there is a growing demand to support these capabilities over personal data stored locally on devices. However, maintaining the necessary data structure associated with the embedding-based search is often infeasible due to its high storage overhead. For example, indexing 100 GB of raw data requires 150 to 700 GB of storage, making local deployment impractical. Reducing this overhead while maintaining search quality and latency becomes a critical challenge. In this paper, we present LEANN, a storage-efficient approximate nearest neighbor (ANN) search index optimized for resource-constrained personal devices. LEANN combines a compact graph-based structure with an efficient on-the-fly recomputation strategy to enable fast and accurate retrieval with minimal storage overhead. Our evaluation shows that LEANN reduces index size to under 5% of the original raw data, achieving up to 50 times smaller storage than standard indexes, while maintaining 90% top-3 recall in under 2 seconds on real-world question answering benchmarks.


Summary

The paper presents a novel storage-efficient vector index that recomputes embeddings on-the-fly using a graph-based approach.
It uses a high-degree preserving graph pruning algorithm to maintain search accuracy while reducing the index size to under 5% of the original raw data.
Dynamic batching optimizes GPU usage, ensuring query latency remains under 2 seconds with 90% top-3 recall performance.


Introduction

The paper "LEANN: A Low-Storage Vector Index" presents a storage-efficient approximate nearest neighbor (ANN) search index optimized for resource-constrained personal devices. Demand for embedding-based search over locally stored personal data is growing, but conventional indexes carry prohibitive storage overhead: according to the abstract, indexing 100 GB of raw data requires 150 to 700 GB of storage. LEANN addresses this by shrinking the index while maintaining retrieval quality and keeping latency low.

System Design and Methodology

Graph-Based Structure and Recomputation Strategy

LEANN is designed to optimize both storage and computational efficiency. At its core, it employs a graph-based index structure inspired by the Hierarchical Navigable Small World (HNSW) model but introduces significant modifications:

Graph-Based Recomputation: LEANN stores only the graph metadata, not the embeddings. Embeddings are recomputed on the fly at query time, which minimizes storage. To keep the recomputation cost down, a two-level traversal algorithm interleaves approximate and exact distance computations and reserves exact recomputation for the most promising candidates.
Figure 1: Best-First Search in graph-based index.
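The recomputation idea can be illustrated with a minimal Python sketch of best-first search in which every distance evaluation calls an encoder instead of reading a stored vector. This is not the paper's implementation: `embed` is a hypothetical stand-in for an embedding model, the graph is a plain adjacency dict, and the two-level approximate/exact filtering is omitted for brevity.

```python
import heapq
import numpy as np

def embed(node_id, dim=8):
    """Hypothetical encoder: a deterministic unit vector per node id."""
    v = np.random.default_rng(node_id).standard_normal(dim)
    return v / np.linalg.norm(v)

def best_first_search(graph, entry, query, k=3, ef=10, embed_fn=embed):
    """Best-first graph search that recomputes embeddings on demand.

    graph maps node id -> list of neighbor ids; no vectors are stored.
    Returns (top-k node ids, number of embeddings recomputed).
    """
    recomputed = 0

    def dist(node):
        nonlocal recomputed
        recomputed += 1  # each call stands for one encoder invocation
        return float(np.linalg.norm(embed_fn(node) - query))

    d0 = dist(entry)
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap: closest unexpanded node first
    results = [(-d0, entry)]     # max-heap (negated): best ef nodes so far
    while candidates:
        d, node = heapq.heappop(candidates)
        # Stop once the closest candidate is worse than the worst kept result.
        if len(results) >= ef and d > -results[0][0]:
            break
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(nb)
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)
    top = sorted((-d, n) for d, n in results)[:k]
    return [n for _, n in top], recomputed
```

The `recomputed` counter makes the trade-off visible: storage drops to the adjacency lists alone, and the query pays one encoder call per visited node, which is the cost the two-level traversal and batching are meant to reduce.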

High-Degree Preserving Graph Pruning

To keep the graph itself small, LEANN applies a high-degree preserving graph pruning technique. It preserves the connectivity of high-degree nodes, which are critical to search performance, while pruning redundant low-utility edges elsewhere in the graph:

Graph Pruning Algorithm: The algorithm ranks nodes by degree and lets the critical high-degree nodes keep more connections than the rest. This targeted pruning reduces the storage footprint without compromising search accuracy.
Figure 2: LEANN System Diagram. The system combines high-degree preserving graph pruning for minimal storage footprint with graph-based recomputation and two-level search with dynamic batching for efficient query processing (Steps 1-4).
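A simplified Python sketch of degree-aware pruning is shown below. It is an illustration of the general idea only, not the paper's algorithm: it assumes each neighbor list is already ordered nearest-first, and the parameters `hub_fraction`, `hub_degree`, and `low_degree` are hypothetical names chosen for this example.

```python
def prune_preserving_hubs(graph, hub_fraction=0.02, hub_degree=8, low_degree=2):
    """Truncate neighbor lists, giving high-degree nodes a larger edge budget.

    graph maps node id -> list of neighbor ids, assumed sorted nearest-first.
    Returns (pruned graph, set of nodes treated as hubs).
    """
    # Rank nodes by their current out-degree, highest first.
    by_degree = sorted(graph, key=lambda n: len(graph[n]), reverse=True)
    n_hubs = max(1, int(len(graph) * hub_fraction))
    hubs = set(by_degree[:n_hubs])

    pruned = {}
    for node, neighbors in graph.items():
        limit = hub_degree if node in hubs else low_degree
        pruned[node] = list(neighbors[:limit])  # keep only the nearest edges
    return pruned, hubs
```

Because most nodes fall under the small `low_degree` budget, the total edge count (and hence the index size) shrinks roughly in proportion, while the few hub nodes retain the connectivity that searches rely on.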

Optimizations for Latency

Dynamic Batching Mechanism

LEANN uses dynamic batching to cut recomputation latency during query processing. Instead of embedding nodes one at a time, it groups recomputation tasks so that the GPU processes them together:

Batch Execution Strategy: The system collects and processes node embeddings in batches, optimizing GPU utilization and reducing the overhead associated with per-node calculations.
Figure 3: Node access probability per query.
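The batching idea can be sketched in Python as a small queue that accumulates node ids across several graph expansions and embeds them in a single call. This is a minimal sketch under assumptions: `embed_batch` is a hypothetical stand-in for one batched model forward pass, and `BatchedRecomputer` and its `batch_size` parameter are illustrative names, not part of LEANN's code.

```python
import numpy as np

def embed_batch(node_ids, dim=8):
    """Hypothetical batched encoder; a real system would run one forward pass."""
    return np.stack(
        [np.random.default_rng(i).standard_normal(dim) for i in node_ids]
    )

class BatchedRecomputer:
    """Collects nodes that need embeddings and recomputes them together."""

    def __init__(self, embed_batch_fn, batch_size=4):
        self.embed_batch_fn = embed_batch_fn
        self.batch_size = batch_size
        self.pending = []
        self.calls = 0  # number of batched encoder invocations

    def submit(self, node_ids):
        """Queue nodes; return True once a full batch is waiting."""
        for n in node_ids:
            if n not in self.pending:
                self.pending.append(n)
        return len(self.pending) >= self.batch_size

    def flush(self, query):
        """Embed all pending nodes in one call; return {node: distance}."""
        if not self.pending:
            return {}
        ids, self.pending = self.pending, []
        vectors = self.embed_batch_fn(ids)
        self.calls += 1
        distances = np.linalg.norm(vectors - query, axis=1)
        return dict(zip(ids, distances.tolist()))
```

A search loop would call `submit` with the unvisited neighbors of each expanded node and call `flush` only when it returns `True` (or when no candidates remain), trading slightly staler distance information for fewer, larger encoder calls.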

Performance Evaluation

On real-world question answering benchmarks, the evaluation reports the following results:

Storage Efficiency: LEANN reduces index size to under 5% of the original raw data, up to 50 times smaller than standard indexes.
Accuracy and Speed: Despite the reduced storage, LEANN maintains 90% top-3 recall with query latency under 2 seconds.
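A back-of-the-envelope calculation shows why dropping stored embeddings dominates the savings. The numbers below (one million vectors, 768 dimensions, an average degree of 32, 4-byte floats and ids) are illustrative assumptions, not figures from the paper.

```python
def index_bytes(num_vectors, dim, avg_degree, store_embeddings,
                bytes_per_float=4, bytes_per_id=4):
    """Rough index size: adjacency lists plus, optionally, raw embeddings."""
    graph = num_vectors * avg_degree * bytes_per_id
    embeddings = num_vectors * dim * bytes_per_float if store_embeddings else 0
    return graph + embeddings

# Hypothetical configuration for illustration only.
full = index_bytes(1_000_000, dim=768, avg_degree=32, store_embeddings=True)
graph_only = index_bytes(1_000_000, dim=768, avg_degree=32, store_embeddings=False)
ratio = full / graph_only
```

Under these assumptions each embedding costs 3,072 bytes against 128 bytes of edges per node, so the graph-only index is 25 times smaller before any pruning; pruning then lowers `avg_degree` and shrinks the graph term further.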

Figure 4: A10.

Ablation Studies

Detailed ablation experiments highlight the contribution of each component to LEANN's performance, particularly the impact of the graph pruning strategy and dynamic batching on latency and storage requirements.

Figure 5: [Main Result]: Comparison of Exact Match and F1 scores for downstream RAG tasks across three methods: keyword search (BM25), PQ-compressed vector search, and our proposed vector search system. Our method is configured to achieve a target recall of 90%, while the PQ baseline is given extended search time to reach its highest possible recall. Here we use Llama-3.2-1B as the generation model.

Conclusion

LEANN addresses the problem of storage-efficient vector search on personal devices. By combining a compact graph structure, on-the-fly recomputation, and a tailored traversal algorithm, it lowers storage overhead substantially while retaining high retrieval quality. This makes embedding-based search practical on edge devices, where storage and compute are constrained. Future work could extend these methods to other graph-based indexes, reduce latency further, and examine how the techniques behave in distributed settings.
