BBC: Improving Large-k Approximate Nearest Neighbor Search with a Bucket-based Result Collector

Published 2 Apr 2026 in cs.DB and cs.DS | (2604.01960v1)

Abstract: Although Approximate Nearest Neighbor (ANN) search has been extensively studied, large-k ANN queries that aim to retrieve a large number of nearest neighbors remain underexplored, despite their numerous real-world applications. Existing ANN methods face significant performance degradation for such queries. In this work, we first investigate the reasons for the performance degradation of quantization-based ANN indexes: (1) the inefficiency of existing top-k collectors, which incurs significant overhead in candidate maintenance, and (2) the reduced pruning effectiveness of quantization methods, which leads to a costly re-ranking process. To address this, we propose a novel bucket-based result collector (BBC) to enhance the efficiency of existing quantization-based ANN indexes for large-k ANN queries. BBC introduces two key components: (1) a bucket-based result buffer that organizes candidates into buckets by their distances to the query. This design reduces ranking costs and improves cache efficiency, enabling high performance maintenance of a candidate superset and a lightweight final selection of top-k results. (2) two re-ranking algorithms tailored for different types of quantization methods, which accelerate their re-ranking process by reducing either the number of candidate objects to be re-ranked or cache misses. Extensive experiments on real-world datasets demonstrate that BBC accelerates existing quantization-based ANN methods by up to 3.8x at recall@k = 0.95 for large-k ANN queries.


Authors (5)

Summary

The paper introduces BBC, a cache-efficient bucket-based result collector that enhances throughput by 1.4×–3.8× for large-k ANN searches.
It addresses heap inefficiencies and degraded quantization pruning by using specialized re-ranking algorithms for bounded and unbounded methods.
Empirical and theoretical results confirm minimal quantization error and maintained recall, supporting scalable vector search in large datasets.

BBC: A Formal Analysis of Bucket-based Large-$k$ ANN Result Collection

Problem Statement and Motivation

The paper "BBC: Improving Large-k Approximate Nearest Neighbor Search with a Bucket-based Result Collector" (2604.01960) addresses the efficiency challenges of large-$k$ approximate nearest neighbor (ANN) search over high-dimensional vector datasets. Large-$k$ queries ($k \geq 5{,}000$) arise in critical applications such as candidate recall for recommendation, large-scale rerank pipelines in retrieval-augmented generation, and dataset construction in ML model training. While ANN search is well-studied for standard (small) $k$ ($k<1000$), existing systems exhibit severe throughput degradation for large $k$ due to two dominant factors: (1) priority queue-based top-$k$ collectors suffering from cache inefficiency as $k$ grows, and (2) declining pruning efficiency in quantization-based ANN, leading to growing re-ranking costs.

Empirical evaluation across several methods (IVF, IVF+PQ, IVF+RaBitQ, HNSW) reveals a 4.8$\times$–5.7$\times$ drop in throughput when $k$ increases from 100 to 5,000 at high recall levels (see Figure 1).

Figure 1: Querying performance of IVF, HNSW, IVF+PQ, and IVF+RaBitQ on the C4 dataset at $k=100$ and $k=5{,}000$.

The motivation for focusing on quantization-based methods stems from their superior resilience to elevated $k$ compared with graph-based algorithms (e.g., HNSW), whose design is inherently tuned for small-$k$ retrieval via local neighborhood traversal.

Analysis of Bottlenecks

The performance bottlenecks for large-$k$ ANN are twofold:

Top-$k$ Collector Inefficiency: Classic binary heap implementations for collecting the $k$ nearest results incur $O(\log k)$ complexity per insertion and, crucially, increasingly frequent L1 cache misses once the heap's array of (distance, id) pairs exceeds L1 capacity (approximately 32 KB). This effect is pronounced for large $k$, where L1 miss rates soar and heap maintenance can dominate runtime (in IVF+RaBitQ, its share grows from 2% to 23% as $k$ scales).
Pruning Degradation: Quantization methods, both bounded (such as RaBitQ) and unbounded (such as PQ), require larger candidate pools as $k$ grows. For bounded variants, the overlap between bound intervals and the threshold widens, so expensive re-ranking is triggered more often. Unbounded methods must scale their candidate pools linearly with $k$ to achieve the target recall.

Figure 2: Time overhead breakdown at varying $k$, showing the increasing dominance of heap maintenance and distance computation as $k$ increases.
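
To make the footprint argument concrete, the following is a minimal Python sketch of a conventional heap-based collector, not code from the paper. The class name `HeapTopK`, the helper `heap_bytes`, and the 8-byte entry size (a 4-byte distance plus a 4-byte id) are assumptions made for illustration.

```python
import heapq


class HeapTopK:
    """Bounded max-heap that keeps the k smallest (distance, id) pairs."""

    def __init__(self, k):
        self.k = k
        # Distances are negated so the worst kept result sits at index 0.
        self._heap = []

    def threshold(self):
        # Current k-th smallest distance, or infinity while the heap is not full.
        return -self._heap[0][0] if len(self._heap) == self.k else float("inf")

    def push(self, distance, obj_id):
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (-distance, obj_id))     # O(log k)
        elif distance < -self._heap[0][0]:
            heapq.heapreplace(self._heap, (-distance, obj_id))  # O(log k)

    def results(self):
        return sorted((-neg_d, obj_id) for neg_d, obj_id in self._heap)


def heap_bytes(k, bytes_per_entry=8):
    # Assumption: one 4-byte float distance and one 4-byte integer id per entry.
    return k * bytes_per_entry


collector = HeapTopK(k=3)
for obj_id, dist in enumerate([0.9, 0.2, 0.7, 0.1, 0.5]):
    collector.push(dist, obj_id)
top3 = collector.results()

L1_BYTES = 32 * 1024
fits_small_k = heap_bytes(100) <= L1_BYTES     # 800 bytes
fits_large_k = heap_bytes(5_000) <= L1_BYTES   # 40,000 bytes
```

Under these assumptions the heap array for $k=5{,}000$ no longer fits in a 32 KB L1 cache, and because each sift touches entries scattered across the array, insertions start to miss in cache.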

BBC: The Bucket-based Result Collector

The paper introduces BBC, a cache-efficient, bucket-quantized result collector designed to replace conventional heap-based top-$k$ collectors in the ANN pipeline (IVF+PQ, IVF+RaBitQ). BBC comprises:

Bucket-based Result Buffer: The candidate distance range is partitioned via one-dimensional quantization into a fixed number of buckets, each storing candidates in linear buffers (IDs/distances). Candidates are ordered only across buckets (a coarse order), not within them. Sequential insertion enables hardware prefetching and greatly reduces L1 cache miss rates. Updates to the threshold bucket are amortized and localized.
Specialized Re-ranking Algorithms: Distinct mechanisms for bounded (RaBitQ-style) and unbounded (PQ-style) quantizers:
- For bounded quantization, the algorithm greedily skips re-ranking for candidates provably out/in (via bound-threshold comparison), re-ranking only objects near the threshold, minimizing redundant exact distance computation.
- For unbounded quantization, an early re-ranking mechanism computes exact distances for predicted in-boundary candidates right after estimated distances are computed, exploiting linear buffer layout for optimal cache use.

Figure 3: Probability density of query-data distances in C4, showing the concentration that motivates bucketization.
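
The bucket-based buffer described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the class `BucketCollector`, its fields, and the rule for moving the threshold bucket are hypothetical, and the bucket boundaries are taken as given.

```python
import bisect
import heapq
import random


class BucketCollector:
    """Keeps a superset of the top-k, ordered only across buckets."""

    def __init__(self, k, boundaries):
        self.k = k
        self.boundaries = boundaries          # ascending cut points (m - 1 of them)
        self.buckets = [[] for _ in range(len(boundaries) + 1)]
        self.t = len(self.buckets) - 1        # index of the threshold bucket
        self.kept = 0                         # candidates held in buckets 0..t

    def push(self, distance, obj_id):
        b = bisect.bisect_right(self.boundaries, distance)
        if b > self.t:
            return                            # beyond the relaxed threshold
        self.buckets[b].append((distance, obj_id))  # plain append, no reordering
        self.kept += 1
        # Discard the threshold bucket once the buckets below it hold k candidates.
        while self.t > 0 and self.kept - len(self.buckets[self.t]) >= self.k:
            self.kept -= len(self.buckets[self.t])
            self.buckets[self.t].clear()
            self.t -= 1

    def results(self):
        # Everything below the threshold bucket is in the top-k;
        # only the threshold bucket needs a real selection step.
        below = [c for bucket in self.buckets[:self.t] for c in bucket]
        need = self.k - len(below)
        return sorted(below + heapq.nsmallest(need, self.buckets[self.t]))


rng = random.Random(7)
stream = [(rng.random(), i) for i in range(2_000)]
collector = BucketCollector(k=50, boundaries=[j / 16 for j in range(1, 16)])
for dist, obj_id in stream:
    collector.push(dist, obj_id)
bucket_top = collector.results()
exact_top = sorted(stream)[:50]
```

In this sketch each insertion is a boundary lookup plus an append, and the only ranking work left at the end is a selection inside the single threshold bucket.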

Quantitative Performance and Empirical Validation

BBC’s design is motivated by the distance concentration observed in high-dimensional settings: bucket partitioning using equal-depth quantization tightly bounds the error between the relaxed (bucket) threshold and the true threshold. Theoretical bounds show that the resulting quantization error is negligible in the regimes the paper targets.

Empirical results validate substantial gains:

Speedup: BBC-integrated systems (IVF+PQ+BBC and IVF+RaBitQ+BBC) yield 1.4$\times$–3.8$\times$ throughput increases at recall@$k$ = 0.95, with the advantage expanding as $k$ grows, especially on large datasets (Figure 4).
Collector Overhead: BBC reduces top-$k$ collection time by up to an order of magnitude versus heaps or sorted-buffer alternatives, halving L1 cache misses (Figure 5, Table 1).
Re-ranking Efficiency: BBC’s greedy and early re-ranking algorithms reduce both the number of objects needing exact evaluation and the per-object access cost, accelerating re-ranking by up to 1.8$\times$.
Accuracy Guarantees: Theoretical and empirical studies show that bucket thresholds deviate only slightly from the true threshold (Figure 6), preserving accuracy and strictly controlling recall loss (Figure 7).
Small-$k$ Regime: BBC does not degrade performance for small $k$, matching or slightly outperforming heap-based approaches due to batch and prefetch advantages (Figure 8).

Figure 4: Accuracy-efficiency trade-off for varying $k$. BBC consistently outperforms baselines, especially as $k$ increases.
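
The bound-threshold comparison behind the re-ranking for bounded quantizers can be illustrated with a short sketch. This is one generic way to exploit distance bounds, written under assumptions, and is not the paper's algorithm: the function `rerank_with_bounds`, the candidate tuple layout, and the `exact_distance` callback are hypothetical, and each candidate is assumed to carry a valid interval `lower <= true distance <= upper`.

```python
def rerank_with_bounds(candidates, k, exact_distance):
    """candidates: list of (obj_id, lower, upper).

    Returns (sorted top-k ids, number of exact distance computations).
    """
    if len(candidates) <= k:
        return sorted(c[0] for c in candidates), 0

    uppers = sorted(upper for _, _, upper in candidates)
    lowers = sorted(lower for _, lower, _ in candidates)
    kth_upper = uppers[k - 1]   # k candidates certainly lie within this distance
    next_lower = lowers[k]      # (k+1)-th smallest lower bound

    sure_in, uncertain = [], []
    for obj_id, lower, upper in candidates:
        if lower > kth_upper:
            continue                    # provably outside the top-k: skip
        if upper < next_lower:
            sure_in.append(obj_id)      # provably inside the top-k: skip
        else:
            uncertain.append(obj_id)    # near the threshold: re-rank exactly

    scored = sorted((exact_distance(obj_id), obj_id) for obj_id in uncertain)
    chosen = [obj_id for _, obj_id in scored[: k - len(sure_in)]]
    return sorted(sure_in + chosen), len(uncertain)


# Toy data: true distances with a symmetric error bound of 0.05 (an assumption).
true_dist = {0: 0.10, 1: 0.20, 2: 0.48, 3: 0.52, 4: 0.90, 5: 0.95}
cands = [(i, d - 0.05, d + 0.05) for i, d in true_dist.items()]
top_ids, num_exact = rerank_with_bounds(cands, k=3, exact_distance=true_dist.get)
```

In the toy data only the two candidates whose intervals straddle the threshold need an exact distance; the other four are decided from their bounds alone.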

System Design and Implementation Choices

The number of buckets is chosen analytically from the L1 cache capacity, the quantization code footprint, and the lookup table requirements. The paper provides an explicit formula balancing these constraints.

Bucket partitioning uses equal-depth quantization computed from sampled distances, so the boundaries adapt to each query's local distance distribution, which is essential for high-dimensional performance.
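
As a rough illustration of equal-depth partitioning from a sample, the sketch below derives bucket boundaries as sample quantiles using only the Python standard library. The function names, the sample size of 500, the eight buckets, and the Gaussian stand-in for concentrated distances are all assumptions made for the example, not details from the paper.

```python
import bisect
import random
import statistics


def equal_depth_boundaries(sample_distances, num_buckets):
    # statistics.quantiles returns num_buckets - 1 cut points that split
    # the sample into groups of roughly equal size.
    return statistics.quantiles(sample_distances, n=num_buckets, method="inclusive")


def bucket_of(distance, boundaries):
    # Index of the bucket whose range contains the distance.
    return bisect.bisect_right(boundaries, distance)


rng = random.Random(0)
# Synthetic, tightly concentrated distances standing in for real query-data distances.
distances = [rng.gauss(1.0, 0.05) for _ in range(10_000)]
sample = rng.sample(distances, 500)

boundaries = equal_depth_boundaries(sample, num_buckets=8)
counts = [0] * 8
for d in distances:
    counts[bucket_of(d, boundaries)] += 1
```

Because the cut points follow the sampled distribution, the buckets are narrow where distances concentrate and wide in the tails, which keeps their occupancy roughly balanced; equal-width buckets over the same range would leave most candidates in a few central buckets.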

SIMD acceleration is used for code and threshold computation, further improving throughput.

Theoretical and Practical Implications

Theoretical Implications

BBC establishes the first formal result-collection framework explicitly optimized for large-$k$ ANN, revealing the sharp interplay between memory hierarchy, quantization error, and algorithmic pipeline design.
The results quantify a critical performance transition point in $k$ where cache-aware design is mandatory, influencing the broader vector search literature on collector design.
The separation of re-ranking logic by quantizer-type (bounded vs. unbounded) provides a template for future indexer design (particularly hybrid systems leveraging both).

Practical Implications

BBC is compatible with all similarity metrics that admit an ordering (Euclidean, inner-product, cosine), making it plug-and-play for prevailing quantization-based indexes (e.g., IVF, PQ, RaBitQ) and modern vector DBMSs.
The memory footprint of BBC is negligible compared to dataset scales, making it suitable for deployment in both memory- and disk-bound environments.
Key applications include recommendation candidate generation at scale, multi-stage rerank pipelines for LLM-backed RAG, and iterative data mining over billion-scale datasets.

Future Directions

Graph-based ANN Integration: Extending BBC’s bucketed collection strategies to graph-based ANN methods, which currently degrade dramatically on large-$k$ queries due to unscalable heap dependence.
GPU/Batch Processing: Adapting bucket partitioning and cache-centric logic for high-throughput GPU pipelines with warp-level collectors and multi-query batch collation.
Algorithmic Generalization: Exploring variable-sized buckets, dynamic binning, or adaptive quantization codebook construction per query for further improvements.

Conclusion

The BBC framework, comprising a bucketed result collector and quantization-aware re-ranking strategies, resolves fundamental bottlenecks in large-$k$ ANN search. Through a combination of cache-aware data layout, quantization-driven bucket partitioning, and specialized re-ranking algorithms, BBC achieves up to a 3.8$\times$ performance increase without a recall tradeoff, maintains or improves performance for small $k$, and provides a formal methodology for collector design in scale-adaptive vector search infrastructure.

The approach stands as a new benchmark for scalable ANN design, with strong relevance for high-throughput AI workloads and modern vector-oriented data systems.
