Results of the Big ANN: NeurIPS'23 competition

Published 25 Sep 2024 in cs.IR, cs.DS, cs.LG, and cs.PF | (2409.17424v1)

Abstract: The 2023 Big ANN Challenge, held at NeurIPS 2023, focused on advancing the state-of-the-art in indexing data structures and search algorithms for practical variants of Approximate Nearest Neighbor (ANN) search that reflect the growing complexity and diversity of workloads. Unlike prior challenges that emphasized scaling up classical ANN search [Simhadri et al., 2021], this competition addressed filtered search, out-of-distribution data, sparse and streaming variants of ANNS. Participants developed and submitted innovative solutions that were evaluated on new standard datasets with constrained computational resources. The results showcased significant improvements in search accuracy and efficiency over industry-standard baselines, with notable contributions from both academic and industrial teams. This paper summarizes the competition tracks, datasets, evaluation metrics, and the innovative approaches of the top-performing submissions, providing insights into the current advancements and future directions in the field of approximate nearest neighbor search.

Summary

The paper reports that specialized indexing structures, such as hybrid tag/Vamana graphs, achieved over 11× the baseline QPS in filtered search.
Submissions that adapted graph-based indexes and quantization techniques significantly improved performance in the OOD, sparse, and streaming tracks.
Evaluation on standardized Azure machines measured throughput at high recall, reflecting practical real-time retrieval workloads.

Analysis of the Big ANN: NeurIPS'23 Competition

The 2023 Big Approximate Nearest Neighbor (ANN) Competition, held at NeurIPS 2023, aimed to advance ANN search by evaluating solutions to complex real-world indexing and search problems. Its four tracks covered filtered search, out-of-distribution (OOD) data, sparse vector search, and streaming scenarios, which together span many of the practical challenges facing current ANN search methods.

Competition Tracks and Contributions

Filtered Search Track

The filtered search track, which requires indexing both vector embeddings and their associated keyword tags, used the YFCC 100M dataset. The top entries improved markedly on the baseline, reaching more than 11 times its queries-per-second (QPS) throughput. Notably, the winning team ParlayANN used a hybrid structure of tags and Vamana graphs for efficient filtering and vector search, demonstrating the potential of specialized indexing structures.
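The idea of combining a tag index with vector search can be illustrated with a minimal sketch. This is not ParlayANN's implementation: the graph search is replaced here by an exhaustive scan over the candidates that pass the tag filter, and all names (`build_tag_index`, `filtered_search`, the toy vectors and tags) are hypothetical.

```python
from collections import defaultdict


def build_tag_index(tags_per_item):
    """Map each tag to the set of item ids that carry it (an inverted index)."""
    index = defaultdict(set)
    for item_id, tags in enumerate(tags_per_item):
        for tag in tags:
            index[tag].add(item_id)
    return index


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def filtered_search(query, required_tags, vectors, tag_index, k):
    """Return ids of the k nearest vectors that carry every required tag."""
    if not required_tags:
        candidates = set(range(len(vectors)))
    else:
        # Intersect posting lists, starting from the shortest one.
        postings = sorted(
            (tag_index.get(t, set()) for t in required_tags), key=len
        )
        candidates = set(postings[0])
        for posting in postings[1:]:
            candidates &= posting
    # Assumption: exhaustive scan of the survivors stands in for a graph search.
    ranked = sorted(candidates, key=lambda i: (squared_l2(query, vectors[i]), i))
    return ranked[:k]


# Toy data: four 2-d vectors with keyword tags.
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
tags = [{"cat"}, {"cat", "outdoor"}, {"dog"}, {"cat", "outdoor"}]
tag_index = build_tag_index(tags)
result = filtered_search([0.9, 0.1], {"cat", "outdoor"}, vectors, tag_index, k=2)
print(result)
```

Filtering first keeps the distance computations proportional to the number of matching items, which works well for rare tags. For common tags the candidate set is large, which is one reason a real system would pair the tag index with a graph index rather than scan.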

Out-Of-Distribution (OOD) Track

Addressing the challenge of differing database and query vector distributions, this track used a subset of the Yandex visual search database. The joint winners, PyANNS and MysteryANN (RoarANN), achieved noteworthy improvements. PyANNS used quantization techniques and optimized vector graph searches, while MysteryANN focused on graph structuring and leveraging query distributions. These results suggest that adapting indexing strategies to the query distribution is effective.

Sparse Track

In the sparse vector track, participants worked with MSMARCO passage retrieval data, tackling the challenge of high-dimensional sparse data. The top-performing entries from PyANNS and GrassRMA employed innovative graph-based approaches along with quantization and optimized memory access strategies. These solutions emphasized the importance of addressing memory access efficiency and the utilization of hierarchical graphs in sparse vector scenarios.
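To make the sparse setting concrete, here is a minimal sketch of how sparse vectors can be stored and compared. It assumes inner-product similarity and a dictionary representation; neither detail is stated in this summary, and the exhaustive `top_k` scan is only a reference point, not the graph-based method the winning entries used. All names are hypothetical.

```python
def sparse_dot(a, b):
    """Inner product of two sparse vectors stored as {dimension: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b[d] for d, w in a.items() if d in b)


def top_k(query, docs, k):
    """Exhaustively score every document and return the ids of the k best."""
    scored = sorted(
        ((sparse_dot(query, doc), i) for i, doc in enumerate(docs)),
        key=lambda s: (-s[0], s[1]),
    )
    return [i for _, i in scored[:k]]


# Toy data: only nonzero dimensions are stored, so the nominal
# dimensionality (here at least 70001) does not affect memory use.
docs = [{3: 1.0, 70000: 0.5}, {3: 0.2, 12: 2.0}, {999: 1.5}]
query = {3: 1.0, 12: 1.0}
print(top_k(query, docs, k=2))
```

Because each comparison touches scattered dimensions, the cost is dominated by lookups rather than arithmetic, which is consistent with the emphasis on memory access efficiency noted above.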

Streaming Search Track

The streaming track required indexing strategies to handle dynamic datasets with frequent insertions and deletions. The initial results declared Puck as the winner; however, a later correction identified PyANNS as the leading solution. PyANNS combined DiskANN with efficient 8-bit quantization, highlighting the vital role of raw computational efficiency and adaptive graph management in streaming contexts.
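As an illustration of what 8-bit quantization involves, the following sketch applies uniform scalar quantization with a single global value range. The quantizer PyANNS actually used is not described in this summary, so the scheme and the names `fit_range`, `quantize`, and `dequantize` are assumptions made for the example.

```python
def fit_range(vectors):
    """Find the global minimum and maximum component over all vectors."""
    lo = min(min(v) for v in vectors)
    hi = max(max(v) for v in vectors)
    return lo, hi


def quantize(vector, lo, hi):
    """Map each float component to an integer code in 0..255 (one byte)."""
    scale = (hi - lo) / 255 or 1.0  # avoid division by zero for a flat range
    return bytes(min(255, max(0, round((x - lo) / scale))) for x in vector)


def dequantize(codes, lo, hi):
    """Approximately reconstruct the float components from their byte codes."""
    scale = (hi - lo) / 255 or 1.0
    return [lo + c * scale for c in codes]


vectors = [[0.0, 1.0], [0.5, 0.25]]
lo, hi = fit_range(vectors)
codes = [quantize(v, lo, hi) for v in vectors]
restored = [dequantize(c, lo, hi) for c in codes]
print(codes[0], restored[1])
```

Each component shrinks from a 4-byte float to a single byte, at the cost of a reconstruction error of at most about half a quantization step per component. Smaller vectors mean less memory traffic per distance computation, which matters when the index is also absorbing insertions and deletions.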

Evaluation and Impact

The assessment was performed using standardized Azure virtual machines, focusing on recall and throughput metrics. Submissions were evaluated on their ability to achieve a recall rate of at least 90% while maximizing query throughput. This rigorous evaluation ensured a focus on practical, constrained computational environments, reflecting real-world application needs.
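The ranking rule described above (maximize throughput subject to a recall floor) can be sketched as follows. The definition of recall used here, the fraction of true top-k neighbors found averaged over queries, is a common convention and an assumption on our part; the run records and function names are hypothetical.

```python
def recall_at_k(retrieved, ground_truth, k):
    """Average fraction of the true top-k neighbors found in the retrieved top-k."""
    total = 0.0
    for got, truth in zip(retrieved, ground_truth):
        total += len(set(got[:k]) & set(truth[:k])) / k
    return total / len(ground_truth)


def best_qualifying_run(runs, min_recall=0.9):
    """Pick the highest-QPS run whose recall meets the threshold, or None."""
    qualifying = [r for r in runs if r["recall"] >= min_recall]
    return max(qualifying, key=lambda r: r["qps"]) if qualifying else None


# Two queries, k = 3: the first misses one true neighbor, the second is exact.
retrieved = [[1, 2, 3], [4, 5, 6]]
ground_truth = [[1, 2, 9], [4, 5, 6]]
recall = recall_at_k(retrieved, ground_truth, k=3)

# Hypothetical parameter sweeps for one submission.
runs = [
    {"name": "fast", "recall": 0.85, "qps": 9000},
    {"name": "balanced", "recall": 0.91, "qps": 5000},
    {"name": "careful", "recall": 0.97, "qps": 2000},
]
winner = best_qualifying_run(runs)
print(round(recall, 3), winner["name"])
```

Under this rule the fastest configuration is disqualified for falling below the recall floor, so a submission is rewarded for the best throughput it can sustain at acceptable accuracy rather than for raw speed.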

The results showcased substantial advancements and varying methodologies from both academic and industrial teams. For instance, improvements from specialized data structures for filtered searches and graph-based solutions in OOD and sparse tracks underscore the diverse approaches driven by specific problem contexts.

Implications and Future Directions

The contributions from this competition underscore the importance of problem-specific optimizations in ANN search. Innovative indexing structures, hybrid algorithms, and efficient memory management techniques emerged as key themes. These advancements could drive further research and potential applications in fields like computer vision, NLP, and real-time data processing.

Future research might explore:

Enhancing hybrid approaches combining advantages of different indexing strategies.
Developing generalizable solutions that maintain high performance across multiple search scenarios.
Investigating further optimizations for streaming indices to handle real-time data influx.

By fostering open-source contributions and accessible resource utilization, the Big ANN Challenge has played a pivotal role in pushing forward the boundaries of ANN research. Its influence is likely to spur continued innovations and applications within the AI and data retrieval communities.