Similarity-Based Filtering Strategy
- Similarity-Based Filtering is a technique that integrates similarity search with attribute filtering to efficiently rank and retrieve large-scale data.
- The approach employs a hybrid indexing method combining centroid clustering and bitwise operations to quickly prune non-matching candidates.
- Empirical results on billion-scale datasets show high recall and second-scale query times using CPU-side optimizations and dynamic disk loading.
A similarity-based filtering strategy is any approach that exploits a formal notion of similarity to restrict, rank, or select entities—data points, users, items, features, edges—in order to efficiently retrieve, recommend, or process relevant content at scale. Such strategies are foundational in information retrieval, recommender systems, domain adaptation, graph processing, and large-scale search. Recent advances have extended classic memory-based and diffusion-based methods to support multi-dimensional filtering, adaptive weights, stochastic or spectral sparsification, domain adaptation, and ultra-large datasets with complex filtering constraints (Emanuilov et al., 23 Jan 2025).
1. Hybrid Similarity-Based Indexing for Billion-Scale Search
The IVF-Flat hybrid index extends the classical Inverted File structure to integrate similarity filtering with multi-dimensional attribute constraints. Each data point $i$ is represented by a core embedding $x_i \in \mathbb{R}^d$ and a discrete attribute vector $a_i$, forming a hybrid vector $h_i = (x_i, a_i)$, but with the two components stored separately to leverage BLAS acceleration (for embedding similarity) and bitwise ops (for attribute filtering).
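A minimal sketch of this separated layout, assuming NumPy storage; the bit-field packing and all names here are illustrative, not from the paper:

```python
import numpy as np

# Illustrative layout (assumed, not the paper's): embeddings live in one
# dense float32 matrix for BLAS kernels; discrete attributes are packed
# into unsigned integers so filters reduce to mask-and-compare operations.
rng = np.random.default_rng(0)
N, d = 1000, 64

X = rng.standard_normal((N, d)).astype(np.float32)    # core embeddings x_i
category = rng.integers(0, 256, N, dtype=np.uint32)   # bits 0-7
license_ = rng.integers(0, 2, N, dtype=np.uint32)     # bit 8
year = rng.integers(0, 40, N, dtype=np.uint32)        # bits 16+
A = category | (license_ << 8) | (year << 16)         # attribute vectors a_i

# An attribute filter touches only A, never the float matrix X.
wanted_category = 7
match = (A & np.uint32(0xFF)) == wanted_category
print(int(match.sum()), "of", N, "points pass the category filter")
```

Keeping the attributes in a separate integer array is what lets a single vectorized bitwise pass discard candidates before any floating-point work.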
Index Construction Workflow:
- Run $K$-means on $\{x_i\}_{i=1}^{N}$ to obtain centroids $c_1, \dots, c_K$.
- Assign each $x_i$ to its nearest centroid $c_k$.
- Build inverted lists $L_1, \dots, L_K$; each $L_k$ stores the $x_i$ and $a_i$ for its members.
- Centroids reside in RAM, lists on disk.
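The construction workflow above can be sketched as a toy in-memory NumPy version; `build_ivf` is an invented name, and a production index would spill the inverted lists to disk:

```python
import numpy as np

def build_ivf(X, K, iters=10, seed=0):
    """Toy IVF construction: k-means centroids plus inverted lists of
    point indices (each index i references both x_i and a_i).
    Illustrative only; a production index keeps the lists on disk."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=K, replace=False)].copy()
    assign = np.zeros(len(X), dtype=int)
    for _ in range(iters):                          # Lloyd's algorithm
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)                  # nearest centroid
        for k in range(K):
            members = X[assign == k]
            if len(members):                        # skip empty clusters
                C[k] = members.mean(axis=0)
    lists = {k: np.flatnonzero(assign == k) for k in range(K)}
    return C, lists                                 # centroids stay in RAM

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 16)).astype(np.float32)
C, lists = build_ivf(X, K=8)
```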
Query and Filtering:
- A query consists of an embedding $x_q$ and an attribute filter vector $f = (f_1, \dots, f_M)$.
- Probe only the $T$ nearest centroid lists.
- For each probed list $L_r$, load only the attribute vectors $\{a_i\}$; apply all bitwise/integer filters in RAM.
- Non-matching candidates are discarded prior to loading the embeddings for final similarity ranking.
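The query-side pruning can be sketched as follows, assuming centroids, inverted lists of indices, and packed integer attributes are held in NumPy arrays (`probe_and_filter` and the equality filter are illustrative assumptions):

```python
import numpy as np

def probe_and_filter(x_q, C, lists, A, T, attr_mask, attr_value):
    """Probe the T nearest centroid lists and apply a bitwise equality
    filter on packed attributes before any embedding is loaded.
    Names and the single-filter form are illustrative."""
    d2 = ((C - x_q) ** 2).sum(axis=1)
    probed = np.argsort(d2)[:T]                 # T nearest centroids
    survivors = []
    for r in probed:
        idx = lists[r]                          # indices in list L_r
        a_r = A[idx]                            # load only attributes first
        keep = (a_r & attr_mask) == attr_value  # bitwise filter in RAM
        survivors.extend(idx[keep].tolist())
    return survivors

# Tiny worked example: two centroids, four points, a 1-bit filter.
C = np.array([[0.0, 0.0], [10.0, 10.0]], dtype=np.float32)
lists = {0: np.array([0, 1]), 1: np.array([2, 3])}
A = np.array([0b01, 0b11, 0b01, 0b10], dtype=np.uint32)
x_q = np.array([0.5, 0.0], dtype=np.float32)
out = probe_and_filter(x_q, C, lists, A, T=1, attr_mask=0b01, attr_value=0b01)
print(out)  # → [0, 1]
```

Only the survivors' embeddings would then be read from disk for exact distance ranking.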
This scheme enables joint kNN and attribute-based (SQL-style) queries in a unified index, efficiently handling billion-scale collections while supporting arbitrarily complex, multi-dimensional filters (Emanuilov et al., 23 Jan 2025).
2. Mathematical Formulation and Hard/Soft Filtering
Let
- $x_q \in \mathbb{R}^d$ denote the query embedding,
- $f = (f_1, \dots, f_M)$ the attribute filter vector,
- $d(\cdot, \cdot)$ the base similarity metric (e.g., Euclidean distance or negative inner product).

For each attribute filter $f_j$, define the indicator

$$\phi_j(a_i) = \begin{cases} 1, & \text{if } a_i \text{ satisfies } f_j, \\ 0, & \text{otherwise}. \end{cases}$$

The composite score with hard/soft masks is

$$s(i) = d(x_q, x_i) + \sum_{j=1}^{M} \lambda_j \bigl(1 - \phi_j(a_i)\bigr), \qquad \lambda_j \ge 0.$$

Hard filtering is recovered for $\lambda_j \to \infty$: candidates with $\phi_j(a_i) = 0$ for any $j$ are discarded ($s(i) \to \infty$) (Emanuilov et al., 23 Jan 2025).
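A small numeric illustration of the composite score (all values invented; `composite_score` is not a name from the paper):

```python
import numpy as np

def composite_score(dist, phi, lam):
    """s(i) = d(x_q, x_i) + sum_j lam_j * (1 - phi_j(a_i)).
    phi is an (n, M) 0/1 indicator matrix, lam an (M,) weight vector."""
    return dist + ((1 - phi) * lam).sum(axis=1)

dist = np.array([0.2, 0.1, 0.5])          # base distances d(x_q, x_i)
phi = np.array([[1, 1],                   # candidate 0 passes both filters
                [1, 0],                   # candidate 1 fails filter 2
                [1, 1]])
lam = np.array([3.0, 3.0])                # soft penalty weights
print(composite_score(dist, phi, lam))    # → [0.2 3.1 0.5]

# Hard filtering: very large lam (the lam -> infinity limit) pushes any
# failing candidate's score beyond every passing candidate's.
hard = composite_score(dist, phi, np.array([1e9, 1e9]))
```

With soft weights the failing candidate is merely demoted (score 3.1); with the near-infinite weights it is effectively discarded.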
3. Algorithmic Pipeline: Indexing and Query
The retrieval workflow can be summarized as:
```
# Index construction
C = KMEANS(X, K)
for i in 1..N:
    k = argmin_{k'} d(x_i, C[k'])
    L_k.append((x_i, a_i))

# Query: coarse search, filter, rank
C_distances = [d(x_q, c_1), ..., d(x_q, c_K)]
R = indices of T smallest in C_distances
Cand = bounded heap(size=k)               # keeps the k nearest seen so far
for r in R:
    A_r = {a_i : i in L_r}                # load attributes only
    Mask = [AND over M filters per a_i in A_r]
    Indices = {i in L_r | Mask[i] == 1}
    X_r = {x_i : i in Indices}            # load surviving embeddings
    for i in Indices:
        dist = d(x_q, x_i)
        Cand.push((i, dist))
        if Cand.size > k:
            Cand.pop_farthest()
return Cand.sorted_by_distance()
```
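An executable rendering of this pipeline on toy data; `ivf_query` is an invented name, and the predicate argument is an illustrative simplification of the paper's bitwise filtering:

```python
import heapq
import numpy as np

def ivf_query(x_q, C, lists, X, A, T, k, pred):
    """Coarse centroid search, attribute filtering, then exact distances
    on the survivors only. `pred` is a vectorized attribute predicate."""
    probed = np.argsort(((C - x_q) ** 2).sum(axis=1))[:T]
    heap = []                                   # max-heap via negated distance
    for r in probed:
        idx = lists[r]
        idx = idx[pred(A[idx])]                 # filter before touching X
        for i in idx:
            dist = float(((X[i] - x_q) ** 2).sum())
            heapq.heappush(heap, (-dist, int(i)))
            if len(heap) > k:
                heapq.heappop(heap)             # drop the farthest candidate
    return sorted((-nd, i) for nd, i in heap)   # (distance, index) ascending

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 8)).astype(np.float32)
A = rng.integers(0, 4, size=200)                # one toy discrete attribute
C = X[:4].copy()                                # toy "centroids"
assign = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
lists = {r: np.flatnonzero(assign == r) for r in range(4)}
res = ivf_query(X[0], C, lists, X, A, T=4, k=5, pred=lambda a: a == A[0])
```

Querying with `X[0]` itself should return index 0 at distance 0 as the top hit, since it trivially passes its own attribute filter.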
4. Scalability, Sublinear Complexity, and CPU Optimization
The worst-case cost is linear in $N$ (if filters are broad or $T \approx K$), but typically, with centroid pruning ($T \ll K$), multi-dimensional filtering (cheap bitmask tests), and BLAS-accelerated distance computation, query complexity is sublinear in $N$. Disk-based on-demand loading keeps the RAM requirement to the centroids plus the currently probed lists, which remains practical even at billion-scale $N$. On standard CPUs, parallelization (e.g., multi-threaded BLAS via OpenMP) and asynchronous I/O further accelerate search and allow queries on commodity hardware (Emanuilov et al., 23 Jan 2025).
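A back-of-envelope sketch of the RAM and scan-fraction claims; every parameter value below is an assumption for illustration, not a figure from the paper:

```python
# Assumed parameters (illustrative, not the paper's configuration).
N = 1_000_000_000        # indexed points
d = 512                  # embedding dimension
K = 262_144              # centroids held in RAM (2**18)
T = 64                   # lists probed per query

centroid_ram_gb = K * d * 4 / 1e9        # float32 centroid matrix in RAM
avg_list_len = N / K                     # expected inverted-list length
scanned = T * avg_list_len               # candidates touched per query

print(f"centroid RAM  ~ {centroid_ram_gb:.2f} GB")
print(f"scanned share ~ {scanned / N:.4%} of the dataset")
```

Under these assumptions the resident centroid table is about half a gigabyte, and each query touches only $T/K$ of the collection, which is where the sublinear behavior comes from.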
5. Empirical Validation: LAION-5B Case Study
A full index of the LAION-5B images with synthetic attributes, probing the $T$ nearest of $K$ inverted lists per query, on a single Xeon E-2274G (64 GB RAM, NVMe+SATA storage) yields:
| Operation | Time (s) |
|---|---|
| Centroid coarse search | 0.008 |
| Attribute filtering | 1.090 |
| Detailed ranking on filtered candidates | 0.330 |
| Total | 1.428 |
With 12-thread BLAS, query time dropped from ~16 s to ~1.4 s. Typical filter selectivity (the fraction of vectors passing the filters) was 5–20%. Recall improves with the number of probed lists $T$: reported values rise to $0.89$ and then $0.95$ as $T$ increases.
This demonstrates kNN+filter queries at billion scale in roughly a second, with high recall and low hardware cost, all without GPU acceleration (Emanuilov et al., 23 Jan 2025).
6. Summary and Practical Implications
The advanced similarity-based filtering strategy implemented via hybrid IVF-Flat indexing introduces a scalable, unified retrieval model for billion-point data, answering joint kNN and multi-dimensional attribute queries in sublinear time. Key innovations include physical separation of embedding and attribute vectors, efficient bitwise filtering, disk-resident inverted lists with dynamic loading, and full CPU-side optimization. The performance validates its suitability for real-world billion-scale datasets.
This approach generalizes to any context where similarity search must be tightly coupled to arbitrary discrete filtering—offering practical kNN+SQL filtering pipelines for generic high-dimensional retrieval tasks (Emanuilov et al., 23 Jan 2025).