Similarity-Based Filtering Strategy

Updated 11 January 2026
  • Similarity-Based Filtering is a technique that integrates similarity search with attribute filtering to efficiently rank and retrieve large-scale data.
  • The approach employs a hybrid indexing method combining centroid clustering and bitwise operations to quickly prune non-matching candidates.
  • Empirical results on billion-scale datasets show high recall and sub-second query times using CPU-side optimizations and dynamic disk loading.

A similarity-based filtering strategy is any approach that exploits a formal notion of similarity to restrict, rank, or select entities—data points, users, items, features, edges—in order to efficiently retrieve, recommend, or process relevant content at scale. Such strategies are foundational in information retrieval, recommender systems, domain adaptation, graph processing, and large-scale search. Recent advances have extended classic memory-based and diffusion-based methods to support multi-dimensional filtering, adaptive weights, stochastic or spectral sparsification, domain adaptation, and ultra-large datasets with complex filtering constraints (Emanuilov et al., 23 Jan 2025).

1. Hybrid IVF-Flat Index

The IVF-Flat hybrid index extends the classical Inverted File structure to integrate similarity filtering with multi-dimensional attribute constraints. Each data point $i$ is represented by a core embedding $x_i \in \mathbb{R}^D$ and a discrete attribute vector $a_i \in \mathbb{Z}^M$, forming a hybrid vector $h_i = [x_i \parallel a_i]$, with $x_i$ and $a_i$ stored separately to leverage BLAS acceleration (for embedding similarity) and bitwise operations (for attribute filtering).

Index Construction Workflow:

  1. Run $K$-means on $\{x_1, \dots, x_N\}$ to obtain centroids $\{c_1, \dots, c_K\}$.
  2. Assign each $x_i$ to its nearest $c_k$.
  3. Build $K$ inverted lists $L_1, \dots, L_K$; each $L_k$ stores the $x_i$ and $a_i$ of its members.
  4. Centroids reside in RAM, lists on disk.
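
The construction steps above can be sketched in a few lines of numpy; the K-means loop, dataset sizes, and list layout here are illustrative, not the paper's implementation:

```python
import numpy as np

def build_ivf_index(X, A, K, iters=10, seed=0):
    """Cluster embeddings with K-means, then bucket (x_i, a_i) into inverted lists."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=K, replace=False)]        # initial centroids
    for _ in range(iters):
        # assign each point to its nearest centroid (squared Euclidean)
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for k in range(K):                                   # recompute centroids
            members = X[assign == k]
            if len(members):
                C[k] = members.mean(axis=0)
    lists = {k: [] for k in range(K)}
    for i, k in enumerate(assign):
        lists[k].append((X[i], A[i]))                        # x_i and a_i stored per list
    return C, lists

# tiny demo with synthetic data (N=200, D=8, M=3)
X = np.random.default_rng(1).normal(size=(200, 8)).astype(np.float32)
A = np.random.default_rng(2).integers(0, 4, size=(200, 3))
C, lists = build_ivf_index(X, A, K=4)
```

In the paper's setting the centroids `C` stay in RAM while each list is a disk-resident block; the dict here stands in for that storage layer.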

Query and Filtering:

  • A query consists of an embedding $x_q$ and an attribute filter vector $a_q$.
  • Probe only the $T \ll K$ nearest centroid lists.
  • For each probed $L_k$, load only the $a_i$; apply all $M$ bitwise/integer filters in RAM.
  • Non-matching candidates are discarded before loading $x_i$ for final similarity ranking.

This scheme enables joint $k$NN and attribute-based (SQL-style) queries in a unified index, efficiently handling billion-scale collections while supporting arbitrarily complex, multi-dimensional filters (Emanuilov et al., 23 Jan 2025).
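
The attribute-filtering step reduces to vectorized boolean masks over the loaded attribute block. A minimal sketch with hypothetical data ($M = 3$ attributes, two of them constrained):

```python
import numpy as np

# Attribute block for one probed inverted list (one row per member, M = 3).
A_r = np.array([[1, 0, 2],
                [1, 1, 2],
                [0, 1, 2],
                [1, 1, 3]])

# Query filter: attribute 0 must equal 1 AND attribute 2 must equal 2
# (attribute 1 unconstrained); per-attribute masks are ANDed together.
mask = (A_r[:, 0] == 1) & (A_r[:, 2] == 2)
surviving = np.flatnonzero(mask)      # only these rows get their x_i loaded
print(surviving.tolist())             # → [0, 1]
```

Because the masks operate on small integer arrays already in RAM, the expensive embedding loads and distance computations are deferred to the surviving rows only.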

2. Mathematical Formulation and Hard/Soft Filtering

Let

  • $X = \{x_1, \dots, x_N\}$, $x_i \in \mathbb{R}^D$ (embeddings)
  • $A = \{a_1, \dots, a_N\}$, $a_i \in \mathbb{Z}^M$ (attribute vectors)
  • $d(\cdot, \cdot)$ the base similarity metric (e.g., Euclidean distance or negative inner product).

For each attribute filter $j = 1, \dots, M$, define the indicator $f_j(a_i)$:

$$f_j(a_i) = \begin{cases} 1, & \text{if } a_{i,j} \text{ matches the filter} \\ 0, & \text{otherwise} \end{cases}$$

The composite score with hard/soft masks is

$$S(x_q, y) = \alpha\, d(x_q, y) + \sum_{j=1}^{M} \beta_j \left[ 1 - f_j(a_y) \right]$$

Hard filtering is recovered for $\beta_j \to \infty$: candidates with $f_j = 0$ for any $j$ are discarded ($S = \infty$) (Emanuilov et al., 23 Jan 2025).
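
A small numerical sketch of the composite score, assuming a Euclidean base metric, $\alpha = 1$, and a uniform $\beta$ (the predicates and values are made up for illustration):

```python
import numpy as np

def composite_score(xq, y, ay, filters, alpha=1.0, beta=10.0):
    """S(x_q, y) = alpha * d(x_q, y) + sum_j beta * (1 - f_j(a_y))."""
    d = np.linalg.norm(xq - y)                              # base metric d(x_q, y)
    penalty = sum(beta * (1 - int(f(ay[j]))) for j, f in enumerate(filters))
    return alpha * d + penalty

xq, y = np.zeros(4), np.ones(4)
filters = [lambda v: v == 1, lambda v: v >= 2]   # hypothetical predicates f_1, f_2
s_pass = composite_score(xq, y, ay=np.array([1, 5]), filters=filters)   # 2.0
s_fail = composite_score(xq, y, ay=np.array([0, 5]), filters=filters)   # 2.0 + 10.0
```

As `beta` grows, failing candidates fall arbitrarily far down the ranking, which recovers hard filtering in the $\beta_j \to \infty$ limit described above.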

3. Algorithmic Pipeline: Indexing and Query

The retrieval workflow can be summarized as:

# Index construction
C = KMEANS(X, K)
for i in 1..N:
    k = argmin_{k'} d(x_i, C[k'])
    L_k.append((x_i, a_i))

# Query: coarse probe, attribute filtering, exact ranking
C_distances = [d(x_q, c_1), ..., d(x_q, c_K)]
R = indices of T smallest in C_distances
Cand = min-heap(size=k)
for r in R:
    A_r = {a_i : i in L_r}              # load attribute block only
    Mask = [AND over M filters per a_i in A_r]
    Indices = {i in L_r | Mask[i] == 1}
    X_r = {x_i : i in Indices}          # load surviving embeddings only
    for i in Indices:
        dist = d(x_q, x_i)
        Cand.push((i, dist))
        if Cand.size > k: Cand.pop_farthest()
return Cand.sorted_by_distance()

This structure ensures only the relevant attribute blocks are loaded and searched, maximizing locality and minimizing IO/memory footprint (Emanuilov et al., 23 Jan 2025).
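
A minimal runnable version of the query half of this pipeline, with the disk-resident lists replaced by an in-memory dict and all names illustrative:

```python
import heapq
import numpy as np

def ivf_filtered_query(xq, filters, C, lists, T=2, k=3):
    """Probe the T nearest lists, filter on attributes, then rank survivors by L2."""
    probe = np.argsort(np.linalg.norm(C - xq, axis=1))[:T]   # coarse centroid search
    heap = []                                 # max-heap of k best via negated distances
    for r in probe:
        for i, (xi, ai) in enumerate(lists[r]):
            if not all(f(ai) for f in filters):              # attribute mask first
                continue
            dist = float(np.linalg.norm(xq - xi))            # only survivors ranked
            heapq.heappush(heap, (-dist, (r, i)))
            if len(heap) > k:
                heapq.heappop(heap)                          # drop current farthest
    return sorted((-negd, idx) for negd, idx in heap)        # nearest first

# toy index: two centroids, three points, one binary attribute
C = np.array([[0.0, 0.0], [10.0, 10.0]])
lists = {0: [(np.array([0.1, 0.0]), np.array([1])),
             (np.array([0.2, 0.0]), np.array([0]))],
         1: [(np.array([9.9, 10.0]), np.array([1]))]}
hits = ivf_filtered_query(np.zeros(2), [lambda a: a[0] == 1], C, lists, T=2, k=2)
# nearest surviving point is list 0, entry 0; entry 1 is removed by the filter
```

The bounded heap keeps the candidate set at size $k$ regardless of how many points survive filtering, mirroring the `Cand` structure in the pseudocode.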

4. Scalability, Sublinear Complexity, and CPU Optimization

The worst-case cost is $O(N)$ (if filters are broad or $T \approx K$), but typically, with centroid pruning ($T \ll K$), multi-dimensional filtering (bitmask tests), and BLAS-accelerated distance computation, query complexity is sublinear in $N$. Disk-based on-demand loading yields an $O(TV(D + M))$ RAM requirement, where $V$ is the average inverted-list length, which remains practical even for $N = 10^9$. On standard CPUs, parallelization (e.g., OpenMP BLAS) and asynchronous I/O further accelerate search and allow queries on commodity hardware (Emanuilov et al., 23 Jan 2025).
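
A back-of-envelope check of that working set, assuming float32 embeddings, one byte per attribute, and $V = N/K$ points per list on average (byte sizes are assumptions, not from the paper):

```python
# Per-query working-set estimate under the regime described above.
N, D, M = 10**9, 768, 10     # points, embedding dim, attribute count
K = int(N ** 0.5)            # ~31,623 centroids (K = sqrt(N))
T = 7                        # lists probed per query
V = N // K                   # average inverted-list length
bytes_per_point = D * 4 + M * 1          # float32 embedding + int8 attributes
ram_bytes = T * V * bytes_per_point      # O(T * V * (D + M)) working set
print(f"{ram_bytes / 2**30:.2f} GiB")    # well under a single gigabyte
```

Even with full embeddings loaded for every probed list, the per-query footprint stays far below the 64 GB of the evaluation machine, which is what makes the disk-resident design viable.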

5. Empirical Validation: LAION-5B Case Study

A full index of $N = 10^9$ images ($D = 768$) with $M = 10$ synthetic attributes, $K = \sqrt{N} \approx 32{,}000$, and $T = 7$ lists probed, on a single Xeon E-2274G (64 GB RAM, NVMe+SATA), yields:

  Operation                        Time (s)
  Centroid coarse search           0.008
  Attribute filtering              1.090
  Detailed $\ell_2$ on filtered    0.330
  Total                            1.428

With 12-thread BLAS, query time dropped from ~16 s to ~1.4 s. Typical filter selectivity (fraction of vectors passing) was 5–20%. Recall vs. $T$: $R(T) \approx 0.75$ for $T = 3$, $0.89$ for $T = 7$, and $0.95$ for $T = 15$ at $k = 100$.

This demonstrates sub-second kNN+filter queries at billion-scale, high recall, and low hardware cost—without GPU acceleration (Emanuilov et al., 23 Jan 2025).

6. Summary and Practical Implications

The advanced similarity-based filtering strategy implemented via hybrid IVF-Flat indexing introduces a scalable, unified retrieval model for billion-point data, answering joint kNN and multi-dimensional attribute queries in sublinear time. Key innovations include physical separation of embedding and attribute vectors, efficient bitwise filtering, disk-resident inverted lists with dynamic loading, and full CPU-side optimization. The performance validates its suitability for real-world billion-scale datasets.

This approach generalizes to any context where similarity search must be tightly coupled to arbitrary discrete filtering—offering practical kNN+SQL filtering pipelines for generic high-dimensional retrieval tasks (Emanuilov et al., 23 Jan 2025).
