BM25S: Accelerated Sparse BM25 Retrieval

Updated 9 February 2026
  • BM25S is a Python-based information retrieval framework that precomputes term-document BM25 scores to enable rapid, vectorized query ranking.
  • It leverages sparse matrix storage and BLAS-accelerated operations to achieve up to 500× speedups over conventional BM25 implementations.
  • It supports non-occurrence BM25 variants (such as BM25+ and BM25L) via differential scoring, preserving exact retrieval results and memory-efficient sparse storage during indexing and query-time aggregation.

BM25S (“BM25 via eager sparse scoring”) is a Python-based information retrieval framework that accelerates BM25 and its variants by precomputing all possible term-document BM25 scores at index time and storing them in a sparse matrix, thereby enabling orders-of-magnitude faster query-time ranked retrieval compared to conventional implementations. BM25S relies on NumPy, SciPy, and (optionally) libstemmer, and achieves up to 500× speedups over popular Python toolkits, while outperforming highly optimized Java-based solutions such as Elasticsearch by factors of 2–10× on many standard datasets. BM25S also generalizes to “non-occurrence” BM25 variants (such as BM25+ and BM25L) via a differential-score shifting method that preserves sparse storage and exact accuracy (Lù, 2024).

1. Classical BM25 Scoring

BM25 is a family of lexical ranking functions used in text retrieval. For a document collection $C$ of size $|C|$ and a query $Q = \{q_1, \ldots, q_{|Q|}\}$, classical BM25 assigns to each document $D \in C$ a score

$$B(Q, D) = \sum_{i=1}^{|Q|} S(q_i, D)$$

where the term-document score $S(t, D)$ is typically given (following Lucene) as:

$$S(t, D) = \mathrm{IDF}(t, C) \cdot \frac{\mathrm{TF}(t, D)}{\mathrm{TF}(t, D) + k_1 \left(1 - b + b \frac{|D|}{L_{\mathrm{avg}}}\right)}$$

with the following standard components:

  • $\mathrm{TF}(t, D)$: frequency of term $t$ in document $D$
  • $|D|$: number of tokens in $D$
  • $L_{\mathrm{avg}} = (1/|C|)\sum_{D \in C} |D|$: average document length
  • $k_1 > 0$, $b \in [0, 1]$: tunable parameters (e.g., $k_1 = 1.5$, $b = 0.75$)
  • $\mathrm{DF}(t, C) = |\{D \in C : t \in D\}|$: document frequency of term $t$
  • $\mathrm{IDF}(t, C) = \ln\left(\frac{|C| - \mathrm{DF}(t, C) + 0.5}{\mathrm{DF}(t, C) + 0.5} + 1\right)$

Conventional BM25 implementations compute (or look up) $\mathrm{TF}$ and $\mathrm{IDF}$ at query time and evaluate $S(q_i, D)$ for every document $D$ containing $q_i$, typically via inverted indexes.
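As a concrete illustration of the formula above (a toy example with invented collection statistics, not figures from the paper), a single term-document score can be evaluated directly:

```python
import math

def bm25_term_score(tf, doc_len, avg_len, n_docs, df, k1=1.5, b=0.75):
    """Lucene-style BM25 score of one term in one document."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    return idf * tf / (tf + k1 * (1 - b + b * doc_len / avg_len))

# Toy statistics: 4 documents, average length 10 tokens; the term
# appears 3 times in a 12-token document and occurs in 2 of the 4 docs.
score = bm25_term_score(tf=3, doc_len=12, avg_len=10, n_docs=4, df=2)
print(round(score, 4))
```

With $\mathrm{DF} = |C|/2$ the IDF term reduces to $\ln 2$, so the example is easy to check by hand.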

2. Eager Sparse Scoring and Matrix Construction

BM25S departs from the traditional inverted-index paradigm by eagerly evaluating each nonzero $S(t, D)$ during corpus indexing and storing the results as a sparse term-document matrix $M \in \mathbb{R}^{|V| \times |C|}$, where $V$ is the vocabulary:

  • Each unique word token (possibly after stemming and stopword removal) is mapped to an integer row index $r \in [0, |V|)$.
  • For document $D$ (column $c$), $M_{r,c} = S(t, D)$ for every $t \in D$.
  • Terms not present in a document have $\mathrm{TF} = 0$ and hence $S = 0$; such entries are omitted (sparsity).

Matrix $M$ is stored in CSC (Compressed Sparse Column) format, optimizing for fast access to document-wise sums and efficient slicing by multiple term rows. During querying, for a query $Q = \{q_1, \ldots, q_m\}$ with corresponding row indices $r_1, \ldots, r_m$, BM25S extracts the $m \times |C|$ submatrix $M' = M[[r_1, \ldots, r_m], :]$. Summing across rows gives the score vector for all documents:

scores = np.asarray(M_sub.sum(axis=0)).ravel()  # M_sub: the m x |C| submatrix M'

This operation leverages BLAS-accelerated sparse summation, yielding efficient query-time document ranking.
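The slice-and-sum step can be sketched end to end on a toy matrix (illustrative variable names, not the BM25S API; each nonzero entry stands in for a precomputed $S(t, D)$):

```python
import numpy as np
from scipy.sparse import csc_matrix

# Toy precomputed scores: 4 terms (rows) x 3 documents (columns).
rows = [0, 0, 1, 2, 2, 3]
cols = [0, 2, 1, 0, 1, 2]
data = [0.5, 0.2, 0.7, 0.4, 0.1, 0.9]
M = csc_matrix((data, (rows, cols)), shape=(4, 3))

# Query containing terms 0 and 2: slice those rows, sum down the columns.
query_rows = [0, 2]
scores = np.asarray(M[query_rows, :].sum(axis=0)).ravel()
print(scores)                # one aggregate score per document
best = int(np.argmax(scores))
```

Here document 0 scores $0.5 + 0.4 = 0.9$ and ranks first; no per-term arithmetic happens at query time.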

3. Extension to Non-Occurrence Variants: Differential Scoring

Variants of BM25 (such as BM25+, BM25L, and others) may assign a nonzero score even when $\mathrm{TF}(t, D) = 0$, i.e., when a term does not occur in a document. Let $S^\theta(t) = S(t, \emptyset)$ denote the “non-occurrence score.” For example, BM25+ uses:

$$S(t, D) = \mathrm{IDF}(t) \cdot \left[\frac{(k_1 + 1)\,\mathrm{TF}(t, D)}{\mathrm{TF}(t, D) + k_1\left(1 - b + b\,|D|/L_{\mathrm{avg}}\right)} + \delta\right]$$

with $\delta > 0$ and non-occurrence score $S(t, \emptyset) = \delta \cdot \mathrm{IDF}(t) > 0$, independent of the document.

BM25S defines the “differential score”:

$$S^\Delta(t, D) = S(t, D) - S^\theta(t)$$

For $t \notin D$, $S(t, D) = S^\theta(t)$, hence $S^\Delta(t, D) = 0$; thus $S^\Delta$ is still sparse. The aggregate BM25 score is exactly recovered as:

$$B(Q, D) = \sum_{i=1}^{m} S(q_i, D) = \sum_{i=1}^{m} S^\Delta(q_i, D) + \sum_{i=1}^{m} S^\theta(q_i)$$

BM25S stores only $S^\Delta$ in the sparse matrix; a small 1D array of $S^\theta$ values (one per term) is maintained, and $\sum_{i=1}^m S^\theta(q_i)$ is computed once per query and added to the document scores. This approach generalizes to all BM25 variants covered in Kamphuis et al. (2020) without dense storage explosion.
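The differential-score bookkeeping can be sketched as follows (a toy illustration assuming a per-term, document-independent non-occurrence score as in BM25+; names are illustrative, not the BM25S API):

```python
import numpy as np
from scipy.sparse import csc_matrix

# Toy setup: 3 terms x 2 documents.
theta = np.array([0.10, 0.05, 0.20])   # S^theta(t): non-occurrence score per term

# Full scores S(t, D); entries equal theta[t] where the term is absent.
S = np.array([[0.60, 0.10],
              [0.05, 0.45],
              [0.20, 0.80]])

# Store only the sparse differential S^Delta = S - S^theta (zero when absent).
S_delta = csc_matrix(S - theta[:, None])

# Query on terms {0, 2}: sparse slice-and-sum plus one scalar offset.
q = [0, 2]
scores = np.asarray(S_delta[q, :].sum(axis=0)).ravel() + theta[q].sum()

# Exact recovery of the aggregate score sum_i S(q_i, D):
assert np.allclose(scores, S[q, :].sum(axis=0))
```

The offset $\sum_i S^\theta(q_i)$ is the same for every document, so it never affects the ranking, only the absolute score values.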

4. Computational Complexity and Memory Analysis

BM25S exhibits the following complexity characteristics:

  • Index-time: $O(\sum_D |D|)$ arithmetic operations to compute all $\mathrm{TF}$ and $S(t, D)$ values; assembling the CSC sparse matrix requires $\mathrm{nnz} \approx \sum_D |D|$ explicitly stored entries (each entry: 8 bytes for the float score plus two 4-byte indices).
  • Query-time: For a query of length $m$ over $n$ documents, extracting $m$ rows and summing $m$ sparse vectors takes $O(\sum_i \mathrm{df}(q_i))$, where $\mathrm{df}(q_i)$ is the posting-list length of $q_i$. Top-$k$ selection is performed by numpy.argpartition in expected $O(n)$ time.

Empirical usage shows $\sum_i \mathrm{df}(q_i) \ll n$ for typical queries, and fast C-based kernels dominate performance. For comparison:

  • Naive Python implementations (e.g., Rank-BM25) recompute all term scores per query, with $O(mn)$ Python-level operations.
  • Java Lucene computes $\mathrm{TF}/(\mathrm{TF} + \cdots)$ at query time for each term-document match. BM25S shifts all per-occurrence computation to indexing, enabling high-throughput query-time ranking via vectorized linear algebra.
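The contrast can be made concrete on synthetic data: a Python-level loop touches every (term, document) pair, while the sparse slice-and-sum visits only the stored postings (a rough equivalence sketch, not a rigorous benchmark; sizes and density are invented):

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Synthetic "precomputed score" matrix: 1000 terms x 5000 docs, ~1% dense.
M = sparse_random(1000, 5000, density=0.01, format="csc", random_state=0)

q = [3, 17, 256]  # query term rows

# Vectorized path: O(sum_i df(q_i)) work inside compiled kernels.
fast = np.asarray(M[q, :].sum(axis=0)).ravel()

# Naive path: O(m * n) Python-level operations over a dense view.
dense = M.toarray()
slow = np.zeros(M.shape[1])
for r in q:
    for d in range(M.shape[1]):
        slow[d] += dense[r, d]

assert np.allclose(fast, slow)
# The sparse path stores/visits far fewer entries than m * n dense pairs:
print(M[q, :].nnz, "stored entries vs", len(q) * M.shape[1], "dense pairs")
```

Both paths produce identical scores; the asymptotic gap is the ratio of stored postings to $m \cdot n$.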

5. Empirical Benchmarks

BM25S was evaluated on BEIR zero-shot benchmark datasets (ArguAna, Climate-FEVER, CQADupStack, DBPedia, FEVER, FiQA, HotpotQA, MS-MARCO, NFCorpus, NaturalQuestions, Quora, SciDocs, SciFact, TREC-COVID, Touche2020) using a single-threaded Intel Xeon (2.2 GHz, 30 GB RAM). Throughput is measured in queries per second (QPS):

Dataset    BM25S QPS   Elasticsearch QPS   BM25-PT QPS   Rank-BM25 QPS   Speedup (BM25S/ES)
ArguAna    574         13.7                110.5         2.0             ~42×
NFCorpus   1,196       45.8                256.7         224.7           ~26×

On 10 of 14 datasets, BM25S achieves a $>100\times$ speedup over Rank-BM25, peaking at $500\times$ (ArguAna). Against Java-based Elasticsearch, BM25S achieves 2–10× higher QPS in most cases. NDCG@10 evaluation shows that adding a Snowball stemmer and English stopword removal can improve average effectiveness from 38.4 to 39.7, confirming parity or slight superiority versus established toolkits (Lù, 2024).

6. Implementation Details and Recipes

  • Tokenization & Vocabulary: By default, BM25S tokenizes with Scikit-Learn’s regex r"(?u)\b\w\w+\b", with optional C-based stemming via libstemmer. Each token is mapped to its integer vocabulary index.
  • Index Construction:

import numpy as np
from scipy.sparse import csc_matrix

rows, cols, data = [], [], []
# N: number of documents; df: per-term document frequencies, shape (|V|,)
idf = np.log((N - df + 0.5) / (df + 0.5) + 1)
for d, doc_terms in enumerate(corpus):   # doc_terms: {term_index: term_frequency}
    Ld = sum(doc_terms.values())         # document length |D| in tokens
    norm = k1 * (1 - b + b * (Ld / L_avg))
    for t_idx, tf in doc_terms.items():
        score = idf[t_idx] * tf / (tf + norm)
        rows.append(t_idx); cols.append(d); data.append(score)
M = csc_matrix((data, (rows, cols)), shape=(V_size, N))  # V_size = |V|

  • Querying & Top-k Selection:

sub = M[query_row_indices, :]                 # m x N sparse submatrix
scores = np.asarray(sub.sum(axis=0)).ravel()
topk_idx = np.argpartition(scores, -k)[-k:]
topk_sorted = topk_idx[np.argsort(scores[topk_idx])[::-1]]

  • Non-occurrence Variants: Store $\Delta = \text{score} - \text{base score}$ in the matrix, alongside a small $|V|$-length array of base scores; at query time, an $O(m)$ scalar addition recovers the global offset.
  • Optional Accelerations: Employ JAX’s jax.lax.top_k for faster selection or wrap matrix operations in a thread pool for multi-threaded throughput.
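A thread-pool wrapper of the kind mentioned above might look like this (an illustrative standard-library sketch on a toy matrix, not BM25S’s own API; threads can overlap only where the underlying NumPy/SciPy kernels release the GIL):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from scipy.sparse import csc_matrix

# Toy precomputed score matrix: 4 terms x 3 documents.
M = csc_matrix(np.array([[0.5, 0.0, 0.2],
                         [0.0, 0.7, 0.0],
                         [0.4, 0.1, 0.0],
                         [0.0, 0.0, 0.9]]))

def rank(query_rows, k=2):
    """Score all documents for one query; return top-k indices, best first."""
    scores = np.asarray(M[query_rows, :].sum(axis=0)).ravel()
    topk = np.argpartition(scores, -k)[-k:]
    return topk[np.argsort(scores[topk])[::-1]].tolist()

queries = [[0, 2], [1, 3], [0, 3]]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(rank, queries))
print(results)
```

Because the index matrix is read-only at query time, no locking is needed around `M`.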

7. Limitations and Deployment Considerations

  • Index-Time Resource Usage: Each term occurrence is precomputed and stored as a float (vs. an integer $\mathrm{TF}$ in classic inverted indexes). Large corpora (e.g., $2 \times 10^6$ documents with a $2 \times 10^5$-term vocabulary) remain sparse, but RAM requirements can reach tens of GB.
  • Parameter Fixity: Parameters $k_1$ and $b$ are fixed at index time. Modifying them requires index rebuilding, unlike Rank-BM25, which supports query-time parameter adjustment.
  • Tokenizer Choice: The provided regex+stemmer combination offers a balance of speed and fidelity. Language-specific analyzers may require customization.
  • Index Maintenance: BM25S is suited to mostly-static corpora. Document additions/deletions require partial or full reindexing; incremental updates are nontrivial.
  • Non-Occurrence Overhead: BM25+ and related variants incur an extra $O(m)$ per-query addition, which remains negligible relative to the main computation.

BM25S “rolls up” all expensive term-document scoring into an index-time matrix, transforming query-time retrieval into a small set of vectorized operations and top-$k$ selection. Its speed, exactness for BM25 and variants, minimal dependencies, and ease of integration make it suitable for both research and production, from server deployments to browser-based Pyodide execution (Lù, 2024).
