Subword TF-IDF for Multilingual Retrieval

Updated 16 January 2026
  • STF-IDF is an extension of TF-IDF that uses learned subword segmentation to create language-agnostic and robust text representations.
  • It eliminates the need for manual tokenization, stop-word filtering, and stemming by using a unified subword model trained on multilingual corpora.
  • Empirical results on the XQuAD dataset show STF-IDF achieves competitive accuracy across 12 languages, enhancing cross-lingual search performance.

Subword term frequency-inverse document frequency (STF-IDF) is an information retrieval technique that generalizes the canonical TF-IDF scheme to operate directly on subword units, as opposed to word-level tokens. STF-IDF was developed to overcome the shortcomings of word-based tokenization for multilingual retrieval and robust text representation, providing a formulation that dispenses with hand-crafted rules such as language-specific stopword lists or stemming heuristics. With subword vocabulary and segmentation learned from raw multilingual corpora, STF-IDF enables unified, language-agnostic retrieval across typologically diverse scripts and languages, as demonstrated with XQuAD evaluation benchmarks (Wangperawong, 2022).

1. Formal Specification

Let $D = \{d_1, \ldots, d_N\}$ denote the corpus of documents and $\Sigma$ the learned subword vocabulary of size $|\Sigma| = 128\,000$. Both documents $d$ and queries $q$ are tokenized into multisets of subwords $s \in \Sigma$ using a data-driven segmentation model. The core components of STF-IDF are as follows:

  • Subword term frequency:

$$\mathrm{tf}(s, d) = \text{number of occurrences of } s \text{ in } d$$

  • Inverse document frequency (with add-one smoothing):

$$\mathrm{IDF}(s) = \log \left( \frac{N}{1 + |\{ d \in D : \mathrm{tf}(s, d) > 0 \}|} \right)$$

  • STF-IDF weight:

$$w(s, d) = \mathrm{tf}(s, d) \times \mathrm{IDF}(s)$$

  • Each document/query receives a sparse vector representation:

$$v_d = \left( w(s, d) \right)_{s \in \Sigma}, \qquad v_q = \left( w(s, q) \right)_{s \in \Sigma}$$

  • Similarity scoring (cosine similarity):

$$\mathrm{score}(q, d) = \cos(v_q, v_d) = \frac{ \sum_{s \in \Sigma} w(s, q)\, w(s, d) }{ \|v_q\|_2\, \|v_d\|_2 }$$

Unseen subwords (those never encountered in the corpus) have a document frequency of zero and therefore receive the maximal inverse document frequency, $\mathrm{IDF}(s) = \log N$.
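
The definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the reference implementation: it assumes documents and queries have already been segmented into subword lists, and the toy documents, the function names (`idf_table`, `stf_idf_vector`, `cosine`) and the use of the natural logarithm are assumptions made for the example.

```python
import math
from collections import Counter

def idf_table(docs):
    """IDF(s) = log(N / (1 + df(s))) for every subword seen in the corpus."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each subword at most once per document
    return {s: math.log(n_docs / (1 + count)) for s, count in df.items()}

def stf_idf_vector(tokens, idf, n_docs):
    """Sparse vector {subword: tf * IDF}; unseen subwords get IDF = log(N)."""
    tf = Counter(tokens)
    return {s: count * idf.get(s, math.log(n_docs)) for s, count in tf.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[s] for s, w in u.items() if s in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical pre-segmented documents (not taken from the source).
docs = [
    ["sub", "word", "re", "trie", "val"],
    ["word", "level", "token", "s"],
    ["multi", "lingual", "re", "trie", "val"],
    ["stop", "word", "list", "s"],
]
idf = idf_table(docs)
doc_vectors = [stf_idf_vector(d, idf, len(docs)) for d in docs]
query_vector = stf_idf_vector(["multi", "lingual", "search"], idf, len(docs))
scores = [cosine(query_vector, dv) for dv in doc_vectors]
best = max(range(len(docs)), key=scores.__getitem__)
```

In this toy corpus only the third document shares subwords with the query, so it is the one selected; the query subword `search` never occurs in the corpus and falls back to the maximal IDF.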

2. Comparison with Word-Based TF-IDF

Traditional word-level TF-IDF requires extensive language-dependent processing: manual specification of tokenization rules (e.g., handling whitespace and punctuation), hand-crafted stop-word lists to filter non-informative terms, and explicit stemming/lemmatization for morphological normalization. These heuristics must be engineered separately for each language and are typically brittle under agglutinative or non-segmented scripts (such as Thai or Chinese). STF-IDF eliminates these steps through universal data-driven subword segmentation:

  • No language-specific tokenization: segmentation is produced by a single, learned model for all languages.
  • No stop-word lists: infrequent subwords intrinsically receive high $\mathrm{IDF}$, while frequent subwords are down-weighted.
  • No stemming: morphological variants share subword components and thus are automatically conflated in representation.

This design supports multilingual and cross-lingual IR, reduces per-language maintenance overhead, and robustly handles novel linguistic phenomena without bespoke pipelines (Wangperawong, 2022).
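
The conflation of morphological variants can be illustrated with a toy segmenter. The sketch below uses greedy longest-match segmentation over a hand-picked vocabulary, which is a simplified stand-in for a learned BPE model and not the method used by STF-IDF itself; the vocabulary, the word list and the function name `segment` are all hypothetical.

```python
def segment(word, vocab):
    """Greedy longest-match segmentation; a toy stand-in for a learned BPE model."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # no vocabulary match: fall back to one character
            i += 1
    return pieces

# Hypothetical vocabulary, chosen by hand for illustration only.
vocab = {"retriev", "al", "ing", "ed", "er", "s"}
variants = ["retrieval", "retrieving", "retrieved", "retrievers"]
segmented = {w: segment(w, vocab) for w in variants}

# Subwords common to every variant play the role a stem would play.
shared = set.intersection(*(set(p) for p in segmented.values()))
```

All four variants contain the piece `retriev`, so a query using one form still overlaps with documents using another, without any explicit stemmer.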

3. Subword Tokenizer Model and Multilingual Coverage

The subword vocabulary $\Sigma$ is obtained via a SentencePiece model trained with byte-pair encoding (BPE), informed by methodologies of Sennrich et al. (2015) and multilingual recommendations of Fan et al. (2021). Training utilizes Wikipedia dumps from the top 100 largest character-based languages, augmented with additional monolingual data for low-resource languages. Character coverage is set at 0.9995 to include infrequent symbols.
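
A training call with these settings might look like the sketch below, which uses the SentencePiece Python package. Only the BPE model type, the 128,000 vocabulary size and the 0.9995 character coverage come from the text; the corpus path, the model prefix and the helper names `trainer_args` and `train_subword_model` are assumptions, and the actual training arguments used by the authors may differ.

```python
def trainer_args(corpus_path, model_prefix):
    """Keyword arguments for SentencePieceTrainer.train, using the settings quoted in the text."""
    return dict(
        input=corpus_path,          # assumed: one sentence per line, all languages mixed
        model_prefix=model_prefix,  # SentencePiece writes <prefix>.model and <prefix>.vocab
        model_type="bpe",
        vocab_size=128000,
        character_coverage=0.9995,
    )

def train_subword_model(corpus_path, model_prefix="stfidf_bpe"):
    """Train the model and return a processor that segments raw text into subwords."""
    import sentencepiece as spm  # third-party package: pip install sentencepiece

    spm.SentencePieceTrainer.train(**trainer_args(corpus_path, model_prefix))
    return spm.SentencePieceProcessor(model_file=model_prefix + ".model")
```

With a trained processor `sp`, calling `sp.encode(text, out_type=str)` returns the subword pieces for a string in any of the covered languages.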

To address resource imbalance, the sampling probability for language $l$, with $D_l$ denoting the amount of training data available for that language, is defined as:

$$p_l = \frac{D_l}{\sum_i D_i}, \qquad p_l^{(T)} \propto p_l^{1/T}, \qquad T = 5$$

This temperature-adjusted sampling upweights low-resource languages during subword vocabulary construction.
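
The effect of the temperature can be checked numerically. The sketch below is illustrative only: the corpus sizes are hypothetical and the function name `temperature_sampling` is an assumption, but the computation follows the formula above with T = 5.

```python
def temperature_sampling(sizes, temperature=5.0):
    """Return (raw, adjusted) sampling probabilities per language."""
    total = sum(sizes.values())
    raw = {lang: n / total for lang, n in sizes.items()}
    # Raise each probability to the power 1/T, then renormalize to sum to 1.
    scaled = {lang: p ** (1.0 / temperature) for lang, p in raw.items()}
    norm = sum(scaled.values())
    adjusted = {lang: s / norm for lang, s in scaled.items()}
    return raw, adjusted

# Hypothetical corpus sizes (for example, sentence counts); not figures from the source.
sizes = {"en": 1_000_000, "tr": 100_000, "th": 10_000}
raw, adjusted = temperature_sampling(sizes)
```

The ordering of languages is preserved, but the smallest corpus receives a much larger share after adjustment than its raw proportion, while the largest corpus receives a smaller one.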

Distinct scripts and rare transliterations are decomposed into elements from the shared codebook. The resulting unified subword space directly supports mixed-language or multilingual queries without modifications to the STF-IDF recipe.

4. Implementation and Retrieval Mechanics

STF-IDF uses the raw count for $\mathrm{tf}(s, d)$, with $\ell^2$ normalization of vectors enforced during cosine similarity computation. Optionally, logarithmic scaling $\mathrm{tf}'(s, d) = 1 + \log \mathrm{tf}(s, d)$ may be applied when $\mathrm{tf}(s, d) > 0$. The add-one term in the $\mathrm{IDF}$ denominator keeps the denominator nonzero, so the weight is defined for every subword, including those that occur in no document.
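
A short numerical check makes the behaviour of the smoothed IDF concrete. The sketch assumes the natural logarithm (the base is not stated in the text) and uses N = 240 purely as an illustrative corpus size; the function names are hypothetical. It shows that the add-one term keeps the denominator nonzero, while the IDF value itself reaches zero for a subword found in all but one document and turns slightly negative for one found in every document.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """IDF with add-one smoothing in the denominator: log(N / (1 + df))."""
    return math.log(n_docs / (1 + doc_freq))

def log_scaled_tf(count):
    """Optional sublinear term frequency: 1 + log(tf) for tf > 0, else 0."""
    return 1.0 + math.log(count) if count > 0 else 0.0

n_docs = 240
idf_unseen = smoothed_idf(n_docs, 0)                 # subword in no document: log(N)
idf_almost_all = smoothed_idf(n_docs, n_docs - 1)    # log(240 / 240) = 0
idf_everywhere = smoothed_idf(n_docs, n_docs)        # log(240 / 241) < 0
```

Subwords that occur in nearly every document therefore contribute almost nothing to the score, which is the mechanism that replaces an explicit stop-word list.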

The closed nature of $\Sigma$ guarantees that previously unseen strings are recursively decomposed into smaller known subwords (“unk” token handling per SentencePiece). Efficient passage retrieval is achieved by building an inverted index mapping $s$ to lists of pairs $(d, \mathrm{tf}(s, d))$, enabling rapid sparse dot-product calculation across candidate vectors by intersecting active postings for the query subwords.
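
The inverted-index mechanics can be sketched as follows. This is a minimal in-memory illustration under the same weighting as the formal specification, not the indexing code from the reference repository; the toy documents and the names `build_index` and `search` are assumptions.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Return (postings, idf, norms) for a list of pre-segmented documents."""
    postings = defaultdict(list)  # subword -> [(doc_id, tf), ...]
    for doc_id, tokens in enumerate(docs):
        for s, tf in Counter(tokens).items():
            postings[s].append((doc_id, tf))
    n_docs = len(docs)
    idf = {s: math.log(n_docs / (1 + len(plist))) for s, plist in postings.items()}
    squared = [0.0] * n_docs
    for s, plist in postings.items():
        for doc_id, tf in plist:
            squared[doc_id] += (tf * idf[s]) ** 2
    norms = [math.sqrt(x) for x in squared]  # precomputed document vector lengths
    return postings, idf, norms

def search(query_tokens, postings, idf, norms):
    """Rank documents by cosine score, touching only postings of query subwords."""
    n_docs = len(norms)
    query = {s: tf * idf.get(s, math.log(n_docs))
             for s, tf in Counter(query_tokens).items()}
    query_norm = math.sqrt(sum(w * w for w in query.values()))
    dots = defaultdict(float)
    for s, q_weight in query.items():
        for doc_id, tf in postings.get(s, []):
            dots[doc_id] += q_weight * tf * idf[s]
    ranked = [
        (doc_id, dot / (query_norm * norms[doc_id]))
        for doc_id, dot in dots.items()
        if query_norm > 0 and norms[doc_id] > 0
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

# Hypothetical pre-segmented documents (not taken from the source).
docs = [
    ["sub", "word", "re", "trie", "val"],
    ["word", "level", "token", "s"],
    ["multi", "lingual", "re", "trie", "val"],
    ["stop", "word", "list", "s"],
]
postings, idf, norms = build_index(docs)
results = search(["re", "trie", "val"], postings, idf, norms)
```

Documents that share no subword with the query are never visited, which is what keeps scoring cheap on a CPU; precomputing the document norms avoids recomputing vector lengths at query time.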

5. Empirical Evaluation on XQuAD

STF-IDF was evaluated on the XQuAD dataset, which contains 240 English Wikipedia paragraphs, each with multiple question-answer pairs, aligned across 12 languages. Each language has 1,190 questions, and the retrieval task is to select the paragraph most relevant to each question. The evaluation metric is strict accuracy:

$$\text{Accuracy} = \frac{\#\ \text{correct paragraph matches}}{1190}$$
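
As a sketch of how this metric could be computed, assuming each question has exactly one gold paragraph and the retriever returns a single top-ranked paragraph id, the function below counts exact matches. The function name and the toy ids are hypothetical.

```python
def top1_accuracy(predicted_ids, gold_ids):
    """Fraction of questions whose top-ranked paragraph is the gold paragraph."""
    if len(predicted_ids) != len(gold_ids):
        raise ValueError("predictions and gold labels must have the same length")
    if not gold_ids:
        raise ValueError("at least one question is required")
    correct = sum(p == g for p, g in zip(predicted_ids, gold_ids))
    return correct / len(gold_ids)

# Hypothetical paragraph ids for four questions; XQuAD has 1190 questions per language.
gold = [3, 3, 7, 12]
predicted = [3, 5, 7, 12]
accuracy = top1_accuracy(predicted, gold)
```

Because only the top-ranked paragraph counts, a correct paragraph ranked second contributes nothing, which is why the metric is described as strict.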

CPU-based retrieval (standard desktop CPU, no GPU) was used for all experiments. Baseline configurations included:

  • word tokenization only
  • word → stop-word removal
  • word → stemming (Porter)
  • word → stop-word removal → stemming

Experimental Results

The following tables summarize results for retrieval accuracy:

Word-Based TF-IDF on English

| Tokenization | Accuracy (%) |
| --- | --- |
| word | 84.2 |
| word → stop | 83.9 |
| word → stem | 84.9 |
| word → stop → stem | 85.2 |

Subword TF-IDF on English

| Tokenization | Accuracy (%) |
| --- | --- |
| subword | 85.4 |
| word → stop → subword | 84.2 |
| word → stem → subword | 85.4 |
| word → stop → stem → subword | 84.5 |

Multilingual STF-IDF Accuracy on XQuAD

| Language (code) | Accuracy (%) |
| --- | --- |
| English (en) | 85.4 |
| Spanish (es) | 85.8 |
| German (de) | 84.9 |
| Greek (el) | 81.3 |
| Russian (ru) | 82.9 |
| Turkish (tr) | 80.1 |
| Arabic (ar) | 77.1 |
| Vietnamese (vi) | 84.5 |
| Thai (th) | 83.5 |
| Chinese (zh) | 82.4 |
| Hindi (hi) | 80.9 |
| Romanian (ro) | 85.0 |

The highest English accuracy for STF-IDF (85.4%) marginally outperforms the best word-based recipe (85.2%, word → stop → stem). STF-IDF exceeds 80% accuracy in 10 of the 11 additional languages, with Arabic (77.1%) the exception. Statistical significance testing was not reported, but the absolute gain of more than 1 percentage point over plain word tokenization (84.2%) and the cross-language robustness are indicative of system-level improvements (Wangperawong, 2022).
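
The summary figures can be re-derived from the tables with a few lines of arithmetic. The sketch below only restates the reported accuracies; the variable names are assumptions, and the mean is a derived quantity not reported in the source.

```python
# Accuracies (%) copied from the multilingual results table.
accuracy = {
    "en": 85.4, "es": 85.8, "de": 84.9, "el": 81.3, "ru": 82.9, "tr": 80.1,
    "ar": 77.1, "vi": 84.5, "th": 83.5, "zh": 82.4, "hi": 80.9, "ro": 85.0,
}

# Languages that fall below the 80% mark.
below_80 = sorted(lang for lang, acc in accuracy.items() if acc < 80.0)

# Unweighted mean over the 12 languages (derived here, not reported in the source).
mean_accuracy = sum(accuracy.values()) / len(accuracy)

# English gains of subword STF-IDF over the word-based baselines, in percentage points.
gain_over_word_only = 85.4 - 84.2
gain_over_best_word_recipe = 85.4 - 85.2
```

The gain over plain word tokenization is about 1.2 points, whereas the gain over the best word-based recipe is about 0.2 points, so the size of the improvement depends on which baseline is used for comparison.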

6. Qualitative Analysis and Engineering Impact

STF-IDF offers marked robustness to out-of-vocabulary (OOV) and morphologically complex terms, as rare or novel constructs (e.g., names, compounds, loanwords) are decomposed into known subwords, ensuring information preservation in retrieval. Subword units capture functional morphemic boundaries such as prefixes and suffixes, facilitating automatic down-weighting of non-content morphemes via $\mathrm{IDF}$.

Operating within a unified vector space, the approach supports cross-lingual and code-mixed queries without language-specific configuration. Maintenance and engineering burden are materially reduced as the removal of language-specific stop lists and stemming heuristics eliminates the need for ongoing manual curation; subword model retraining is sufficient to adapt to shifting language use or novel domains.

7. Open-Source Resources and Reproducibility

Complete reference implementations, including the pre-trained 128k-token SentencePiece model, STF-IDF indexing scripts, and passage retrieval demo notebooks, are available in the Text2Text software repository (https://github.com/artitw/text2text). The package is installable via PyPI (pip install text2text). Configuration files, training arguments, and all assets required for result reproduction are provided openly (Wangperawong, 2022).
