Subword TF-IDF for Multilingual Retrieval
- STF-IDF is an extension of TF-IDF that uses learned subword segmentation to create language-agnostic and robust text representations.
- It eliminates the need for manual tokenization, stop-word filtering, and stemming by using a unified subword model trained on multilingual corpora.
- Empirical results on the XQuAD dataset show STF-IDF achieves competitive accuracy across 12 languages, enhancing cross-lingual search performance.
Subword term frequency-inverse document frequency (STF-IDF) is an information retrieval technique that generalizes the canonical TF-IDF scheme to operate directly on subword units, as opposed to word-level tokens. STF-IDF was developed to overcome the shortcomings of word-based tokenization for multilingual retrieval and robust text representation, providing a formulation that dispenses with hand-crafted rules such as language-specific stopword lists or stemming heuristics. With subword vocabulary and segmentation learned from raw multilingual corpora, STF-IDF enables unified, language-agnostic retrieval across typologically diverse scripts and languages, as demonstrated with XQuAD evaluation benchmarks (Wangperawong, 2022).
1. Formal Specification
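Let $D$ denote the corpus of $N$ documents and $V$ the learned subword vocabulary of size $|V|$. Both documents and queries are tokenized into multisets of subwords using a data-driven segmentation model. The core components of STF-IDF are as follows:
- Subword term frequency: $\mathrm{tf}(s, d)$, the number of occurrences of subword $s \in V$ in document $d$.
- Inverse document frequency (with add-one smoothing): $\mathrm{idf}(s) = \log \frac{N}{1 + \mathrm{df}(s)}$, where $\mathrm{df}(s) = |\{d \in D : s \in d\}|$.
- STF-IDF weight: $w(s, d) = \mathrm{tf}(s, d) \cdot \mathrm{idf}(s)$.
- Each document/query receives a sparse vector representation: $\mathbf{v}_d = \big(w(s_1, d), \ldots, w(s_{|V|}, d)\big) \in \mathbb{R}^{|V|}$.
- Similarity scoring (cosine similarity): $\mathrm{sim}(q, d) = \frac{\mathbf{v}_q \cdot \mathbf{v}_d}{\lVert \mathbf{v}_q \rVert \, \lVert \mathbf{v}_d \rVert}$.
Unseen subwords (those never encountered in the corpus) receive maximal inverse document frequency, i.e., $\mathrm{idf}(s) = \log N$.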
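The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the `segment` function is a stand-in that uses overlapping character trigrams (an assumption made so the example runs without a trained subword model), the IDF is assumed to take the form $\log(N / (1 + \mathrm{df}))$, and all function names are hypothetical.

```python
import math
from collections import Counter


def segment(text, n=3):
    """Stand-in segmenter (assumption): overlapping character n-grams.

    A real system would call the learned subword model here instead.
    """
    text = text.lower().replace(" ", "_")
    if len(text) <= n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def document_frequencies(corpus):
    """Count, for each subword, the number of documents containing it."""
    df = Counter()
    for doc in corpus:
        df.update(set(segment(doc)))
    return df


def idf(subword, df, n_docs):
    """Add-one smoothed IDF; a subword with df = 0 receives log(N)."""
    return math.log(n_docs / (1 + df.get(subword, 0)))


def stfidf_vector(text, df, n_docs):
    """Sparse vector {subword: raw count * idf}."""
    tf = Counter(segment(text))
    return {s: count * idf(s, df, n_docs) for s, count in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(s, 0.0) for s, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query, corpus):
    """Return the index of the document most similar to the query."""
    df = document_frequencies(corpus)
    n_docs = len(corpus)
    q_vec = stfidf_vector(query, df, n_docs)
    scores = [cosine(q_vec, stfidf_vector(d, df, n_docs)) for d in corpus]
    return max(range(n_docs), key=scores.__getitem__)


corpus = [
    "the cat sat on the mat",
    "stock markets fell sharply today",
    "subword tokenization handles rare words",
    "bangkok is the capital of thailand",
]
best = retrieve("capital of thailand", corpus)
```

Here `best` is the index of the Bangkok sentence, since the query shares nearly all of its trigrams with that document. Swapping `segment` for a call into a trained subword model leaves the rest of the sketch unchanged.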
2. Comparison with Word-Based TF-IDF
Traditional word-level TF-IDF requires extensive language-dependent processing: manual specification of tokenization rules (e.g., handling whitespace and punctuation), hand-crafted stop-word lists to filter non-informative terms, and explicit stemming/lemmatization for morphological normalization. These heuristics must be engineered separately for each language and are typically brittle under agglutinative or non-segmented scripts (such as Thai or Chinese). STF-IDF eliminates these steps through universal data-driven subword segmentation:
- No language-specific tokenization: segmentation is produced by a single, learned model for all languages.
- No stop-word lists: infrequent subwords intrinsically receive high $\mathrm{idf}$, while frequent subwords are down-weighted.
- No stemming: morphological variants share subword components and thus are automatically conflated in representation.
This design supports multilingual and cross-lingual IR, reduces per-language maintenance overhead, and robustly handles novel linguistic phenomena without bespoke pipelines (Wangperawong, 2022).
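The point about stemming can be illustrated with a small sketch. Character trigrams are used here as a stand-in for learned subwords (an assumption; a BPE model would produce different units), and the helper names are hypothetical. Word-level matching treats "connect" and "connected" as unrelated terms, whereas their subword sets overlap substantially.

```python
def char_ngrams(word, n=3):
    """Set of character n-grams, with '_' marking the word boundaries."""
    padded = f"_{word.lower()}_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


variants = ["connect", "connected", "connection"]
base = char_ngrams(variants[0])

# Overlap of each inflected form with the base form at the subword level.
overlaps = {w: jaccard(base, char_ngrams(w)) for w in variants[1:]}

# Word-level exact matching sees no relationship at all.
word_level_matches = {w: w == variants[0] for w in variants[1:]}
```

Both variants retain a non-zero overlap with the base form without any stemmer, which is the conflation effect described above.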
3. Subword Tokenizer Model and Multilingual Coverage
The subword vocabulary is obtained via a SentencePiece model trained with byte-pair encoding (BPE), informed by methodologies of Sennrich et al. (2015) and multilingual recommendations of Fan et al. (2021). Training utilizes Wikipedia dumps from the top 100 largest character-based languages, augmented with additional monolingual data for low-resource languages. Character coverage is set at 0.9995 to include infrequent symbols.
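A training configuration along these lines might look as follows with the `sentencepiece` Python package. This is a sketch under stated assumptions: the corpus path and model prefix are hypothetical, and the vocabulary size of 128,000 is inferred from the "128k-token" model mentioned in this article rather than taken from a published configuration file.

```python
def build_trainer_args(corpus_path, model_prefix="stfidf_bpe"):
    """Keyword arguments for SentencePiece training.

    corpus_path and model_prefix are hypothetical placeholders.
    """
    return {
        "input": corpus_path,            # one sentence per line, all languages mixed
        "model_prefix": model_prefix,    # writes <prefix>.model and <prefix>.vocab
        "model_type": "bpe",             # byte-pair encoding
        "vocab_size": 128000,            # assumption based on the "128k-token" model
        "character_coverage": 0.9995,    # value stated in the text
    }


def train_subword_model(corpus_path):
    """Train the model; requires the third-party sentencepiece package."""
    import sentencepiece as spm  # imported lazily so the sketch loads without it

    spm.SentencePieceTrainer.train(**build_trainer_args(corpus_path))


args = build_trainer_args("multilingual_corpus.txt")
```

Calling `train_subword_model` on a real corpus file would produce the model file that a segmenter then loads; the sampling of languages into that corpus is a separate step.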
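To address resource imbalance, the sampling probability for language $\ell$ is defined as: $p_\ell = \frac{n_\ell^{1/T}}{\sum_{k} n_k^{1/T}}$, where $n_\ell$ is the amount of training text available for language $\ell$ and $T \geq 1$ is the sampling temperature.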
This temperature-adjusted sampling upweights low-resource languages during subword vocabulary construction.
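The effect of the temperature can be seen in a short sketch. It assumes the common form of temperature sampling in which each language's data size is raised to the power $1/T$ and then normalized; the corpus sizes and the temperature value below are illustrative only and are not taken from the paper.

```python
def sampling_probabilities(sizes, temperature):
    """Temperature-adjusted sampling probabilities per language.

    sizes: mapping from language code to amount of training text.
    temperature: T >= 1; T = 1 reproduces proportional sampling.
    """
    scaled = {lang: n ** (1.0 / temperature) for lang, n in sizes.items()}
    total = sum(scaled.values())
    return {lang: value / total for lang, value in scaled.items()}


# Illustrative sizes (assumption): a high-resource and a low-resource language.
sizes = {"en": 1_000_000, "th": 10_000}

proportional = sampling_probabilities(sizes, temperature=1.0)
flattened = sampling_probabilities(sizes, temperature=5.0)
```

With `temperature=1.0` the low-resource language is sampled in proportion to its size; raising the temperature moves the distribution toward uniform, so `flattened["th"]` is larger than `proportional["th"]`.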
Distinct scripts and rare transliterations are decomposed into elements from the shared codebook. The resulting unified subword space directly supports mixed-language or multilingual queries without modifications to the STF-IDF recipe.
4. Implementation and Retrieval Mechanics
STF-IDF uses the raw count for the term frequency $\mathrm{tf}(s, d)$, with vector normalization applied during the cosine similarity computation. Optionally, logarithmic scaling of the term frequency may be applied. The smoothed denominator $1 + \mathrm{df}(s)$ keeps the inverse document frequency finite for every subword, including subwords with zero document frequency.
The closed nature of $V$ guarantees that previously unseen strings are recursively decomposed into smaller known subwords (“unk” token handling per SentencePiece). Efficient passage retrieval is achieved by building an inverted index mapping each subword $s$ to a postings list of $(d, w(s, d))$ pairs, enabling rapid sparse dot-product calculation across candidate vectors by visiting only the postings of the query subwords.
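The inverted-index retrieval described above can be sketched as follows. The sketch assumes documents and queries have already been segmented into subword lists, uses an IDF of the form $\log(N / (1 + \mathrm{df}))$, and stores precomputed document norms so that cosine similarity needs only the postings of the query subwords. All names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_index(docs_subwords):
    """Build postings {subword: [(doc_id, weight), ...]} plus IDF and norms."""
    n_docs = len(docs_subwords)
    df = Counter()
    for subwords in docs_subwords:
        df.update(set(subwords))
    idf = {s: math.log(n_docs / (1 + c)) for s, c in df.items()}

    postings = defaultdict(list)
    squared_norms = [0.0] * n_docs
    for doc_id, subwords in enumerate(docs_subwords):
        for s, count in Counter(subwords).items():
            weight = count * idf[s]
            if weight != 0.0:  # zero weights contribute nothing to any score
                postings[s].append((doc_id, weight))
                squared_norms[doc_id] += weight * weight
    norms = [math.sqrt(x) for x in squared_norms]
    return postings, idf, norms, n_docs


def search(query_subwords, index, top_k=1):
    """Return up to top_k (doc_id, cosine score) pairs, best first."""
    postings, idf, norms, n_docs = index
    # Subwords absent from the corpus fall back to the maximal IDF, log(N).
    q_weights = {
        s: c * idf.get(s, math.log(n_docs))
        for s, c in Counter(query_subwords).items()
    }
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    if q_norm == 0.0:
        return []

    # Accumulate dot products only over documents sharing a query subword.
    dots = defaultdict(float)
    for s, q_w in q_weights.items():
        for doc_id, d_w in postings.get(s, ()):
            dots[doc_id] += q_w * d_w

    ranked = sorted(
        ((dot / (q_norm * norms[d]), d) for d, dot in dots.items()),
        reverse=True,
    )
    return [(d, score) for score, d in ranked[:top_k]]


docs = [
    ["the", "cat", "sat"],
    ["sub", "word", "model"],
    ["the", "dog", "ran"],
    ["stock", "price", "fell"],
]
index = build_index(docs)
hits = search(["sub", "word"], index)
```

Documents that share no subword with the query are never touched, which is what keeps scoring cheap on a CPU when the collection grows.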
5. Empirical Evaluation on XQuAD
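STF-IDF was evaluated on the XQuAD dataset, which contains 240 English Wikipedia paragraphs (each with multiple QA pairs), aligned across 12 languages. Each language has 1,190 questions, and the retrieval task is to select the paragraph most relevant to each question. The evaluation metric is strict (top-1) accuracy: $\mathrm{Accuracy} = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{1}\big[\arg\max_{d \in D} \mathrm{sim}(q, d) = d^{*}_{q}\big]$, where $Q$ is the set of questions and $d^{*}_{q}$ is the paragraph from which question $q$ was drawn.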
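Strict accuracy, read here as the percentage of questions whose top-ranked paragraph is the gold paragraph, can be computed with a helper like the one below. The function name and the toy identifiers are hypothetical.

```python
def top1_accuracy(predicted_ids, gold_ids):
    """Percentage of questions whose top-ranked paragraph is the gold one."""
    if len(predicted_ids) != len(gold_ids):
        raise ValueError("predicted_ids and gold_ids must have equal length")
    if not gold_ids:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted_ids, gold_ids))
    return 100.0 * correct / len(gold_ids)


# Toy example: three of four questions retrieve the correct paragraph.
score = top1_accuracy([0, 1, 2, 2], [0, 1, 2, 3])
```

No partial credit is given when the gold paragraph is ranked second or lower, which is what makes the metric strict.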
CPU-based retrieval (standard desktop CPU, no GPU) was used for all experiments. Baseline configurations included:
- word tokenization only
- word → stop-word removal
- word → stemming (Porter)
- word → stop-word removal → stemming
Experimental Results
The following tables summarize results for retrieval accuracy:
Word-Based TF-IDF on English
| Tokenization | Accuracy (%) |
|---|---|
| word | 84.2 |
| word → stop | 83.9 |
| word → stem | 84.9 |
| word → stop → stem | 85.2 |
Subword TF-IDF on English
| Tokenization | Accuracy (%) |
|---|---|
| subword | 85.4 |
| word → stop → subword | 84.2 |
| word → stem → subword | 85.4 |
| word → stop → stem → subword | 84.5 |
Multilingual STF-IDF Accuracy on XQuAD
| Language (code) | Accuracy (%) |
|---|---|
| English (en) | 85.4 |
| Spanish (es) | 85.8 |
| German (de) | 84.9 |
| Greek (el) | 81.3 |
| Russian (ru) | 82.9 |
| Turkish (tr) | 80.1 |
| Arabic (ar) | 77.1 |
| Vietnamese (vi) | 84.5 |
| Thai (th) | 83.5 |
| Chinese (zh) | 82.4 |
| Hindi (hi) | 80.9 |
| Romanian (ro) | 85.0 |
The highest English accuracy for STF-IDF (85.4%) marginally outperforms the best word-based recipe (85.2%, word → stop → stem). STF-IDF exceeds 80% accuracy in 10 of the 11 additional languages; Arabic (77.1%) is the exception. Statistical significance testing was not reported, but the absolute gain of about 1 percentage point over plain word tokenization (84.2% to 85.4%) and the cross-language robustness are indicative of system-level improvements (Wangperawong, 2022).
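The summary figures can be checked directly against the tables. The short script below recomputes the English margins, the number of non-English languages above 80%, and the mean accuracy, using only values reported in the tables above; the variable names are arbitrary.

```python
# Multilingual STF-IDF accuracy (%) as reported in the table above.
accuracy = {
    "en": 85.4, "es": 85.8, "de": 84.9, "el": 81.3,
    "ru": 82.9, "tr": 80.1, "ar": 77.1, "vi": 84.5,
    "th": 83.5, "zh": 82.4, "hi": 80.9, "ro": 85.0,
}

best_word_based_en = 85.2  # word -> stop -> stem
plain_word_en = 84.2       # word tokenization only

# Margins of subword TF-IDF on English, in percentage points.
gain_over_best = round(accuracy["en"] - best_word_based_en, 1)
gain_over_plain = round(accuracy["en"] - plain_word_en, 1)

# Split the 11 non-English languages at the 80% mark.
others = {lang: acc for lang, acc in accuracy.items() if lang != "en"}
above_80 = sorted(lang for lang, acc in others.items() if acc > 80.0)
below_80 = sorted(lang for lang, acc in others.items() if acc <= 80.0)

mean_accuracy = round(sum(accuracy.values()) / len(accuracy), 1)
```

The margins come out at 0.2 and 1.2 points, ten of the eleven non-English languages clear 80%, and the mean over all twelve languages is 82.8%.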
6. Qualitative Analysis and Engineering Impact
STF-IDF offers marked robustness to out-of-vocabulary (OOV) and morphologically complex terms, as rare or novel constructs (e.g., names, compounds, loanwords) are decomposed into known subwords, ensuring information preservation in retrieval. Subword units capture functional morphemic boundaries such as prefixes and suffixes, facilitating automatic down-weighting of non-content morphemes via their low $\mathrm{idf}$ values.
Operating within a unified vector space, the approach supports cross-lingual and code-mixed queries without language-specific configuration. Maintenance and engineering burden are materially reduced as the removal of language-specific stop lists and stemming heuristics eliminates the need for ongoing manual curation; subword model retraining is sufficient to adapt to shifting language use or novel domains.
7. Open-Source Resources and Reproducibility
Complete reference implementations, including the pre-trained 128k-token SentencePiece model, STF-IDF indexing scripts, and passage retrieval demo notebooks, are available in the Text2Text software repository (https://github.com/artitw/text2text). The package is installable via PyPI (pip install text2text). Configuration files, training arguments, and all assets required for result reproduction are provided openly (Wangperawong, 2022).