Subword TF–IDF (STF-IDF)
- STF-IDF is a variant of TF–IDF that uses subword tokenization to enable robust multilingual retrieval without language-specific preprocessing.
- It trains a universal SentencePiece/BPE subword vocabulary, allowing effective handling of diverse languages and code-mixed text.
- The methodology combines subword tokenization, weight calculation through TF–IDF (with sublinear scaling and smoothing), and cosine similarity ranking for enhanced retrieval performance.
Subword TF–IDF (STF-IDF) is a variant of the traditional TF–IDF framework that utilizes subword-level tokenization, thereby obviating the need for language-specific heuristics such as stop-word lists, stemmers, or whitespace-driven tokenization. STF-IDF is designed for robust multilingual information retrieval, where a single subword vocabulary allows seamless application across diverse languages and code-mixed text. By leveraging subword units learned via parameterized algorithms such as SentencePiece BPE, STF-IDF achieves higher retrieval accuracy than word-based methods, enabling effective search over heterogeneous corpora without manual preprocessing (Wangperawong, 2022).
1. Pipeline and Methodology
STF-IDF consists of a sequence of phases: subword tokenizer training, document tokenization, weight calculation, and similarity-driven retrieval. Starting with a large-scale multilingual corpus—specifically, Wikipedia articles in the top 100 character-based languages—a SentencePiece/BPE model is trained with vocabulary size 128,000, character coverage 0.9995, and temperature-based language sampling, the latter balancing representation for low-resource languages. The output is a universal subword vocabulary. Documents and queries are encoded into subwords via the trained encoder, bypassing language-specific preprocessing. For each document d and subword s, raw term frequencies tf(s, d), document frequencies df(s), and inverse document frequencies idf(s) = log(N / df(s)) are computed, where N is the corpus size. Vector representations are constructed using w(s, d) = tf(s, d) · idf(s), optionally applying sublinear TF scaling and IDF smoothing. Each vector is L2-normalized. Retrieval operates on cosine similarity between query and document vectors; documents are ranked according to the score cos(q, d) = v_q · v_d.
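The temperature-based corpus sampling mentioned above can be sketched as follows. The exponent `alpha` (an inverse temperature) and the article counts are illustrative assumptions, not values from the paper:

```python
def temperature_probs(sizes, alpha=0.5):
    """Flatten raw size-proportional probabilities toward uniform:
    q_l ∝ p_l ** alpha, with alpha < 1 upweighting low-resource languages."""
    total = sum(sizes.values())
    p = {lang: n / total for lang, n in sizes.items()}
    unnorm = {lang: v ** alpha for lang, v in p.items()}
    z = sum(unnorm.values())
    return {lang: v / z for lang, v in unnorm.items()}

# Hypothetical Wikipedia article counts for three languages.
sizes = {"en": 6_000_000, "de": 2_700_000, "th": 150_000}
q = temperature_probs(sizes)
# Low-resource "th" receives a larger sampling share than its raw proportion.
```

With `alpha = 1` the sampling reduces to raw size-proportional probabilities; smaller values trade some high-resource coverage for low-resource representation.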
2. Mathematical Formalism
Let V denote the subword vocabulary, and D denote the set of documents (|D| = N). The STF-IDF system is governed by the following:
- Term frequency: tf(s, d) = number of occurrences of subword s in document d,
- Document frequency: df(s) = |{d ∈ D : tf(s, d) > 0}|,
- Inverse document frequency: idf(s) = log(N / df(s)),
- TF–IDF weight: w(s, d) = tf(s, d) · idf(s),
- Sublinear TF scaling (optional): tf′(s, d) = 1 + log tf(s, d) if tf(s, d) > 0, else 0,
- L2-normalized document vector: v_d = w_d / ‖w_d‖₂, where w_d = (w(s, d))_{s ∈ V},
- Cosine similarity: cos(q, d) = v_q · v_d.
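The definitions above can be checked on a hand-sized example. The toy "subword" sequences below are illustrative, not output of a trained tokenizer:

```python
from collections import Counter
from math import log, sqrt

# Toy corpus of pre-tokenized subword sequences (N = 3 documents).
docs = {
    "d1": ["_infor", "mation", "_retrie", "val"],
    "d2": ["_infor", "mation", "_theory"],
    "d3": ["_machine", "_learn", "ing"],
}
N = len(docs)

# df(s) and idf(s) = log(N / df(s)).
df = Counter(s for toks in docs.values() for s in set(toks))
idf = {s: log(N / df[s]) for s in df}

def vectorize(tokens):
    tf = Counter(tokens)
    w = {s: f * idf.get(s, 0.0) for s, f in tf.items()}   # w(s,d) = tf · idf
    norm = sqrt(sum(x * x for x in w.values())) or 1.0    # L2 normalization
    return {s: x / norm for s, x in w.items()}

vecs = {d: vectorize(t) for d, t in docs.items()}
query = vectorize(["_infor", "mation"])
scores = {d: sum(query.get(s, 0.0) * v.get(s, 0.0) for s in query)
          for d, v in vecs.items()}
# d1 and d2 share the query subwords; d3 scores zero.
```

Note that d2 outranks d1 here: both contain the query subwords once, but d2's shorter vector gives those subwords larger normalized weight.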
3. Normalization and Weighting Strategies
STF-IDF supports several normalization and weighting schemes:
- Raw TF versus sublinear TF scaling: Sublinear scaling (1 + log tf) reduces the dominance of extremely frequent subwords, which is beneficial for corpora exhibiting high frequency skew.
- IDF smoothing: Using log(N / (1 + df(s))) in lieu of log(N / df(s)) mitigates zero-division and suppresses outlier IDF values for rare subwords.
- L2 normalization: All document and query vectors are normalized to unit length, ensuring that cosine similarity reduces to a dot product on the unit hypersphere.
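A quick numeric sketch of the first two strategies; the smoothed variant log(N / (1 + df)) is one common choice, assumed here:

```python
from math import log

# Sublinear TF scaling: a subword occurring 50 times is strongly damped.
raw_tf = [1, 5, 50]
sublinear = [1 + log(f) for f in raw_tf]   # 1 + ln(tf)
# 50 raw occurrences collapse to roughly 4.9 after scaling.

# IDF smoothing for a very rare subword (df = 1) in a 100k-document corpus.
N = 100_000
idf_plain = log(N / 1)        # ~11.5, an outlier weight
idf_smooth = log(N / (1 + 1)) # slightly damped; also safe when df = 0
```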
4. Implementation Framework and Pseudocode
The end-to-end pipeline consists of three modular stages: tokenizer training, index construction, and retrieval.
```python
import sentencepiece as spm
from collections import Counter
from math import log, sqrt

# Stage 1: train a universal subword tokenizer on the multilingual corpus.
spm.SentencePieceTrainer.Train(
    input=input_files,
    model_prefix="multi",
    vocab_size=128000,
    character_coverage=0.9995,
    model_type="bpe",
)
SP = spm.SentencePieceProcessor(model_file="multi.model")

# Stage 2: build the inverted index and L2-normalized document vectors.
postings, doc_counts, vectors = {}, {}, {}
for d in docs:
    counts = Counter(SP.encode(d.text))
    doc_counts[d.id] = counts
    for s in counts:
        postings.setdefault(s, set()).add(d.id)

N = len(docs)
idf = {s: log(N / len(postings[s])) for s in postings}

for doc_id, counts in doc_counts.items():
    v = {s: f * idf[s] for s, f in counts.items()}
    norm = sqrt(sum(w * w for w in v.values())) + 1e-12
    vectors[doc_id] = {s: w / norm for s, w in v.items()}

# Stage 3: retrieval by cosine similarity over the candidate set.
def retrieve(query, top_k=10):
    q_v = {s: f * idf[s]
           for s, f in Counter(SP.encode(query)).items() if s in idf}
    norm_q = sqrt(sum(w * w for w in q_v.values())) + 1e-12
    q_v = {s: w / norm_q for s, w in q_v.items()}
    # Only documents sharing at least one subword with the query can score > 0.
    candidates = set().union(set(), *(postings.get(s, ()) for s in q_v))
    scores = [(sum(q_v[s] * vectors[doc_id].get(s, 0.0) for s in q_v), doc_id)
              for doc_id in candidates]
    scores.sort(reverse=True)
    return scores[:top_k]
```
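The same index-and-retrieve logic can be exercised end-to-end on a toy corpus; here whitespace splitting stands in for the trained SentencePiece encoder, so this is an illustrative sketch rather than the reference implementation:

```python
from collections import Counter
from math import log, sqrt

def build_index(docs, tokenize):
    postings, counts, vectors = {}, {}, {}
    for doc_id, text in docs.items():
        c = Counter(tokenize(text))
        counts[doc_id] = c
        for s in c:
            postings.setdefault(s, set()).add(doc_id)
    N = len(docs)
    idf = {s: log(N / len(ids)) for s, ids in postings.items()}
    for doc_id, c in counts.items():
        v = {s: f * idf[s] for s, f in c.items()}
        norm = sqrt(sum(w * w for w in v.values())) + 1e-12
        vectors[doc_id] = {s: w / norm for s, w in v.items()}
    return postings, idf, vectors

def retrieve(query, tokenize, postings, idf, vectors, top_k=3):
    q = Counter(tokenize(query))
    q_v = {s: f * idf[s] for s, f in q.items() if s in idf}
    norm = sqrt(sum(w * w for w in q_v.values())) + 1e-12
    q_v = {s: w / norm for s, w in q_v.items()}
    candidates = set().union(set(), *(postings.get(s, ()) for s in q_v))
    scores = [(sum(q_v[s] * vectors[d].get(s, 0.0) for s in q_v), d)
              for d in candidates]
    return sorted(scores, reverse=True)[:top_k]

docs = {
    "d1": "subword tokenization for multilingual retrieval",
    "d2": "word level tokenization needs stop word lists",
    "d3": "neural machine translation",
}
tok = str.split  # stand-in for SP.encode
postings, idf, vectors = build_index(docs, tok)
results = retrieve("multilingual subword retrieval", tok, postings, idf, vectors)
# The top hit is d1, the only document sharing subunits with the query.
```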
5. Comparative Advantages over Classic TF–IDF
STF-IDF eliminates dependence on language-specific preprocessing, yielding notable operational and empirical advantages:
- Tokenization independence: Subwords subsume both space-delimited and agglutinative scripts, requiring no rules for whitespace or multi-script segmentation.
- Stop-word elimination: High frequency subwords automatically receive lower weight by IDF, obviating curated stop-list management.
- Morphological normalization via subword modeling: Language-specific stemmers are unnecessary; the learned subword inventory intrinsically regularizes varied inflections and enables robust handling of out-of-vocabulary forms.
- Universal vocabulary: A single subword set enables simultaneous cross-lingual and code-mixed retrieval without adaptation or re-training.
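The morphological-normalization point can be illustrated without a trained model; character trigrams below stand in for a learned subword inventory (an illustrative simplification, not the paper's tokenizer):

```python
from collections import Counter
from math import sqrt

def trigrams(word):
    """Character trigrams with boundary markers, e.g. '_re', 'ret', ..."""
    padded = f"_{word}_"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def cosine(a, b):
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[g] * cb[g] for g in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

# Inflected variants share most subunits; unrelated words share none.
sim_inflect = cosine(trigrams("retrieving"), trigrams("retrieval"))
sim_unrelated = cosine(trigrams("retrieving"), trigrams("database"))
```

A word-level TF–IDF treats "retrieving" and "retrieval" as disjoint terms unless a stemmer intervenes; shared subunits give them overlapping vector support automatically.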
6. Experimental Evaluation
STF-IDF was systematically benchmarked via the XQuAD evaluation protocol (Artetxe et al., 2019): each of 1,190 queries entails retrieving the correct paragraph from 240 parallel passages per language. Retrieval is deemed correct if the top-scoring hit contains the annotated answer.
Empirical results demonstrate:
| Configuration | Accuracy (en) | Accuracy (non-en) |
|---|---|---|
| Word-only TF-IDF | 84.2% | N/A |
| Word + stopword | 83.9% | N/A |
| Word + stemming | 84.9% | N/A |
| Word + stop + stem | 85.2% | N/A |
| STF-IDF (no heuristics) | 85.4% | 80% (10 langs) |
Specific non-English STF-IDF scores include: Spanish 85.8%, German 84.9%, Greek 81.3%, Russian 82.9%, Turkish 80.1%, Arabic 77.1%, Vietnamese 84.5%, Thai 83.5%, Chinese 82.4%, Hindi 80.9%, Romanian 85.0%. This suggests STF-IDF delivers consistent top-1 retrieval performance across typologically diverse languages using an identical model (Wangperawong, 2022).
7. Practical Implementation and Hyperparameters
STF-IDF leverages a SentencePiece BPE tokenizer with hyperparameters vocab_size=128000, character_coverage=0.9995, and temperature-based sampling for balanced multilingual modeling. Corpus construction samples languages with probabilities proportional to each language's Wikipedia size. Optimal performance is achieved with raw TF or optional sublinear scaling for highly skewed datasets, standard or smoothed IDF values for stability, and cosine similarity for retrieval. The reference implementation and reproducibility toolkit (Text2Text) are publicly available: https://github.com/artitw/text2text.
A plausible implication is that STF-IDF can be adapted to arbitrary multilingual document collections, extended by custom subword vocabulary training or weighting strategies, and integrated into cross-lingual search workflows without reliance on corpus-specific heuristics.