Subword TF–IDF (STF-IDF)
- STF-IDF is a variant of TF–IDF that uses subword tokenization to enable robust multilingual retrieval without language-specific preprocessing.
- It trains a universal SentencePiece/BPE subword vocabulary, allowing effective handling of diverse languages and code-mixed text.
- The methodology combines subword tokenization, weight calculation through TF–IDF (with sublinear scaling and smoothing), and cosine similarity ranking for enhanced retrieval performance.
Subword TF–IDF (STF-IDF) is a variant of the traditional TF–IDF framework that utilizes subword-level tokenization, thereby obviating the need for language-specific heuristics such as stop-word lists, stemmers, or whitespace-driven tokenization. STF-IDF is designed for robust multilingual information retrieval, where a single subword vocabulary allows seamless application across diverse languages and code-mixed text. By leveraging subword units learned via parameterized algorithms such as SentencePiece BPE, STF-IDF achieves higher retrieval accuracy than word-based methods, enabling effective search over heterogeneous corpora without manual preprocessing (Wangperawong, 2022).
1. Pipeline and Methodology
STF-IDF consists of a sequence of phases: subword tokenizer training, document tokenization, weight calculation, and similarity-driven retrieval. Starting with a large-scale multilingual corpus—specifically, Wikipedia articles in the top 100 character-based languages—a SentencePiece/BPE model is trained with vocabulary size 128,000, character coverage 0.9995, and temperature-based language sampling, the latter balancing representation for low-resource languages. The output is a universal subword vocabulary. Documents and queries are encoded into subwords via the trained encoder, bypassing language-specific preprocessing. For each document d and subword s, raw term frequencies tf(s, d), document frequencies df(s), and inverse document frequencies idf(s) = log(N / df(s)) are computed, where N is the corpus size. Vector representations are constructed using w(s, d) = tf(s, d) · idf(s), optionally applying sublinear TF scaling and IDF smoothing. Each vector is L2-normalized. Retrieval operates on cosine similarity between query and document vectors; documents are ranked according to the score cos(q, d) = v_q · v_d.
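The temperature-based corpus sampling mentioned above can be sketched as follows. The exponent `alpha` (an inverse temperature) and the article counts are illustrative assumptions, not values from the paper:

```python
def temperature_probs(sizes, alpha=0.5):
    """Flatten raw size-proportional probabilities toward uniform:
    q_l ∝ p_l ** alpha, with alpha < 1 upweighting low-resource languages."""
    total = sum(sizes.values())
    p = {lang: n / total for lang, n in sizes.items()}
    unnorm = {lang: v ** alpha for lang, v in p.items()}
    z = sum(unnorm.values())
    return {lang: v / z for lang, v in unnorm.items()}

# Hypothetical Wikipedia article counts for three languages.
sizes = {"en": 6_000_000, "de": 2_700_000, "th": 150_000}
q = temperature_probs(sizes)
# Low-resource "th" receives a larger sampling share than its raw proportion.
```

With `alpha = 1` the sampling reduces to raw size-proportional probabilities; smaller values trade some high-resource coverage for low-resource representation.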
2. Mathematical Formalism
Let V denote the subword vocabulary, and D denote the set of documents (|D| = N). The STF-IDF system is governed by the following:
- Term frequency: tf(s, d) = number of occurrences of subword s in document d,
- Document frequency: df(s) = |{d ∈ D : tf(s, d) > 0}|,
- Inverse document frequency: idf(s) = log(N / df(s)),
- TF–IDF weight: w(s, d) = tf(s, d) · idf(s),
- Sublinear TF scaling (optional): tf′(s, d) = 1 + log tf(s, d) if tf(s, d) > 0, else 0,
- L2-normalized document vector: v_d = w_d / ‖w_d‖₂, where w_d = (w(s, d))_{s ∈ V},
- Cosine similarity: cos(q, d) = v_q · v_d.
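The definitions above can be checked on a hand-sized example. The toy "subword" sequences below are illustrative, not output of a trained tokenizer:

```python
from collections import Counter
from math import log, sqrt

# Toy corpus of pre-tokenized subword sequences (N = 3 documents).
docs = {
    "d1": ["_infor", "mation", "_retrie", "val"],
    "d2": ["_infor", "mation", "_theory"],
    "d3": ["_machine", "_learn", "ing"],
}
N = len(docs)

# df(s) and idf(s) = log(N / df(s)).
df = Counter(s for toks in docs.values() for s in set(toks))
idf = {s: log(N / df[s]) for s in df}

def vectorize(tokens):
    tf = Counter(tokens)
    w = {s: f * idf.get(s, 0.0) for s, f in tf.items()}   # w(s,d) = tf · idf
    norm = sqrt(sum(x * x for x in w.values())) or 1.0    # L2 normalization
    return {s: x / norm for s, x in w.items()}

vecs = {d: vectorize(t) for d, t in docs.items()}
query = vectorize(["_infor", "mation"])
scores = {d: sum(query.get(s, 0.0) * v.get(s, 0.0) for s in query)
          for d, v in vecs.items()}
# d1 and d2 share the query subwords; d3 scores zero.
```

Note that d2 outranks d1 here: both contain the query subwords once, but d2's shorter vector gives those subwords larger normalized weight.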
3. Normalization and Weighting Strategies
STF-IDF supports several normalization and weighting schemes:
- Raw TF versus sublinear TF scaling: Sublinear scaling (1 + log tf) reduces the dominance of extremely frequent subwords, which is beneficial for corpora exhibiting high frequency skew.
- IDF smoothing: Using log(N / (1 + df(s))) in lieu of log(N / df(s)) mitigates zero-division and suppresses outlier IDF values for rare subwords.
- L2 normalization: All document and query vectors are normalized to unit length, ensuring that cosine similarity reduces to a dot product on the unit hypersphere.
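A quick numeric sketch of the first two strategies; the smoothed variant log(N / (1 + df)) is one common choice, assumed here:

```python
from math import log

# Sublinear TF scaling: a subword occurring 50 times is strongly damped.
raw_tf = [1, 5, 50]
sublinear = [1 + log(f) for f in raw_tf]   # 1 + ln(tf)
# 50 raw occurrences collapse to roughly 4.9 after scaling.

# IDF smoothing for a very rare subword (df = 1) in a 100k-document corpus.
N = 100_000
idf_plain = log(N / 1)        # ~11.5, an outlier weight
idf_smooth = log(N / (1 + 1)) # slightly damped; also safe when df = 0
```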
4. Implementation Framework and Pseudocode
The end-to-end pipeline consists of three modular stages: tokenizer training, index construction, and retrieval.
```python
import sentencepiece as spm
from collections import Counter
from math import log, sqrt

# Stage 1: train a universal subword tokenizer on the multilingual corpus.
spm.SentencePieceTrainer.Train(
    input=input_files,
    model_prefix="multi",
    vocab_size=128000,
    character_coverage=0.9995,
    model_type="bpe",
)
SP = spm.SentencePieceProcessor(model_file="multi.model")

# Stage 2: build the inverted index and L2-normalized document vectors.
postings, doc_counts, vectors = {}, {}, {}
for d in docs:
    counts = Counter(SP.encode(d.text))
    doc_counts[d.id] = counts
    for s in counts:
        postings.setdefault(s, set()).add(d.id)

N = len(docs)
idf = {s: log(N / len(postings[s])) for s in postings}

for doc_id, counts in doc_counts.items():
    v = {s: f * idf[s] for s, f in counts.items()}
    norm = sqrt(sum(w * w for w in v.values())) + 1e-12
    vectors[doc_id] = {s: w / norm for s, w in v.items()}

# Stage 3: retrieval by cosine similarity over the candidate set.
def retrieve(query, top_k=10):
    q_v = {s: f * idf[s]
           for s, f in Counter(SP.encode(query)).items() if s in idf}
    norm_q = sqrt(sum(w * w for w in q_v.values())) + 1e-12
    q_v = {s: w / norm_q for s, w in q_v.items()}
    # Only documents sharing at least one subword with the query can score > 0.
    candidates = set().union(set(), *(postings.get(s, ()) for s in q_v))
    scores = [(sum(q_v[s] * vectors[doc_id].get(s, 0.0) for s in q_v), doc_id)
              for doc_id in candidates]
    scores.sort(reverse=True)
    return scores[:top_k]
```
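The same index-and-retrieve logic can be exercised end-to-end on a toy corpus; here whitespace splitting stands in for the trained SentencePiece encoder, so this is an illustrative sketch rather than the reference implementation:

```python
from collections import Counter
from math import log, sqrt

def build_index(docs, tokenize):
    postings, counts, vectors = {}, {}, {}
    for doc_id, text in docs.items():
        c = Counter(tokenize(text))
        counts[doc_id] = c
        for s in c:
            postings.setdefault(s, set()).add(doc_id)
    N = len(docs)
    idf = {s: log(N / len(ids)) for s, ids in postings.items()}
    for doc_id, c in counts.items():
        v = {s: f * idf[s] for s, f in c.items()}
        norm = sqrt(sum(w * w for w in v.values())) + 1e-12
        vectors[doc_id] = {s: w / norm for s, w in v.items()}
    return postings, idf, vectors

def retrieve(query, tokenize, postings, idf, vectors, top_k=3):
    q = Counter(tokenize(query))
    q_v = {s: f * idf[s] for s, f in q.items() if s in idf}
    norm = sqrt(sum(w * w for w in q_v.values())) + 1e-12
    q_v = {s: w / norm for s, w in q_v.items()}
    candidates = set().union(set(), *(postings.get(s, ()) for s in q_v))
    scores = [(sum(q_v[s] * vectors[d].get(s, 0.0) for s in q_v), d)
              for d in candidates]
    return sorted(scores, reverse=True)[:top_k]

docs = {
    "d1": "subword tokenization for multilingual retrieval",
    "d2": "word level tokenization needs stop word lists",
    "d3": "neural machine translation",
}
tok = str.split  # stand-in for SP.encode
postings, idf, vectors = build_index(docs, tok)
results = retrieve("multilingual subword retrieval", tok, postings, idf, vectors)
# The top hit is d1, the only document sharing subunits with the query.
```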
5. Comparative Advantages over Classic TF–IDF
STF-IDF eliminates dependence on language-specific preprocessing, yielding notable operational and empirical advantages:
- Tokenization independence: Subwords subsume both space-delimited and agglutinative scripts, requiring no rules for whitespace or multi-script segmentation.
- Stop-word elimination: High frequency subwords automatically receive lower weight by IDF, obviating curated stop-list management.
- Morphological normalization via subword modeling: Language-specific stemmers are unnecessary; the learned subword inventory intrinsically regularizes varied inflections and enables robust handling of out-of-vocabulary forms.
- Universal vocabulary: A single subword set enables simultaneous cross-lingual and code-mixed retrieval without adaptation or re-training.
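The morphological-normalization point can be illustrated without a trained model; character trigrams below stand in for a learned subword inventory (an illustrative simplification, not the paper's tokenizer):

```python
from collections import Counter
from math import sqrt

def trigrams(word):
    """Character trigrams with boundary markers, e.g. '_re', 'ret', ..."""
    padded = f"_{word}_"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def cosine(a, b):
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[g] * cb[g] for g in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

# Inflected variants share most subunits; unrelated words share none.
sim_inflect = cosine(trigrams("retrieving"), trigrams("retrieval"))
sim_unrelated = cosine(trigrams("retrieving"), trigrams("database"))
```

A word-level TF–IDF treats "retrieving" and "retrieval" as disjoint terms unless a stemmer intervenes; shared subunits give them overlapping vector support automatically.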
6. Experimental Evaluation
STF-IDF was systematically benchmarked via the XQuAD evaluation protocol (Artetxe et al., 2019): each of 1,190 queries entails retrieving the correct paragraph from 240 parallel passages per language. Retrieval is deemed correct if the top-scoring hit contains the annotated answer.
Empirical results demonstrate:
| Configuration | Accuracy (en) | Accuracy (non-en) |
|---|---|---|
| Word-only TF-IDF | 84.2% | N/A |
| Word + stopword | 83.9% | N/A |
| Word + stemming | 84.9% | N/A |
| Word + stop + stem | 85.2% | N/A |
| STF-IDF (no heuristics) | 85.4% | 80% (10 langs) |
Specific non-English STF-IDF scores include: Spanish 85.8%, German 84.9%, Greek 81.3%, Russian 82.9%, Turkish 80.1%, Arabic 77.1%, Vietnamese 84.5%, Thai 83.5%, Chinese 82.4%, Hindi 80.9%, Romanian 85.0%. This suggests STF-IDF delivers consistent top-1 retrieval performance across typologically diverse languages using an identical model (Wangperawong, 2022).
7. Practical Implementation and Hyperparameters
STF-IDF leverages a SentencePiece BPE tokenizer with hyperparameters vocab_size=128000, character_coverage=0.9995, and temperature-based sampling for balanced multilingual modeling. Corpus construction samples languages with probabilities proportional to each language's Wikipedia size. Optimal performance is achieved with raw TF or optional sublinear scaling for highly skewed datasets, standard or smoothed IDF values for stability, and cosine similarity for retrieval. The reference implementation and reproducibility toolkit (Text2Text) are publicly available: https://github.com/artitw/text2text.
A plausible implication is that STF-IDF can be adapted to arbitrary multilingual document collections, extended by custom subword vocabulary training or weighting strategies, and integrated into cross-lingual search workflows without reliance on corpus-specific heuristics.