Subword TF–IDF (STF-IDF)

Updated 16 January 2026
  • STF-IDF is a variant of TF–IDF that uses subword tokenization to enable robust multilingual retrieval without language-specific preprocessing.
  • It trains a universal SentencePiece/BPE subword vocabulary, allowing effective handling of diverse languages and code-mixed text.
  • The methodology combines subword tokenization, weight calculation through TF–IDF (with sublinear scaling and smoothing), and cosine similarity ranking for enhanced retrieval performance.

Subword TF–IDF (STF-IDF) is a variant of the traditional TF–IDF framework that utilizes subword-level tokenization, thereby obviating the need for language-specific heuristics such as stop-word lists, stemmers, or whitespace-driven tokenization. STF-IDF is designed for robust multilingual information retrieval, where a single subword vocabulary allows seamless application across diverse languages and code-mixed text. By leveraging subword units learned via parameterized algorithms such as SentencePiece BPE, STF-IDF achieves higher retrieval accuracy than word-based methods, enabling effective search over heterogeneous corpora without manual preprocessing (Wangperawong, 2022).

1. Pipeline and Methodology

STF-IDF consists of a sequence of phases: subword tokenizer training, document tokenization, weight calculation, and similarity-driven retrieval. Starting with a large-scale multilingual corpus (specifically, Wikipedia articles in the top 100 character-based languages), a SentencePiece/BPE model is trained with vocabulary size V = 128,000, character coverage 0.9995, and temperature sampling T = 5, the latter balancing representation for low-resource languages. The output is a universal subword vocabulary. Documents and queries are encoded into subwords via the trained encoder, bypassing language-specific preprocessing. For each document d and subword s, raw term frequencies tf_{s,d}, document frequencies df_s, and inverse document frequencies idf_s = log(N / df_s) are computed, where N is the corpus size. Vector representations are constructed using tfidf_{s,d} = tf_{s,d} × idf_s, optionally applying sublinear TF scaling and IDF smoothing. Each vector is L2-normalized. Retrieval operates on cosine similarity between query and document vectors; documents are ranked according to the score cosine(v_q, v_d) = v_q · v_d.

2. Mathematical Formalism

Let S denote the subword vocabulary and D the set of documents, with |D| = N. The STF-IDF system is governed by the following:

  • Term frequency: tf_{s,d} = number of occurrences of s in d.
  • Document frequency: df_s = |{d ∈ D : tf_{s,d} > 0}|.
  • Inverse document frequency: idf_s = log(N / df_s).
  • TF–IDF weight: tfidf_{s,d} = tf_{s,d} × idf_s.
  • Sublinear TF scaling (optional): tf'_{s,d} = 1 + log(tf_{s,d}) if tf_{s,d} > 0, and 0 otherwise, with tfidf_{s,d} = tf'_{s,d} × idf_s.
  • L2-normalized document vector: v̂_d = v_d / ||v_d||_2.
  • Cosine similarity: sim(q, d) = v̂_q · v̂_d = Σ_{s ∈ S} v̂_q[s] v̂_d[s].

3. Normalization and Weighting Strategies

STF-IDF supports several normalization and weighting schemes:

  • Raw TF versus sublinear TF scaling: sublinear scaling (tf'_{s,d} = 1 + log(tf_{s,d})) reduces the dominance of extremely frequent subwords, which is beneficial for corpora exhibiting high frequency skew.
  • IDF smoothing: using log(1 + N/df_s) in lieu of log(N/df_s) compresses extreme IDF values for rare subwords and keeps weights positive even for subwords present in every document (df_s = N yields log 2 rather than 0).
  • L2 normalization: all document and query vectors are normalized to unit length, ensuring that cosine similarity reduces to a dot product on the unit hypersphere.
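The two optional schemes amount to a pair of one-line transforms. A minimal sketch (function names are ours, not from the reference implementation):

```python
from math import log

def sublinear_tf(f):
    """tf' = 1 + log(tf) for tf > 0, else 0: damps very frequent subwords."""
    return 1.0 + log(f) if f > 0 else 0.0

def idf(N, df, smooth=False):
    """Standard idf = log(N/df); the smoothed variant uses log(1 + N/df)."""
    return log(1.0 + N / df) if smooth else log(N / df)
```

With sublinear scaling, a subword occurring 1,000 times in a document contributes a weight factor of about 7.9 rather than 1,000; with smoothing, a subword appearing in all N documents keeps a small positive weight instead of vanishing.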

4. Implementation Framework and Pseudocode

The end-to-end pipeline consists of three modular stages: tokenizer training, index construction, and retrieval.

import sentencepiece as spm
from collections import Counter
from math import log, sqrt

# Stage 1: train the universal subword tokenizer.
spm.SentencePieceTrainer.Train(
    input=input_files,
    model_prefix="multi",
    vocab_size=128000,
    character_coverage=0.9995,
    model_type="bpe",
)
SP = spm.SentencePieceProcessor(model_file="multi.model")

# Stage 2: index construction — subword counts, postings lists, IDF, vectors.
N = len(docs)
counts_by_doc, postings = {}, {}
for d in docs:
    counts = Counter(SP.encode(d.text))
    counts_by_doc[d.id] = counts
    for s in counts:
        postings.setdefault(s, set()).add(d.id)
idf = {s: log(N / len(postings[s])) for s in postings}

vectors = {}
for d in docs:
    v = {s: f * idf[s] for s, f in counts_by_doc[d.id].items()}
    norm = sqrt(sum(w * w for w in v.values())) + 1e-12  # guard empty documents
    vectors[d.id] = {s: w / norm for s, w in v.items()}

# Stage 3: retrieval — encode the query, weight, normalize, rank by cosine.
q_counts = Counter(SP.encode(query))
q_v = {s: f * idf[s] for s, f in q_counts.items() if s in idf}
norm_q = sqrt(sum(w * w for w in q_v.values())) + 1e-12
q_v = {s: w / norm_q for s, w in q_v.items()}

# Only documents sharing at least one query subword can score above zero.
candidates = set().union(*(postings.get(s, set()) for s in q_v))
scores = []
for doc_id in candidates:
    d_v = vectors[doc_id]
    scores.append((sum(w * d_v.get(s, 0.0) for s, w in q_v.items()), doc_id))
scores.sort(reverse=True)
top_results = scores[:top_k]

5. Comparative Advantages over Classic TF–IDF

STF-IDF eliminates dependence on language-specific preprocessing, yielding notable operational and empirical advantages:

  • Tokenization independence: Subwords subsume both space-delimited and agglutinative scripts, requiring no rules for whitespace or multi-script segmentation.
  • Stop-word elimination: High frequency subwords automatically receive lower weight by IDF, obviating curated stop-list management.
  • Morphological normalization via subword modeling: Language-specific stemmers are unnecessary; the learned subword inventory intrinsically regularizes varied inflections and enables robust handling of out-of-vocabulary forms.
  • Universal vocabulary: A single subword set enables simultaneous cross-lingual and code-mixed retrieval without adaptation or re-training.

6. Experimental Evaluation

STF-IDF was systematically benchmarked via the XQuAD evaluation protocol (Artetxe et al., 2019): each of 1,190 queries must retrieve the correct paragraph from among 240 parallel paragraphs in each language. Retrieval is deemed correct if the top-scoring paragraph contains the annotated answer.

Empirical results demonstrate:

Configuration             Accuracy (en)   Accuracy (non-en)
Word-only TF–IDF          84.2%           N/A
Word + stopword           83.9%           N/A
Word + stemming           84.9%           N/A
Word + stop + stem        85.2%           N/A
STF-IDF (no heuristics)   85.4%           ≥ 80% (10 langs)

Specific non-English STF-IDF scores include: Spanish 85.8%, German 84.9%, Greek 81.3%, Russian 82.9%, Turkish 80.1%, Arabic 77.1%, Vietnamese 84.5%, Thai 83.5%, Chinese 82.4%, Hindi 80.9%, Romanian 85.0%. This suggests STF-IDF delivers consistent top-1 retrieval performance across typologically diverse languages using an identical model (Wangperawong, 2022).
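The top-1 criterion of this protocol can be sketched in a few lines. The retrieve callable and the word-overlap toy retriever below are hypothetical placeholders, not the published evaluation harness:

```python
def top1_accuracy(queries, retrieve):
    """Fraction of queries whose top-ranked paragraph contains the answer.

    `queries` is a list of (query_text, answer_span) pairs; `retrieve`
    maps a query to a ranked list of paragraph strings (best first).
    """
    hits = 0
    for query, answer in queries:
        ranked = retrieve(query)
        if ranked and answer in ranked[0]:
            hits += 1
    return hits / len(queries) if queries else 0.0

# Toy check: a trivial retriever that ranks paragraphs by word overlap
# with the query (punctuation stripped before comparison).
paragraphs = [
    "Berlin is the capital of Germany.",
    "Paris is the capital of France.",
]

def overlap_retrieve(query):
    q_words = set(query.lower().split())
    score = lambda p: len(q_words & set(p.lower().replace(".", "").split()))
    return sorted(paragraphs, key=score, reverse=True)

acc = top1_accuracy(
    [("capital of France", "Paris"), ("capital of Germany", "Berlin")],
    overlap_retrieve,
)
```

In the XQuAD setting, `retrieve` would be the STF-IDF ranker from Section 4 applied to the 240 paragraphs of the relevant language.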

7. Practical Implementation and Hyperparameters

STF-IDF leverages a SentencePiece BPE tokenizer with hyperparameters vocab_size = 128,000, character_coverage = 0.9995, and temperature sampling T = 5 for balanced multilingual modeling. Corpus construction samples languages with probabilities p_l ∝ D_l^{1/T}, where D_l is the size of language l's Wikipedia. Optimal performance is achieved with raw TF or optional sublinear scaling for highly skewed datasets, standard IDF or smoothed values for stability, and cosine similarity for retrieval. The reference implementation and reproducibility toolkit (Text2Text) are publicly available: https://github.com/artitw/text2text.
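The temperature-sampling rule p_l ∝ D_l^{1/T} can be sketched directly; the corpus sizes below are made up for illustration:

```python
def sampling_probs(sizes, T=5.0):
    """Language sampling probabilities p_l ∝ D_l^(1/T).

    `sizes` maps language -> corpus size D_l. Higher T flattens the
    distribution toward uniform, boosting low-resource languages.
    """
    weights = {l: d ** (1.0 / T) for l, d in sizes.items()}
    total = sum(weights.values())
    return {l: w / total for l, w in weights.items()}

# Hypothetical Wikipedia sizes: with T = 5, a corpus 100x larger is
# sampled only 100^(1/5) ≈ 2.5x as often.
probs = sampling_probs({"en": 1_000_000, "sw": 10_000}, T=5.0)
```

With T = 1 the sampling would simply mirror raw corpus sizes; T = 5 is the setting reported for STF-IDF's tokenizer training corpus.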

A plausible implication is that STF-IDF can be adapted to arbitrary multilingual document collections, extended by custom subword vocabulary training or weighting strategies, and integrated into cross-lingual search workflows without reliance on corpus-specific heuristics.
