
Text-to-Vec: Token-Level Contextual Embedding

Updated 14 November 2025
  • Text-to-Vec is a method that converts text into an interpretable vector by computing token-level local n-gram perplexities.
  • It aggregates n-gram probabilities from autoregressive transformers to highlight model uncertainty and detect errors.
  • Empirical evaluations show significant improvements in local error detection compared to scalar perplexity, with robust metrics across tasks.

A Text-to-Vec module produces a vectorial representation of a text input (sentence, document, or token sequence), encoding properties such as local or global contextual probability, semantics, or structure in an embedding suitable for downstream tasks. In "Text vectorization via transformer-based LLMs and n-gram perplexities," Škorić (2023) proposes a method for contextual text vectorization that departs from traditional scalar perplexity by generating an "N-dimensional perplexity vector" tied to local n-gram surprisal as scored by an autoregressive transformer LLM.

1. Algorithmic Framework: Local N-gram Perplexity Vector

The core algorithm computes a per-token vector of local perplexities by aggregating n-gram window probabilities as estimated by a transformer.

  • Tokenization: The input text of $N$ tokens is segmented as $w_1, w_2, \dots, w_N$. In the principal experiments, the tokenizer yields word-level tokens; subword tokenizers are permissible.
  • N-gram Extraction: From $w_1, \dots, w_N$, extract all contiguous n-grams $t_i = (w_i, w_{i+1}, \dots, w_{i+n-1})$ for $i$ from $1$ to $N - n + 1$.
  • Probabilistic Scoring: Each $t_i$ is evaluated via next-token prediction of a pre-trained transformer LM, computing the joint probability

$$p(t_i) = \prod_{j=0}^{n-1} P(w_{i+j} \mid w_1, \dots, w_{i+j-1})$$

  • N-gram Perplexity: The n-gram perplexity is defined as

$$PP(t_i) = p(t_i)^{-1/n} = \exp\left(-\frac{1}{n} \sum_{j=0}^{n-1} \log P(w_{i+j} \mid w_1, \dots, w_{i+j-1})\right)$$
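The two forms of the n-gram perplexity above are equivalent, which a quick numeric check makes concrete. The conditional probabilities below are made up for illustration and not taken from the paper:

```python
import math

# Hypothetical next-token probabilities for one trigram window t_i,
# i.e. P(w_{i+j} | w_1, ..., w_{i+j-1}) for j = 0, 1, 2
probs = [0.5, 0.2, 0.1]
n = len(probs)

# Joint probability p(t_i): product of the conditionals
p = math.prod(probs)

# PP(t_i) via the power form and via the log-sum form
pp_power = p ** (-1.0 / n)
pp_log = math.exp(-sum(math.log(q) for q in probs) / n)

assert abs(pp_power - pp_log) < 1e-12
print(round(pp_power, 4))  # → 4.6416
```

Here $p(t_i) = 0.01$, so $PP(t_i) = 0.01^{-1/3} = 10^{2/3} \approx 4.64$.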

  • Token-level Local Perplexity: For each position $k$, aggregate the perplexities of all n-grams that include $w_k$. Let $S_k = \{\, i : i \le k \le i + n - 1,\ 1 \le i \le N - n + 1 \,\}$ index the covering windows; then

$$LPP(k) = \frac{1}{|S_k|} \sum_{i \in S_k} PP(t_i)$$

  • Vector Assembly: The final embedding is $\mathbf{v} = (LPP(1), LPP(2), \dots, LPP(N)) \in \mathbb{R}^N$. Optionally, $\mathbf{v}$ may be standardized for "relative perplexity," but this is not part of the baseline.
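The whole pipeline, from n-gram scoring to per-token aggregation, can be sketched in a few lines. The `next_token_prob` callable below is a stand-in for the LM softmax (in practice it would be backed by an autoregressive transformer); the toy uniform model at the end is purely illustrative:

```python
import math

def text_to_vec(tokens, next_token_prob, n=3):
    """Per-token local perplexity vector (a sketch of the method).

    next_token_prob(context, token) -> P(token | context); assumed to come
    from an autoregressive LM, but any callable works for testing.
    """
    N = len(tokens)
    # PP(t_i) for every contiguous n-gram window, stride 1
    pps = []
    for i in range(N - n + 1):
        log_p = sum(
            math.log(next_token_prob(tokens[:i + j], tokens[i + j]))
            for j in range(n)
        )
        pps.append(math.exp(-log_p / n))
    # Aggregate: mean PP over all windows covering position k
    vec = []
    for k in range(N):
        covering = [pps[i] for i in range(len(pps)) if i <= k <= i + n - 1]
        vec.append(sum(covering) / len(covering))
    return vec

# Toy uniform model over a 10-word vocabulary: every window has PP = 10
toy = lambda context, token: 0.1
v = text_to_vec("when in rome do as the romans do".split(), toy, n=3)
print([round(x, 6) for x in v])  # all entries ≈ 10.0
```

Under the uniform toy model every conditional is 0.1, so each window perplexity and hence each aggregated entry equals 10, confirming that the aggregation preserves a constant signal.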

This process returns an interpretable vector that highlights localized model uncertainty, sensitive to rare words or improbable sequences, as opposed to a single scalar summary.

2. Mathematical Formulation

All key statistics are computed strictly as described in the main text:

Quantity | Expression | Comment
Joint Probability | $p(t_i) = \prod_{j=0}^{n-1} P(w_{i+j} \mid w_1, \dots, w_{i+j-1})$ | Over n-gram $t_i$
N-gram Perplexity | $PP(t_i) = p(t_i)^{-1/n}$ | Local surprisal measure
Token-wise Index Set | $S_k = \{\, i : i \le k \le i + n - 1 \,\}$ | All n-grams covering $w_k$
Local Perplexity | $LPP(k) = \frac{1}{|S_k|} \sum_{i \in S_k} PP(t_i)$ | Centered at token $w_k$
Final Vector | $\mathbf{v} = (LPP(1), \dots, LPP(N))$ | $N$-dimensional output

This explicit per-token aggregation preserves distributional detail that is discarded in classical scalar perplexity.

3. Architectural and Hyperparameter Choices

  • Transformer Model: Any autoregressive transformer LM. Examples are GPT-2 and (for the Serbian evaluation) a GPT-2 variant trained on a Serbian corpus. Only the final softmax probabilities are required.
  • Sliding Window Size $n$: The worked example uses a small window ($n = 3$); for the empirical tasks, $n$ is chosen to yield a reasonable number of windows given typical sentence length $N$.
  • Stride: Always 1 (fully overlapping windows).
  • Normalization: Optional (subtract mean, divide by standard deviation); not included by default.
  • Layer Output: The method discards hidden activations, using only the probability estimates.

The stride-1 overlap of the n-gram windows ensures that every token is covered by at least one window, although tokens near the boundaries participate in fewer than $n$ n-grams.

4. Worked Example

Consider the input "When in Rome, do as the Romans do." ($N = 10$ word-level tokens, $n = 3$):

Token ($k$) | Windows covering $w_k$ | $LPP(k)$ computation
1 (When) | $t_1$ | $PP(t_1)$
2 (in) | $t_1, t_2$ | $\frac{1}{2}\,(PP(t_1) + PP(t_2))$
3 (Rome) | $t_1, t_2, t_3$ | $\frac{1}{3}\,(PP(t_1) + PP(t_2) + PP(t_3))$
... | ... | ...
10 (.) | $t_8$ | $PP(t_8)$

This yields $\mathbf{v} \in \mathbb{R}^{10}$. A high $LPP(k)$ (e.g., at token 2) indicates a local probability dip, often due to a modeling anomaly or typo.
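The coverage pattern in the worked example can be verified mechanically, assuming $n = 3$ and counting, for each position $k$, the windows $t_i$ with $i \le k \le i + n - 1$:

```python
# Number of trigram windows t_i (i = 1..N-n+1) covering each token position k
N, n = 10, 3
coverage = [
    sum(1 for i in range(1, N - n + 2) if i <= k <= i + n - 1)
    for k in range(1, N + 1)
]
print(coverage)  # → [1, 2, 3, 3, 3, 3, 3, 3, 2, 1]
```

Interior tokens are covered by exactly $n$ windows, while the first and last tokens appear in only one window each, matching the table above.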

5. Empirical Evaluation and Use Cases

The method's diagnostic value is demonstrated through three error-detection tasks (removal, insertion, or replacement of a word) on expert-translated Serbian sentences, each altered at one position. The evaluation protocol is as follows:

  • Task: Predict the error index by selecting the token with maximum local perplexity, $\hat{k} = \arg\max_k LPP(k)$.
  • Metrics: Accuracy (correct position identified), weighted accuracy (adjusted for sentence length), and comparison to a random baseline.
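The selection rule in the task bullet reduces to an argmax over the local perplexity vector. A minimal sketch, with a made-up vector containing an artificial spike at position 4:

```python
def predict_error_index(local_pp):
    """Predict the edited position as argmax_k LPP(k), returned 1-based."""
    return max(range(len(local_pp)), key=local_pp.__getitem__) + 1

# Hypothetical local-perplexity vector; the spike marks the suspected edit
lpp = [12.1, 15.0, 13.4, 61.7, 14.2, 11.9]
print(predict_error_index(lpp))  # → 4
```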

Results (Table 4, paper):

Task | Random | Text-to-Vec
Removal | 5.80 % | 10.37 %
Insertion | 3.12 % | 17.26 %
Replacement | 2.02 % | 18.56 %

Weighted accuracy improves by a factor of 3–8 over chance. A high Pearson correlation between accuracy and weighted accuracy confirms robustness across sentence lengths. Notably, scalar perplexity does not provide this localization power, and no direct comparison to BERT embeddings or global perplexity was reported.

6. Implementation, Complexity, and Limitations

  • Computational Cost: For a sentence of $N$ tokens and window size $n$, $N - n + 1$ LM forward passes are required, each over a prefix of at most $N$ tokens. For small $n$ and typical text lengths, this cost is dominated by LM inference; parallel processing of windows is practical.
  • Limitations: The Text-to-Vec method is inherently tied to the LM's probabilistic calibration. Mis-calibrated transformer LMs (or domain mismatch) will affect the vector's interpretability. The vector's dimensionality scales with input length, which may pose issues for downstream models expecting fixed-size vectors.
  • Deployment Considerations: Efficient inference requires batched n-gram scoring, and optional vector normalization for applications demanding scale invariance. Adopters should select $n$ to balance localization against window sparsity.
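The optional normalization mentioned above (subtract mean, divide by standard deviation) is a one-liner; the input vector here is illustrative, not from the paper:

```python
import statistics

def standardize(vec):
    """Relative perplexity: zero-mean, unit-variance rescaling of the LPP vector."""
    mu = statistics.fmean(vec)
    sigma = statistics.pstdev(vec)
    return [(x - mu) / sigma for x in vec]

# A hypothetical LPP vector with one outlier token
z = standardize([10.0, 12.0, 50.0, 11.0, 12.0])
print([round(v, 3) for v in z])
```

After standardization the outlier stands out as the only strongly positive entry, which is what scale-invariant downstream applications rely on.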

7. Applications and Potential Extensions

The primary utility is in local error detection—identifying marginalized tokens within high-probability contexts. Beyond this, plausible extensions include:

  • Fine-grained quality assessment in translation, ASR, or OCR—flagging outlier tokens.
  • Surprisal pattern-based similarity search—retrieving sentences with similar distributions of local modeling "surprise."
  • Integration of the local perplexity vector $\mathbf{v}$ into error-detection classifiers or explanation systems, leveraging per-token perplexities as features.
  • Open areas: comparison with dense sentence embeddings, calibration on out-of-distribution text, and use in languages without robust LM support.

As an algorithmic primitive, the Text-to-Vec module offers practitioners an explicit, interpretable embedding of token-level LLM uncertainty—distinct from both scalar perplexity and black-box dense embedding methods—enabling novel downstream analytics and diagnostics in transformer-based NLP systems (Škorić, 2023).

References

  • Škorić (2023). "Text vectorization via transformer-based LLMs and n-gram perplexities."
