
Translation-Aware Contamination Detection

Updated 28 January 2026
  • Translation-Aware Contamination Detection is a class of methods that accounts for multilingual data leakage by analyzing translations of benchmark items.
  • It employs metrics like Cross-Lingual Consistency (CLC) and Index Recall Rate (IDR) to identify hidden contamination that inflates model performance.
  • Empirical results demonstrate high detection power (AUC > 0.9) and reveal language-dependent effects, guiding improvements in LLM evaluation.

Translation-Aware Contamination Detection refers to a class of contamination detection methodologies that explicitly account for the possibility of benchmark data leakage through translations, rather than only considering surface-level overlap in the source language. This is of critical importance in multilingual LLM evaluation, as translations of benchmarks can enter model pre-training corpora, thereby artificially inflating performance on the original task in the source language, while escaping traditional detection techniques focused on monolingual or surface-form overlap. Recent studies have demonstrated both the existence of cross-lingual contamination effects and the necessity of translation-aware detection protocols to ensure rigorous, fair, and reproducible assessment of LLM generalization (Yao et al., 2024, Abbas et al., 21 Jan 2026).

1. Foundations: Cross-Lingual Contamination and Its Formalization

Data contamination occurs when evaluation items or close variants are present in a model’s training data, thereby undermining claims of generalization. Traditional detection focuses on direct overlap, e.g., n-gram or token matching. In the translation-aware context, the contamination indicator is revised to account for translations:

\mathrm{cont}(x) = 1\left[\exists\,\ell\in\mathcal{L} : T_\ell(x) \in \mathcal{S}_{\mathrm{train}}\right]

Here, x is a benchmark instance, T_\ell(x) its translation into language \ell, \mathcal{L} the set of languages considered, and \mathcal{S}_{\mathrm{train}} the (possibly opaque) training corpus of the model (Abbas et al., 21 Jan 2026). Under this threat model, "cross-lingual contamination" is said to occur when a model is exposed to a translated version of the evaluation data, but not the original, allowing memorization of answers without direct English overlap (Yao et al., 2024). Empirical injection of such contamination, for instance by fine-tuning LLMs on translated test sets in French or Arabic, results in measurable performance gains on the original English benchmarks despite the model never seeing the English form (Abbas et al., 21 Jan 2026).
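The contamination indicator cont(x) can be sketched directly. In the sketch below, the `translate` helper, the `normalize` step, and the toy training corpus are illustrative stand-ins: a real check would run an actual MT system against the model's pre-training corpus, which is typically not available in full.

```python
# Sketch of the translation-aware contamination indicator cont(x),
# assuming hypothetical helpers `translate`/`normalize` and a training
# corpus represented as a set of normalized strings.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a crude membership check."""
    return " ".join(text.lower().split())

def translate(text: str, lang: str) -> str:
    """Placeholder for a real MT system; identity outside the toy lookup."""
    toy_mt = {("the sky is blue", "fr"): "le ciel est bleu"}
    return toy_mt.get((text, lang), text)

def cont(x: str, languages: list[str], train_corpus: set[str]) -> int:
    """cont(x) = 1 iff some translation T_l(x) appears in S_train."""
    return int(any(normalize(translate(x, lang)) in train_corpus
                   for lang in languages))

train = {normalize("Le ciel est bleu")}
print(cont("the sky is blue", ["en", "fr"], train))  # -> 1
print(cont("grass is green", ["en", "fr"], train))   # -> 0
```

The English form never appears in the corpus, yet cont(x) = 1 via the French translation, which is exactly the leakage path that monolingual overlap checks miss.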

2. Failure Modes of Traditional Contamination Detection

Conventional detection techniques are based on surface similarity between evaluation and training (n-gram overlap, shared-likelihood, semantic matching). For cross-lingual contamination, these probes are fundamentally inadequate:

  • N-Gram Overlap: When contamination occurs via the translated test set, n-gram recall between training and evaluation data in the source language remains at "clean-model" baseline levels, even as performance is spuriously inflated (Yao et al., 2024).
  • Likelihood-Based Permutation Tests: Shared-likelihood p-values for models contaminated on translated benchmarks show no significant reduction, failing to flag contamination (Yao et al., 2024).
  • Behavioral and Distributional Probes: Methods such as Tested Slot Guessing (TS-Guessing), Index Recall Rate (IDR), and Min-K% probability analysis exhibit dramatic suppression of signal when the model is contaminated via translation. For instance, on Mistral-7B, IDR in English collapses to baseline even with 100% contamination through Arabic, and Min-K% AUROC values indicate near-chance detection (Abbas et al., 21 Jan 2026).

These failures arise because translated contamination evades surface-form matching and disrupts positional or probabilistic cues commonly leveraged in monolingual contamination detection.

3. Translation-Aware Contamination Detection Methodologies

Translation-Aware Contamination Detection (TACD) seeks to recover hidden contamination signals by leveraging multilingual evaluation and cross-variant perturbations:

  • Evaluation Views: For each instance x_j, TACD generates translations x_j^\ell = T_\ell(x_j) for all \ell \in \mathcal{L} (e.g., EN, AR, FR). For every language, a random permutation of the answer choices is applied, yielding perturbed views \tilde{x}_j^\ell (Abbas et al., 21 Jan 2026).
  • Model Querying and Signal Extraction:
    • IDR (Index Recall Rate): Proportion of views in which the model correctly recalls the original answer’s index under randomized choice orderings and across all languages. Under pure generalization, this value should be near uniform guessing, 1/K for K-way multiple choice.
    • CLC (Cross-Lingual Consistency): The fraction of examples where predictions agree across all tested languages. For L languages and K answer choices, the baseline is 1/K^{L-1} under independent guessing.
  • Statistical Thresholding: Contamination is flagged if either signal significantly exceeds its baseline, as determined by binomial tests or thresholds calibrated on clean data.
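The thresholding step can be sketched with a one-sided binomial tail test. The baselines follow the text (1/K for IDR, 1/K^{L-1} for CLC); the significance level `alpha`, the helper names, and the exact decision rule are illustrative assumptions rather than the paper's specification.

```python
# Sketch of the statistical thresholding step: compare an observed IDR or
# CLC hit count against its chance baseline with a one-sided binomial test.
from math import comb

def binom_sf(k: int, n: int, p: float) -> float:
    """P[X >= k] for X ~ Binomial(n, p), summed exactly over the tail."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def flag_contamination(hits: int, n: int, K: int, L: int = 1,
                       metric: str = "idr", alpha: float = 0.01) -> bool:
    """Flag if the signal significantly exceeds its independence baseline:
    1/K for IDR, 1/K^(L-1) for CLC."""
    baseline = 1 / K if metric == "idr" else 1 / K ** (L - 1)
    return binom_sf(hits, n, baseline) < alpha

# 4-way MCQ over 3 languages: chance CLC baseline is 1/16 = 0.0625,
# so 120/200 cross-lingually consistent items is wildly significant,
# while 15/200 is within noise of the expected 12.5.
print(flag_contamination(120, 200, K=4, L=3, metric="clc"))  # True
print(flag_contamination(15, 200, K=4, L=3, metric="clc"))   # False
```

Calibrating thresholds on known-clean models, as the text mentions, would replace the fixed `alpha` with an empirical cutoff.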

The TACD algorithm can be summarized as follows:

  1. For each evaluation item and each of L languages, compute a translation and randomize answer order.
  2. Query the model on each variant, recording both the predicted index per view and prediction agreement across views.
  3. Aggregate IDR and CLC scores; flag contamination if these diverge from expectation under independence (Abbas et al., 21 Jan 2026).
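The aggregation in steps 2–3 can be sketched as follows. The prediction dictionary and gold-index bookkeeping are illustrative stand-ins for actual model queries; real use would record the model's predicted answer index for every (item, language) view.

```python
# Sketch of IDR/CLC aggregation over per-view model predictions.
# preds[(item, lang)] -> predicted answer index for that view
# gold[item]          -> the answer's index in the ORIGINAL ordering

def idr(preds: dict, gold: dict) -> float:
    """Fraction of views recalling the original answer index despite
    randomized choice order; chance level is 1/K."""
    views = list(preds)
    hits = sum(preds[(item, lang)] == gold[item] for (item, lang) in views)
    return hits / len(views)

def clc(preds: dict, items, langs) -> float:
    """Fraction of items whose predictions agree across all languages;
    chance level is 1/K^(L-1)."""
    agree = sum(len({preds[(item, lang)] for lang in langs}) == 1
                for item in items)
    return agree / len(items)

items, langs = ["q1", "q2"], ["en", "fr", "ar"]
gold = {"q1": 2, "q2": 0}
preds = {("q1", "en"): 2, ("q1", "fr"): 2, ("q1", "ar"): 2,
         ("q2", "en"): 0, ("q2", "fr"): 1, ("q2", "ar"): 0}
print(idr(preds, gold))          # 5/6: well above chance for 4-way MCQ
print(clc(preds, items, langs))  # 0.5: only q1 agrees across languages
```

Either score would then be fed to the thresholding step against its respective baseline.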

4. Empirical Results: Effectiveness and Limitations

TACD has been empirically validated across various backbone models (LLaMA3, Gemma, Qwen, Mistral) and benchmarks (MMLU, ARC-Challenge, MathQA, XQuAD), with contamination introduced via multiple languages (French, Arabic, Spanish, etc.):

  • Recovery of Hidden Signals: While English-only detection probes fail (IDR, Min-K% both near chance), TACD's cross-lingual CLC yields high AUROC (>0.9) for distinguishing contaminated from non-contaminated models (Abbas et al., 21 Jan 2026).
  • Cross-Lingual Consistency Metric: For models contaminated on Arabic translations, CLC climbs far above baseline as contamination proportion increases (e.g., for Gemma-3-1B-it, CLC goes from 0.278 to 0.634 across p=0→100%). Qwen3-1.7B exhibits CLC=1.00 at all p, indicating degenerate collapse observed only via TACD (Abbas et al., 21 Jan 2026).
  • Generalization Benchmark Approach: The generalization-gap metric, Δ = f(T^g) − f(T), where T^g is a version of the test set with all distractor choices replaced by other correct answers, sharply distinguishes genuine generalization from contamination-induced memorization. Clean models show large Δ ≫ 0, while contaminated models exhibit Δ ≈ 0 or Δ < 0 (Yao et al., 2024).
  • Language-Dependent Effects: European language contamination (French, Spanish, Italian) yields stronger cross-lingual leakage than East Asian languages (Chinese, Japanese, Korean). This suggests a link to subword/token overlap and model input/output interfaces (Yao et al., 2024).
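The generalization-gap computation itself is simple arithmetic over two accuracies; a minimal sketch follows, where f(T) and f(T^g) are supplied directly (model evaluation is out of scope) and the decision tolerance `tol` is an assumption, not a value from the source.

```python
# Sketch of the generalization-gap probe: delta = f(T^g) - f(T), where T^g
# replaces all distractors with other questions' correct answers.

def generalization_gap(acc_confused: float, acc_original: float) -> float:
    """delta = f(T^g) - f(T)."""
    return acc_confused - acc_original

def interpret(delta: float, tol: float = 0.02) -> str:
    """Hypothetical decision rule: clean models show delta >> 0,
    contaminated models delta ~ 0 or delta < 0."""
    if delta > tol:
        return "consistent with genuine generalization"
    return "consistent with contamination-induced memorization"

print(round(generalization_gap(0.81, 0.62), 2))  # 0.19 -> clean pattern
print(interpret(generalization_gap(0.81, 0.62)))
print(interpret(generalization_gap(0.63, 0.64)))  # delta ~ -0.01
```

The value of T^g is that a memorizing model, tuned to the original distractor strings, gains nothing when those distractors change, while a generalizing model still identifies the answer relevant to the question.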

A summary of key signals and their detection power across methods is provided in the following table:

| Detection Method | Responds to X-Lingual Contam? | Detection Power in TACD Setting |
|---|---|---|
| N-Gram Overlap | No | Baseline/chance |
| Shared-Likelihood | No | Baseline/chance |
| TS-Guessing | No (if evaluated in EN only) | Baseline/chance |
| Min-K% Probability | No (EN only); Yes in TACD | High (CLC AUC > 0.9) |
| TACD (CLC/IDR) | Yes | High (AUC > 0.9) |

5. Impact on LLM Evaluation and Model Development

Translation-aware contamination detection compels a re-examination of evaluation protocols:

  • Multilingual and Translation-Variant Pipelines: Reliable assessment mandates benchmarking across multiple translations and perturbations, not solely in English or a single language (Abbas et al., 21 Jan 2026).
  • Understanding LLM Knowledge Representation: Discrepancies in contamination transfer across languages (e.g., higher for French than for Arabic or Chinese) offer interpretive leverage for studying how LLMs store and access factual and linguistic information (Yao et al., 2024).
  • Low-Cost Multilingual Tuning: The same training pipelines that induce cross-lingual contamination can be repurposed to enhance multilinguality, as continual pre-training on \tau_\ell(T) yields boosts not only in the target language but also in English and other intermediate languages (Yao et al., 2024).

The TACD philosophy is broadly applicable beyond QA and MCQ evaluation:

  • Text-to-Code: Masking structural names (e.g., API calls, column names) and measuring reconstruction accuracy (as in DC-Accuracy for text-to-SQL) can serve as a contamination probe across software engineering benchmarks (Ranaldi et al., 2024).
  • Generative/Extractive QA: Masked-token prediction and answer perturbation can be generalized to translation-adjusted evaluation for extractive tasks, using metrics such as exact match and ROUGE-L in each language (Abbas et al., 21 Jan 2026).
  • Machine Translation Benchmarks: Adapting the generalization-gap or choice-confusion paradigms allows robustness analysis in translation tasks where canonical sentence pairs can be masked or structurally altered (Ranaldi et al., 2024, Yao et al., 2024).
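The first of these extensions, a masking-based probe in the spirit of DC-Accuracy for text-to-SQL, can be sketched as below. The mask token, the regex-based masking, and the stub model are all illustrative assumptions; a real probe would mask the benchmark's actual structural names and query the model under test.

```python
# Hypothetical sketch of a masking-based contamination probe for text-to-SQL:
# column names are masked and the model must reconstruct the full query.
# High reconstruction accuracy on names the model could not infer from the
# masked query alone suggests memorization of the benchmark.
import re

MASK = "[COL]"  # illustrative mask token

def mask_columns(sql: str, columns: list[str]) -> str:
    """Replace each known column name in the query with a mask token."""
    for col in columns:
        sql = re.sub(rf"\b{re.escape(col)}\b", MASK, sql)
    return sql

def dc_accuracy(examples, reconstruct) -> float:
    """Fraction of masked queries the model reconstructs exactly."""
    hits = 0
    for sql, columns in examples:
        if reconstruct(mask_columns(sql, columns)) == sql:
            hits += 1
    return hits / len(examples)

examples = [("SELECT name FROM users WHERE age > 30", ["name", "age"])]
# A stub "model" that has memorized the benchmark reconstructs it exactly.
memorizing_model = lambda masked: "SELECT name FROM users WHERE age > 30"
print(dc_accuracy(examples, memorizing_model))  # 1.0 -> memorization signal
```

A clean model should score near chance here, since the masked query gives no evidence for the specific identifiers.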

In all cases, a strong gap between performance on “public” (potentially contaminated) versus “fresh” splits, or under TACD-style cross-lingual perturbations, constitutes a robust indicator of contamination, surpassing monolingual overlap-based approaches (Ranaldi et al., 2024, Yao et al., 2024, Abbas et al., 21 Jan 2026).
