
Translation-Aware Contamination Detection

Updated 28 January 2026
  • Translation-Aware Contamination Detection is a class of methods that accounts for multilingual data leakage by analyzing translations of benchmark items.
  • It employs metrics like Cross-Lingual Consistency (CLC) and Index Recall Rate (IDR) to identify hidden contamination that inflates model performance.
  • Empirical results demonstrate high detection power (AUC > 0.9) and reveal language-dependent effects, guiding improvements in LLM evaluation.

Translation-Aware Contamination Detection refers to a class of contamination detection methodologies that explicitly account for the possibility of benchmark data leakage through translations, rather than only considering surface-level overlap in the source language. This is of critical importance in multilingual LLM evaluation, as translations of benchmarks can enter model pre-training corpora, thereby artificially inflating performance on the original task in the source language, while escaping traditional detection techniques focused on monolingual or surface-form overlap. Recent studies have demonstrated both the existence of cross-lingual contamination effects and the necessity of translation-aware detection protocols to ensure rigorous, fair, and reproducible assessment of LLM generalization (Yao et al., 2024, Abbas et al., 21 Jan 2026).

1. Foundations: Cross-Lingual Contamination and Its Formalization

Data contamination occurs when evaluation items or close variants are present in a model’s training data, thereby undermining claims of generalization. Traditional detection focuses on direct overlap, e.g., n-gram or token matching. In the translation-aware context, the contamination indicator is revised to account for translations:

\mathrm{cont}(x) = 1\left[\exists\,\ell\in\mathcal{L} : T_\ell(x) \in \mathcal{S}_{\mathrm{train}}\right]

Here, x is a benchmark instance, T_\ell(x) its translation into language \ell, \mathcal{L} the set of languages considered, and \mathcal{S}_{\mathrm{train}} the (possibly opaque) training corpus of the model (Abbas et al., 21 Jan 2026). Under this threat model, "cross-lingual contamination" is said to occur when a model is exposed to a translated version of the evaluation data, but not the original, allowing memorization of answers without direct English overlap (Yao et al., 2024). Empirical injection of such contamination, for instance by fine-tuning LLMs on translated test sets in French or Arabic, results in measurable performance gains on the original English benchmarks despite the model never seeing the English form (Abbas et al., 21 Jan 2026).
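The contamination indicator cont(x) can be sketched directly. In the sketch below, the `translate` helper, the `normalize` step, and the toy training corpus are illustrative stand-ins: a real check would run an actual MT system against the model's pre-training corpus, which is typically not available in full.

```python
# Sketch of the translation-aware contamination indicator cont(x),
# assuming hypothetical helpers `translate`/`normalize` and a training
# corpus represented as a set of normalized strings.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a crude membership check."""
    return " ".join(text.lower().split())

def translate(text: str, lang: str) -> str:
    """Placeholder for a real MT system; identity outside the toy lookup."""
    toy_mt = {("the sky is blue", "fr"): "le ciel est bleu"}
    return toy_mt.get((text, lang), text)

def cont(x: str, languages: list[str], train_corpus: set[str]) -> int:
    """cont(x) = 1 iff some translation T_l(x) appears in S_train."""
    return int(any(normalize(translate(x, lang)) in train_corpus
                   for lang in languages))

train = {normalize("Le ciel est bleu")}
print(cont("the sky is blue", ["en", "fr"], train))  # -> 1
print(cont("grass is green", ["en", "fr"], train))   # -> 0
```

The English form never appears in the corpus, yet cont(x) = 1 via the French translation, which is exactly the leakage path that monolingual overlap checks miss.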

2. Failure Modes of Traditional Contamination Detection

Conventional detection techniques are based on surface similarity between evaluation and training (n-gram overlap, shared-likelihood, semantic matching). For cross-lingual contamination, these probes are fundamentally inadequate:

  • N-Gram Overlap: When contamination occurs via the translated test set, n-gram recall between training and evaluation data in the source language remains at "clean-model" baseline levels, even as performance is spuriously inflated (Yao et al., 2024).
  • Likelihood-Based Permutation Tests: Shared-likelihood p-values for models contaminated on translated benchmarks show no significant reduction, failing to flag contamination (Yao et al., 2024).
  • Behavioral and Distributional Probes: Methods such as Tested Slot Guessing (TS-Guessing), Index Recall Rate (IDR), and Min-K% probability analysis exhibit dramatic suppression of signal when the model is contaminated via translation. For instance, on Mistral-7B, IDR in English collapses to baseline even with 100% contamination through Arabic, and Min-K% AUROC values indicate near-chance detection (Abbas et al., 21 Jan 2026).

These failures arise because translated contamination evades surface-form matching and disrupts positional or probabilistic cues commonly leveraged in monolingual contamination detection.

3. Translation-Aware Contamination Detection Methodologies

Translation-Aware Contamination Detection (TACD) seeks to recover hidden contamination signals by leveraging multilingual evaluation and cross-variant perturbations:

  • Evaluation Views: For each instance x_j, TACD generates translations x_j^\ell = T_\ell(x_j) for all \ell \in \mathcal{L} (e.g., EN, AR, FR). For every language, a random permutation of the answer choices is applied, yielding perturbed views \tilde{x}_j^\ell (Abbas et al., 21 Jan 2026).
  • Model Querying and Signal Extraction:
    • IDR (Index Recall Rate): Proportion of views in which the model correctly recalls the original answer’s index under randomized choice orderings and across all languages. Under pure generalization, this value should be near uniform guessing, 1/K for K-way multiple choice.
    • CLC (Cross-Lingual Consistency): The fraction of examples where predictions agree across all tested languages. For L languages and K answer choices, the baseline is 1/K^{L-1} under independent guessing.
  • Statistical Thresholding: Contamination is flagged if either signal significantly exceeds its baseline, as determined by binomial tests or thresholds calibrated on clean data.
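The thresholding step can be sketched with a one-sided binomial tail test. The baselines follow the text (1/K for IDR, 1/K^{L-1} for CLC); the significance level `alpha`, the helper names, and the exact decision rule are illustrative assumptions rather than the paper's specification.

```python
# Sketch of the statistical thresholding step: compare an observed IDR or
# CLC hit count against its chance baseline with a one-sided binomial test.
from math import comb

def binom_sf(k: int, n: int, p: float) -> float:
    """P[X >= k] for X ~ Binomial(n, p), summed exactly over the tail."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def flag_contamination(hits: int, n: int, K: int, L: int = 1,
                       metric: str = "idr", alpha: float = 0.01) -> bool:
    """Flag if the signal significantly exceeds its independence baseline:
    1/K for IDR, 1/K^(L-1) for CLC."""
    baseline = 1 / K if metric == "idr" else 1 / K ** (L - 1)
    return binom_sf(hits, n, baseline) < alpha

# 4-way MCQ over 3 languages: chance CLC baseline is 1/16 = 0.0625,
# so 120/200 cross-lingually consistent items is wildly significant,
# while 15/200 is within noise of the expected 12.5.
print(flag_contamination(120, 200, K=4, L=3, metric="clc"))  # True
print(flag_contamination(15, 200, K=4, L=3, metric="clc"))   # False
```

Calibrating thresholds on known-clean models, as the text mentions, would replace the fixed `alpha` with an empirical cutoff.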

The TACD algorithm can be summarized as follows:

  1. For each evaluation item and each of L languages, compute a translation and randomize answer order.
  2. Query the model on each variant, recording both the predicted index per view and prediction agreement across views.
  3. Aggregate IDR and CLC scores; flag contamination if these diverge from expectation under independence (Abbas et al., 21 Jan 2026).
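The aggregation in steps 2–3 can be sketched as follows. The prediction dictionary and gold-index bookkeeping are illustrative stand-ins for actual model queries; real use would record the model's predicted answer index for every (item, language) view.

```python
# Sketch of IDR/CLC aggregation over per-view model predictions.
# preds[(item, lang)] -> predicted answer index for that view
# gold[item]          -> the answer's index in the ORIGINAL ordering

def idr(preds: dict, gold: dict) -> float:
    """Fraction of views recalling the original answer index despite
    randomized choice order; chance level is 1/K."""
    views = list(preds)
    hits = sum(preds[(item, lang)] == gold[item] for (item, lang) in views)
    return hits / len(views)

def clc(preds: dict, items, langs) -> float:
    """Fraction of items whose predictions agree across all languages;
    chance level is 1/K^(L-1)."""
    agree = sum(len({preds[(item, lang)] for lang in langs}) == 1
                for item in items)
    return agree / len(items)

items, langs = ["q1", "q2"], ["en", "fr", "ar"]
gold = {"q1": 2, "q2": 0}
preds = {("q1", "en"): 2, ("q1", "fr"): 2, ("q1", "ar"): 2,
         ("q2", "en"): 0, ("q2", "fr"): 1, ("q2", "ar"): 0}
print(idr(preds, gold))          # 5/6: well above chance for 4-way MCQ
print(clc(preds, items, langs))  # 0.5: only q1 agrees across languages
```

Either score would then be fed to the thresholding step against its respective baseline.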

4. Empirical Results: Effectiveness and Limitations

TACD has been empirically validated across various backbone models (LLaMA3, Gemma, Qwen, Mistral) and benchmarks (MMLU, ARC-Challenge, MathQA, XQuAD), with contamination introduced via multiple languages (French, Arabic, Spanish, etc.):

  • Recovery of Hidden Signals: While English-only detection probes fail (IDR, Min-K% both near chance), TACD's cross-lingual CLC yields high AUROC (>0.9) for distinguishing contaminated from non-contaminated models (Abbas et al., 21 Jan 2026).
  • Cross-Lingual Consistency Metric: For models contaminated on Arabic translations, CLC climbs far above baseline as contamination proportion increases (e.g., for Gemma-3-1B-it, CLC goes from 0.278 to 0.634 across p=0→100%). Qwen3-1.7B exhibits CLC=1.00 at all p, indicating degenerate collapse observed only via TACD (Abbas et al., 21 Jan 2026).
  • Generalization Benchmark Approach: The generalization-gap metric, Δ = f(T^g) − f(T), where T^g is a version of the test set with all distractor choices replaced by other correct answers, sharply distinguishes genuine generalization from contamination-induced memorization. Clean models show large Δ ≫ 0, while contaminated models exhibit Δ ≈ 0 or Δ < 0 (Yao et al., 2024).
  • Language-Dependent Effects: European language contamination (French, Spanish, Italian) yields stronger cross-lingual leakage than East Asian languages (Chinese, Japanese, Korean). This suggests a link to subword/token overlap and model input/output interfaces (Yao et al., 2024).
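The generalization-gap computation itself is simple arithmetic over two accuracies; a minimal sketch follows, where f(T) and f(T^g) are supplied directly (model evaluation is out of scope) and the decision tolerance `tol` is an assumption, not a value from the source.

```python
# Sketch of the generalization-gap probe: delta = f(T^g) - f(T), where T^g
# replaces all distractors with other questions' correct answers.

def generalization_gap(acc_confused: float, acc_original: float) -> float:
    """delta = f(T^g) - f(T)."""
    return acc_confused - acc_original

def interpret(delta: float, tol: float = 0.02) -> str:
    """Hypothetical decision rule: clean models show delta >> 0,
    contaminated models delta ~ 0 or delta < 0."""
    if delta > tol:
        return "consistent with genuine generalization"
    return "consistent with contamination-induced memorization"

print(round(generalization_gap(0.81, 0.62), 2))  # 0.19 -> clean pattern
print(interpret(generalization_gap(0.81, 0.62)))
print(interpret(generalization_gap(0.63, 0.64)))  # delta ~ -0.01
```

The value of T^g is that a memorizing model, tuned to the original distractor strings, gains nothing when those distractors change, while a generalizing model still identifies the answer relevant to the question.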

A summary of key signals and their detection power across methods is provided in the following table:

| Detection Method | Responds to X-Lingual Contam? | Detection Power in TACD Setting |
|---|---|---|
| N-Gram Overlap | No | Baseline/chance |
| Shared-Likelihood | No | Baseline/chance |
| TS-Guessing | No (if evaluated in EN only) | Baseline/chance |
| Min-K% Probability | No (EN only); Yes in TACD | High (CLC AUC > 0.9) |
| TACD (CLC/IDR) | Yes | High (AUC > 0.9) |

5. Impact on LLM Evaluation and Model Development

Translation-aware contamination detection compels a re-examination of evaluation protocols:

  • Multilingual and Translation-Variant Pipelines: Reliable assessment mandates benchmarking across multiple translations and perturbations, not solely in English or a single language (Abbas et al., 21 Jan 2026).
  • Understanding LLM Knowledge Representation: Discrepancies in contamination transfer across languages (e.g., higher for French than for Arabic or Chinese) offer interpretive leverage for studying how LLMs store and access factual and linguistic information (Yao et al., 2024).
  • Low-Cost Multilingual Tuning: The same training pipelines that induce cross-lingual contamination can be repurposed to enhance multilinguality, as continual pre-training on \tau_\ell(T) yields boosts not only in the target language but also in English and other intermediate languages (Yao et al., 2024).

The TACD philosophy is broadly applicable beyond QA and MCQ evaluation:

  • Text-to-Code: Masking structural names (e.g., API calls, column names) and measuring reconstruction accuracy (as in DC-Accuracy for text-to-SQL) can serve as a contamination probe across software engineering benchmarks (Ranaldi et al., 2024).
  • Generative/Extractive QA: Masked-token prediction and answer perturbation can be generalized to translation-adjusted evaluation for extractive tasks, using metrics such as exact match and ROUGE-L in each language (Abbas et al., 21 Jan 2026).
  • Machine Translation Benchmarks: Adapting the generalization-gap or choice-confusion paradigms allows robustness analysis in translation tasks where canonical sentence pairs can be masked or structurally altered (Ranaldi et al., 2024, Yao et al., 2024).
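The first of these extensions, a masking-based probe in the spirit of DC-Accuracy for text-to-SQL, can be sketched as below. The mask token, the regex-based masking, and the stub model are all illustrative assumptions; a real probe would mask the benchmark's actual structural names and query the model under test.

```python
# Hypothetical sketch of a masking-based contamination probe for text-to-SQL:
# column names are masked and the model must reconstruct the full query.
# High reconstruction accuracy on names the model could not infer from the
# masked query alone suggests memorization of the benchmark.
import re

MASK = "[COL]"  # illustrative mask token

def mask_columns(sql: str, columns: list[str]) -> str:
    """Replace each known column name in the query with a mask token."""
    for col in columns:
        sql = re.sub(rf"\b{re.escape(col)}\b", MASK, sql)
    return sql

def dc_accuracy(examples, reconstruct) -> float:
    """Fraction of masked queries the model reconstructs exactly."""
    hits = 0
    for sql, columns in examples:
        if reconstruct(mask_columns(sql, columns)) == sql:
            hits += 1
    return hits / len(examples)

examples = [("SELECT name FROM users WHERE age > 30", ["name", "age"])]
# A stub "model" that has memorized the benchmark reconstructs it exactly.
memorizing_model = lambda masked: "SELECT name FROM users WHERE age > 30"
print(dc_accuracy(examples, memorizing_model))  # 1.0 -> memorization signal
```

A clean model should score near chance here, since the masked query gives no evidence for the specific identifiers.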

In all cases, a strong gap between performance on “public” (potentially contaminated) versus “fresh” splits, or under TACD-style cross-lingual perturbations, constitutes a robust indicator of contamination, surpassing monolingual overlap-based approaches (Ranaldi et al., 2024, Yao et al., 2024, Abbas et al., 21 Jan 2026).
