Tested Slot Guessing Method
- Tested Slot Guessing Method is a probing framework that quantifies contamination by detecting memorized answer patterns rather than true semantic reasoning in language models.
- It employs random permutation and selective masking of answer choices to force models into index recall and token recovery tasks, using metrics like Exact-Match and ROUGE-L F1.
- The approach extends to multilingual settings with Translation-Aware TS-Guessing, evaluating cross-lingual consistency (CLC) to reveal memorization even when surface cues are neutralized.
A Tested Slot Guessing Method is a behavioral probing framework for quantifying the extent to which LMs, particularly LLMs, rely on memorized positional or lexical patterns—such as answer indices or token orderings—instead of genuine semantic reasoning when solving multiple-choice or extractive question-answering (QA) tasks. It offers robust contamination signals in both monolingual and multilingual settings, especially when conventional surface-form overlap metrics fail to diagnose data leakage or memorization.
1. Formal Definition and Theoretical Foundations
Tested Slot Guessing (TS-Guessing) operationalizes contamination detection by leveraging forced slot-filling tasks. For multiple-choice items, this involves randomly permuting the answer choices and masking the content of one (often an incorrect) option. The model is then tasked to recover either the original index of the masked choice (index recall) or to generate the token sequence corresponding to the masked content (token recovery). The core proposition is that models trained or contaminated on the original task distribution will exploit memorized answer-pattern priors, which manifest as above-chance accuracy in these reconstruction tasks, even when semantic cues are minimal or redacted.
Mathematically, for $N$ test items, the index-recall rate is

$$\mathrm{IDR} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\!\left[\hat{y}_i = y_i\right],$$

where $\hat{y}_i$ is the index predicted by the model after permutation, and $y_i$ is the true index in the original pre-shuffle configuration. Additional metrics include Exact-Match (EM) and ROUGE-L F1 when the task is framed as token recovery (Abbas et al., 21 Jan 2026).
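The index-recall computation reduces to a mean of match indicators; a minimal sketch (variable names are illustrative):

```python
def index_recall_rate(predicted, true):
    """Fraction of items where the model's predicted index matches
    the original pre-shuffle index (the IDR metric)."""
    if len(predicted) != len(true):
        raise ValueError("prediction/label lists must align")
    return sum(p == t for p, t in zip(predicted, true)) / len(true)

# Example: eight 4-choice items; the chance level is 1/4 = 0.25.
preds = ["C", "A", "B", "C", "D", "A", "C", "B"]
gold  = ["C", "A", "D", "C", "A", "A", "C", "B"]
print(index_recall_rate(preds, gold))  # 0.75
```

An IDR well above $1/K$ under per-instance permutation is the contamination signal described below.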
2. Methodological Implementation and Protocols
The tested slot guessing protocol consists of: (1) random permutation of choice order; (2) selective masking; (3) prompting of the model to recover either the index or token span; and (4) aggregation of accuracy or overlap measures across the dataset. To suppress trivial exploitation of fixed answer format conventions (e.g., “the answer is always C”), an essential extension is to introduce a per-instance re-ordering of choices before masking (Abbas et al., 21 Jan 2026). This enforces that any above-baseline signal in index recall must originate from underlying memorization.
The stepwise algorithm can be summarized as:
- For each instance $i = 1, \dots, N$:
  - Draw a random permutation $\pi$ of $\{1, \dots, K\}$.
  - Reorder the choices as $c_{\pi(1)}, \dots, c_{\pi(K)}$.
  - Select and mask one choice $c_{\pi(j)}$.
  - Prompt: "Question: $q_i$. Choices: $c_{\pi(1)}$, ..., [MASK], ..., $c_{\pi(K)}$."
  - Query the model for (a) the masked text or (b) the original index letter.
  - Score as "correct" if the prediction recovers the original masked slot.
Index Recall Rate and token-level EM or ROUGE-L are computed as the evaluation metrics (Abbas et al., 21 Jan 2026).
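The stepwise protocol above can be sketched in Python; the prompt template and helper names here are illustrative, not the paper's exact implementation:

```python
import random

def ts_guess_instance(question, choices, rng=None):
    """Build one TS-Guessing probe: permute the choices, mask one,
    and return the prompt plus the information needed for scoring.
    (A sketch of the protocol above; the prompt wording is illustrative.)"""
    rng = rng or random.Random()
    k = len(choices)
    perm = list(range(k))
    rng.shuffle(perm)                          # (1) random permutation
    reordered = [choices[p] for p in perm]
    j = rng.randrange(k)                       # (2) selective masking
    masked_text = reordered[j]
    shown = reordered[:j] + ["[MASK]"] + reordered[j + 1:]
    letters = [chr(ord("A") + i) for i in range(k)]
    body = "  ".join(f"{l}. {c}" for l, c in zip(letters, shown))
    prompt = f"Question: {question}\nChoices: {body}"   # (3) prompt
    original_index = letters[perm[j]]          # pre-shuffle index of the mask
    return prompt, masked_text, original_index

prompt, gold_text, gold_index = ts_guess_instance(
    "Which planet is largest?",
    ["Mercury", "Jupiter", "Venus", "Mars"],
    rng=random.Random(7),
)
# (4) a prediction is scored correct if it recovers gold_text
# (token recovery, via EM/ROUGE-L) or gold_index (index recall).
print(prompt)
```

Re-drawing the permutation per instance is what blocks trivial "the answer is always C" exploits, as noted above.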
3. Behavioral and Distributional Contamination Signals
Tested slot guessing directly probes for contamination-induced behavioral artifacts. In non-contaminated (“clean”) models, performance on both index-recall and masked-token recovery closely approximates random guessing, especially when all positional and lexical cues are neutralized by permutation. In contaminated models, index recall is significantly above the $1/K$ random baseline, and token recovery metrics spike for test items present in pretraining (Abbas et al., 21 Jan 2026).
Complementary to behavioral probing, the Min-K% metric scores each input by the $K\%$ of tokens to which the model assigns the lowest conditional probability. One common formulation is the negative mean log-likelihood over that set: for a given input $x$,

$$\mathrm{MinK\%}(x) = -\frac{1}{|S_K(x)|}\sum_{x_t \in S_K(x)} \log p\left(x_t \mid x_{<t}\right),$$

where $S_K(x)$ is the set of the $K\%$ least-probable tokens of $x$. Lower Min-K% on a suspected benchmark compared to held-out data is indicative of over-confident, memorized outputs and hence possible contamination (Abbas et al., 21 Jan 2026).
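The Min-K% signal can be computed directly from per-token log-probabilities; a minimal sketch under the negative mean log-likelihood formulation (the toy numbers are illustrative, and in practice the log-probs come from the model under audit):

```python
def min_k_percent(token_logprobs, k=20):
    """Negative mean log-probability of the K% least-likely tokens.
    Lower scores (closer to 0) suggest over-confident, possibly
    memorized text; higher scores indicate surprising tokens."""
    n = max(1, int(len(token_logprobs) * k / 100))
    lowest = sorted(token_logprobs)[:n]   # the K% worst-scored tokens
    return -sum(lowest) / n

# Toy comparison: a 'memorized' sequence has no very unlikely tokens.
memorized = [-0.1, -0.2, -0.05, -0.3, -0.15]
unseen    = [-0.1, -2.5, -0.05, -4.0, -0.15]
print(min_k_percent(memorized, k=40) < min_k_percent(unseen, k=40))  # True
```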
4. Translation-Aware and Multilingual TS-Guessing Extensions
Surface-form based detectors are insufficient for catching cross-lingual contamination—memorization that persists across translations of benchmarks into non-English languages (e.g., Arabic, French, Chinese). To address this, Translation-Aware TS-Guessing applies the probe on multiple independently translated variants of the same benchmark, combined with random re-orderings for each language (Abbas et al., 21 Jan 2026, Yao et al., 2024).
The Translation-Aware Contamination Detection (TACD) protocol compares two novel signals:
- Index Recall per language ($\mathrm{IDR}_\ell$): Computed as in the base TS-Guessing probe for each language $\ell$.
- Cross-Lingual Consistency (CLC): Fraction of test cases where the model outputs the same index/answer across all language views for a permuted, masked input.
Formally, for $L$ languages,

$$\mathrm{CLC} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\!\left[\hat{y}_i^{(1)} = \hat{y}_i^{(2)} = \cdots = \hat{y}_i^{(L)}\right],$$

where $\hat{y}_i^{(\ell)}$ is the index predicted for item $i$ in language $\ell$. Random guessing over $K$ choices leads to $\mathrm{CLC} \approx (1/K)^{L-1}$. Substantially higher values—especially monotonic growth with increased "poisoned" (contaminated) exposure—flag contamination persisting in cross-lingual representations (Abbas et al., 21 Jan 2026).
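The CLC computation admits a direct implementation; a minimal sketch (the per-language prediction lists are illustrative):

```python
def cross_lingual_consistency(preds_by_lang):
    """CLC: fraction of items where the model outputs the same index
    across all language views of the permuted, masked input.
    `preds_by_lang` maps language code -> list of predicted indices."""
    views = list(preds_by_lang.values())
    n = len(views[0])
    agree = sum(len({v[i] for v in views}) == 1 for i in range(n))
    return agree / n

preds = {
    "en": ["A", "C", "B", "D", "A"],
    "ar": ["A", "C", "D", "D", "B"],
    "fr": ["A", "C", "B", "D", "C"],
}
print(cross_lingual_consistency(preds))  # 0.6
```

For comparison, with $K = 4$ choices and $L = 3$ languages the random baseline is $(1/4)^2 \approx 0.06$, so agreement this high would already be suspicious.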
5. Empirical Findings, Limitations, and Best Practices
Empirical evaluation reveals that:
- Under monolingual contamination, models exhibit index recall rates well above random and low Min-K% for contaminated benchmarks.
- Translation of benchmarks into other languages (e.g., English → Arabic) often neutralizes English-only slot-guessing and Min-K% detectors; index recall can fall to baseline. However, translation-invariant memorization is uncovered by significant rises in CLC, demonstrating cross-lingual consistency patterns that synthetic random models would not achieve (Abbas et al., 21 Jan 2026, Yao et al., 2024).
- TACD exposes contamination patterns across various LLMs—such as LLaMA, Qwen, Gemma—where standard detectors are silent. For instance, in LLaMA-1B, CLC rises from 0.001 (0% poison) to 0.171 (100% Arabic exposure), even when IDR remains at random levels (Abbas et al., 21 Jan 2026).
A summary of reported signals for three models across Arabic contamination levels:
| Model | Poison % | IDR | CLC |
|---|---|---|---|
| LLaMA-1B | 0% | 0.145 | 0.001 |
| LLaMA-1B | 10% | 0.139 | 0.107 |
| LLaMA-1B | 50% | 0.222 | 0.176 |
| LLaMA-1B | 100% | 0.186 | 0.171 |
| Qwen-1.7B | 0% | 0.229 | 1.000 |
| Gemma-1B | 0% | 0.242 | 0.278 |
| Gemma-1B | 100% | 0.239 | 0.634 |
These patterns show that cross-lingual consistency is a sensitive marker of contamination even when monolingual index recall is uninformative.
Practical considerations include the need for high-fidelity translation, threshold calibration for statistical decision rules (e.g., comparing index recall against the $1/K$ chance rate and CLC against the $(1/K)^{L-1}$ random baseline), and computational overhead proportional to the number of language views (Abbas et al., 21 Jan 2026).
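Calibrating a decision threshold against the chance baseline can be framed as a one-sided binomial test on the index-recall count; a stdlib-only sketch (the counts are illustrative, not reported results):

```python
from math import comb

def binom_sf(successes, n, p):
    """P[X >= successes] for X ~ Binomial(n, p): the one-sided tail
    used to test whether observed index recall exceeds chance."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x)
               for x in range(successes, n + 1))

# 500 four-choice items, chance p = 1/4: is 160 correct index
# recoveries (IDR = 0.32) surprising under a no-contamination null?
p_value = binom_sf(160, 500, 0.25)
print(p_value < 0.01)  # True
```

The same construction applies to CLC with $p = (1/K)^{L-1}$ as the null success rate.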
6. Broader Impact and Integration with Other Contamination Detection Strategies
Tested Slot Guessing methods should be used in conjunction with other contamination detection approaches such as schema-reconstruction probes (Ranaldi et al., 2024), behavioral and distributional outlier detection, and continual benchmarking reformulation. They are crucial for certifying model evaluation authenticity—especially in multilingual and cross-lingual contexts—and for stress-testing LLMs against reliance on spurious cues or memorized structural patterns.
Translation-Aware TS-Guessing (as part of TACD) is particularly suited for identifying memorization that is robust to paraphrasing, translation, and other surface-level transformations of benchmarks. As LLMs become increasingly multilingual, maintaining and interpreting a battery of such cross-lingual probes is essential for fair, reproducible evaluation and for guiding the development of benchmark-invisible assessment protocols (Abbas et al., 21 Jan 2026, Yao et al., 2024).