
AntiLeak-Bench: Fair LLM Evaluation

Updated 3 February 2026
  • AntiLeak-Bench is a comprehensive framework that detects and prevents data leakage, ensuring evaluations reflect true model generalization.
  • It employs advanced metrics like token-level log-probabilities, perplexity tests, and near-duplicate detection to identify exact, semantic, and contextual leaks.
  • The system integrates automated decontamination and rigorous documentation practices to promote fair, reproducible, and transparent model assessments.

AntiLeak-Bench refers to a collection of methodologies, frameworks, and operational pipelines for detecting, preventing, and mitigating data leakage between training corpora and evaluation benchmarks, particularly for large language models (LLMs). Data leakage undermines the trustworthiness of model evaluation by artificially inflating metrics through memorization of benchmark items encountered during pre-training. AntiLeak-Bench frameworks integrate advanced detection metrics, contamination-resistant benchmark construction, inference-time decontamination, and rigorous documentation to ensure fair and reproducible evaluation.

1. Problem Definition and Taxonomy of Data Leakage

Data leakage encompasses a spectrum of phenomena where evaluation samples—or near-duplicates thereof—appear in a model's pretraining or fine-tuning data, enabling models to pass benchmarks by direct recall or shallow pattern matching rather than genuine generalization. Benchmarks at risk include, but are not limited to, multiple-choice QA (e.g., MMLU, C-Eval), mathematical reasoning (e.g., GSM8K, MATH), code generation (SW, APPS), and structured knowledge tasks.

The primary forms of leakage are:

  • Exact leakage: The evaluation item is present verbatim in the training dataset.
  • Semantic leakage: A paraphrased or reformulated variant exists in the training data, i.e., a semantic-equivalence function $f(x_e, x_p)$ returns $1$ for evaluation instance $x_e$ and pretraining instance $x_p$.
  • Knowledge leakage in context-rich QA: The model's performance is artificially high if $\Pr[a^* \mid q]$ or $\Pr[a^* \mid \tilde{q}, \mathbf{C}]$ is abnormally elevated without true reliance on the provided context $\mathbf{C}$.
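These leakage classes can be operationalized with simple pairwise checks. A minimal sketch in Python, where token-set Jaccard similarity stands in for the semantic-equivalence function $f(x_e, x_p)$ (the normalizer, threshold, and function names here are illustrative, not taken from the cited works):

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for exact-match comparison."""
    text = re.sub(r"[^a-z0-9 ]", "", text.lower())
    return re.sub(r" +", " ", text).strip()

def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over word sets -- a crude stand-in for f(x_e, x_p)."""
    sa, sb = set(normalize(a).split()), set(normalize(b).split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def classify_leak(eval_item: str, train_item: str, sem_threshold: float = 0.8) -> str:
    """Return 'exact', 'semantic', or 'none' for a single eval/train pair."""
    if normalize(eval_item) == normalize(train_item):
        return "exact"
    if token_jaccard(eval_item, train_item) >= sem_threshold:
        return "semantic"
    return "none"
```

In practice, the equivalence function would be an embedding- or LLM-based judge; the structure of the pairwise scan is the same.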

Contamination can occur through direct inclusion of benchmark repositories, reliance on public challenge-platform data, or lagging curation practices that do not filter out emergent knowledge in static data pools (Zhou et al., 10 Feb 2025, Wu et al., 2024).

2. Detection Metrics and Algorithms

AntiLeak-Bench deployments utilize robust, protocol-driven detection mechanisms suited to the type of benchmark and accessible model resources.

  • Token-level log-probability anomaly (MC questions): Shuffle all $n!$ permutations of answer options, query $\log P_\mathcal{M}$ for each, and declare a question leaked if the original ordering's score is a statistical outlier (z-scoring or IsolationForest) (Ni et al., 2024).
  • Perplexity (PPL) and N-gram Accuracy: For both original and paraphrased (reference) datasets, compute answer PPL and exact match for model-generated $n$-grams. A substantial drop in performance on reference paraphrases signals overfitting to memorized items (Xu et al., 2024).
  • Near-duplicate detection using MinHash/LSH: For code and structured data, tokenized 2-gram MinHash vectors with LSH yield candidate duplicates, which are then manually adjudicated for semantic identity (Zhou et al., 10 Feb 2025).
  • Perturbation tests (long-context QA): Leak detection by context removal, question paraphrase, and contradiction, where any anomalous maintenance of answer probability after perturbation indicates leakage (Fang et al., 21 Jun 2025).
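The permutation-based log-probability test can be sketched as follows. Here `logprob_fn` is a hypothetical callable scoring the prompt under the model, and the z-score threshold is illustrative (the cited work also considers IsolationForest as the outlier detector):

```python
import itertools
import statistics

def permutation_outlier(logprob_fn, question, options, z_threshold=2.0):
    """Score every ordering of the answer options and flag the question as
    leaked if the original ordering's log-probability is a z-score outlier.

    logprob_fn(question, options) -> float is a hypothetical callable that
    returns the model's log-probability for the prompt with options in that order.
    """
    scores = [logprob_fn(question, list(perm))
              for perm in itertools.permutations(options)]
    original = scores[0]  # itertools.permutations yields the identity ordering first
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)
    if sigma == 0:
        return False  # all orderings scored equally: no anomaly
    return (original - mu) / sigma > z_threshold
```

Note the $n!$ model calls per question; for large option counts, a random subsample of permutations is the usual workaround.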

A representative table of detection modes is as follows:

Method                   Benchmark Type              Technique
LogP permutation         Multiple-choice QA          Outlier detection
PPL / N-gram accuracy    Math, coding, generation    Paraphrase test
MinHash + LSH            Code, text corpora          Near-duplicate candidates
Perturbation tests       Long-context QA             Contextual probing
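The MinHash stage of the near-duplicate pipeline can be sketched in plain Python (the LSH banding step that buckets signatures for fast candidate retrieval is omitted, and the hash count and seeding scheme are illustrative):

```python
import hashlib

def shingles(tokens, n=2):
    """Tokenized n-gram shingles (n=2, as in the pipeline described above)."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def minhash_signature(shingle_set, num_perm=64):
    """MinHash signature: the minimum hash value under each seeded hash function."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Pairs whose estimated similarity exceeds a threshold become candidate duplicates, which are then manually adjudicated for semantic identity.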

3. Benchmark Decontamination and Automated Construction

Several frameworks extend beyond detection to actively prevent or repair leakage.

  • Inference-Time Decontamination (ITD): Leaked samples are detected (e.g., via MinKProb), then rewritten using a high-capacity LLM (e.g., GPT-4) to surface-modify questions or options without altering answer or difficulty. The process is iteratively applied until metrics indicate decontamination. The approach is model-agnostic and operates at inference only, without internal access to pretraining data (2406.13990).
  • Counterfactual rewriting (LastingBench): Upon detection in QA, critical minimal evidence segments within context documents are replaced with counterfactual snippets via instructive prompting, maximizing perplexity gaps to disrupt model memorization while preserving logical structure and inference difficulty (Fang et al., 21 Jun 2025).
  • AntiLeak-Bench (fully automated, knowledge-driven): Benchmarks are synthesized from timestamped knowledge graphs (e.g., Wikidata), assembling only those samples where newly observed knowledge explicitly postdates the LLM’s known cutoff. This ensures rigorous non-overlap for each model and eliminates manual curation (Wu et al., 2024).
  • Reference Benchmark Synthesis: When original benchmarks are potentially compromised, new paraphrased instances are automatically generated, and dual-metric analysis performed to cross-validate results (Xu et al., 2024).
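A minimal sketch of the timestamp-driven construction idea, assuming a pre-extracted list of knowledge-graph updates (the field names and question template are illustrative, not the paper's actual schema):

```python
from datetime import date

def build_post_cutoff_samples(triples, cutoff):
    """Keep only knowledge-graph updates observed strictly after the model's
    training cutoff and turn each into a minimal QA sample.

    `triples` is a hypothetical list of dicts with 'subject', 'relation',
    'old_object', 'new_object', and 'observed' (a datetime.date) fields.
    """
    samples = []
    for t in triples:
        if t["observed"] <= cutoff:
            continue  # the model may have seen this fact; exclude it
        samples.append({
            "question": f"What is the {t['relation']} of {t['subject']}?",
            "answer": t["new_object"],
            "distractor": t["old_object"],  # the pre-update value the model may recall
        })
    return samples
```

Because the filter is parameterized by each model's cutoff date, the same pipeline yields a guaranteed-unseen benchmark per model without manual curation.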

4. Quantitative Impact and Auditability

Controlled experiments demonstrate that data leakage can lead to pronounced overestimation of LLM capabilities:

  • On code generation, Pass@k for StarCoder-7B is inflated by factors of 4.9× to 6.0× for leaked samples versus non-leaked (Zhou et al., 10 Feb 2025).
  • For GSM8K, ITD reduces accuracy inflation by 22.9% and brings detected leakage from 62.7% to 0.3% in synthetic contamination settings (2406.13990).
  • On benchmarked QA, revision-based counterfactualization can yield Exact Match score drops from 0.69 (original) to 0.45 (revised) for GPT-4o on HotpotQA, exposing the extent of memorization (Fang et al., 21 Jun 2025).
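Pass@k values like those cited above are conventionally computed with the standard unbiased estimator over n generated samples of which c pass; a sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn (without replacement) from n generations, c of which are
    correct, passes. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill a draw of k: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A leaked sample with memorized solutions drives c toward n, which is what produces the multiplicative inflation reported above.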

High-leakage risk is frequently concentrated in models or families with indirect data curation pipelines (e.g., Qwen series), and in benchmarks built from highly public code or question repositories.

Empirical tables and leaderboards tracking leakage across LLMs are available, highlighting models at elevated risk (Qwen2-72B: 42% leak rate on CMB) and those with minimal contamination (GLM4-9B: lowest rates in suite) (Ni et al., 2024, Xu et al., 2024).

5. Practical Deployment, Documentation, and Best Practices

Robust AntiLeak-Bench integration encompasses both automated pipeline steps and transparent reporting:

  • Automation: Full workflow from data acquisition to benchmark synthesis is script-driven (no human-in-the-loop), relying on APIs and preconfigured templates. For knowledge-based QA, pipelines extract post-cutoff facts, fetch contemporary Wikipedia context, and assemble both single-/multi-hop and multi-choice formats (Wu et al., 2024).
  • Documentation: A standardized “Benchmark Transparency Card” includes the model release date, pretraining coverage, explicit documentation of benchmark data exposure (e.g., test/train splits used), paraphrasing or augmentation practices, and links to code and raw predictions (Xu et al., 2024).
  • Continuous integration: CI/CD evaluation must automate paraphrase synthesis, leakage metric computation, and thresholded alerting ($\delta_{\text{train-test}} > 20\%$ for N-gram accuracy).
  • Community benchmarks: Benchmark releases should include both pristine and revised (“defended”) splits to support reproducibility and longitudinal model comparison (Fang et al., 21 Jun 2025).
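The thresholded-alerting step above can be sketched as a small CI check (the function names and per-sample hit representation are illustrative):

```python
def ngram_accuracy_gap(train_hits, test_hits):
    """Train-test N-gram accuracy gap from per-sample hit lists
    (1 = the model reproduced the reference n-gram exactly, 0 = it did not)."""
    train_acc = sum(train_hits) / len(train_hits)
    test_acc = sum(test_hits) / len(test_hits)
    return train_acc - test_acc

def leakage_alert(train_hits, test_hits, threshold=0.20):
    """Raise a CI alert when the gap exceeds the 20% threshold suggested above."""
    return ngram_accuracy_gap(train_hits, test_hits) > threshold
```

Wired into a CI pipeline, a returned True would fail the evaluation job and trigger re-paraphrasing of the flagged split.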

For LLM providers: removal of known benchmark repositories, fine-grained de-duplication workflows that extend to instruction-tuning data, and the use of detection pipelines at both pretraining and fine-tuning stages are essential (Zhou et al., 10 Feb 2025).

For researchers and users: always employ decontaminated benchmarks where available, contribute to collective labeling/mapping of leaks, and integrate detection pipelines against all accessible pretraining snapshots.

6. Limitations, Open Problems, and Outlook

AntiLeak-Bench frameworks achieve substantive progress in leakage remediation, but are not without limitations:

  • Detection completeness: Algorithms (e.g., MinKProb, outlier-based LogP) detect only 62–79% of contaminated samples in proof-of-concept (2406.13990). Embedding-based and retrieval-augmented approaches are suggested as directions for improved recall.
  • Benchmark modality: Most solutions are optimized for MC or extractive QA; code generation and generative modalities demand specialized semantic equivalence and paraphrase strategies.
  • Automated knowledge extraction: Reliance on structured sources (e.g., Wikidata, Wikipedia) is robust, but not exhaustively up-to-date or error-free (~2–3% errors in context or answer in post-processing audits) (Wu et al., 2024).
  • Effort vs. coverage: For permutation-based MC detection, $O(n!)$ inference calls can be challenging for large-$n$ settings, though random subsampling heuristics are available (Ni et al., 2024).
  • Adversarial retraining and rewriting: For models retrained on decontaminated or paraphrased benchmarks, further cycles of adversarial rewriting may be necessary to maintain difficulty and evaluation fidelity (2406.13990, Fang et al., 21 Jun 2025).

Continued progress will require integrating adversarial sample generation, embedding-based similarity, lightweight paraphrasing, and multi-modal leakage diagnostics, alongside strong transparency practices and community-led benchmark refreshment.

7. Significance and Influence on Model Evaluation Paradigms

The maturation of AntiLeak-Bench methodologies marks a foundational shift toward trustworthy LLM evaluation, ensuring that reported performance reflects model generalization and reasoning rather than corpus overlap. Fully automated, timestamp-driven benchmark generation frameworks set new baselines for contamination resistance and cost-effective maintenance (Wu et al., 2024). The widespread adoption of dual-metric leakage analysis, robust documentation (e.g., Benchmark Transparency Cards), and inference-time decontamination closes key loopholes in benchmark misuse and comparison. These advances harmonize best practices across academic and industrial LLM development, promoting fair competition, community auditing, and alignment between benchmark outcomes and real-world capability deployment (Zhou et al., 10 Feb 2025, 2406.13990, Fang et al., 21 Jun 2025, Wu et al., 2024, Ni et al., 2024, Xu et al., 2024).
