Scale and significance of semantic-duplicate contamination in LLM training corpora
Determine the scale (prevalence) and significance (impact on evaluation outcomes) of contamination of large language model training corpora by semantic duplicates of benchmark test items, i.e., training examples whose substantive meaning matches a benchmark item despite low or no syntactic overlap. Quantify how such contamination affects benchmark validity and the interpretation of benchmark scores.
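To make the notion of "low or no syntactic overlap" concrete, the following minimal sketch (the texts and the 3-gram Jaccard measure are illustrative assumptions, not from the source) shows how a semantic duplicate can evade a standard n-gram overlap check entirely:

```python
def ngrams(text, n=3):
    """Return the set of word n-grams of a lowercased text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def ngram_overlap(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two texts."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

# Hypothetical benchmark item and a paraphrase with the same meaning.
benchmark_item = "What is the capital of France? Answer: Paris."
semantic_duplicate = "Name the French capital city. The answer is Paris."

# Zero shared 3-grams, so an n-gram decontamination filter would keep
# the paraphrase in the training corpus despite the semantic match.
print(ngram_overlap(benchmark_item, semantic_duplicate))  # → 0.0
```

A filter of this kind only removes near-verbatim copies; paraphrased duplicates pass through, which is why the question of their prevalence and impact matters.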
References
Research on these semantic duplicates has often focused on their robustness to standard, n-gram-based 'decontamination' methods, but the scale and significance of the phenomenon remain a mostly open question.
— Soft Contamination Means Benchmarks Test Shallow Generalization
(2602.12413 - Spiesberger et al., 12 Feb 2026) in Related Work