Scale and significance of semantic-duplicate contamination in LLM training corpora

Determine the scale (prevalence) and significance (impact on evaluation outcomes) of contamination of large language model training corpora by semantic duplicates of benchmark test items, i.e., training examples whose substantive meaning matches a benchmark item despite little or no syntactic overlap. Quantify how such contamination affects benchmark validity and the interpretation of benchmark scores.

Background

The paper distinguishes exact duplicates (verbatim or near-verbatim syntactic matches) from semantic duplicates (content-equivalent items with little or no n-gram overlap) and argues that semantic duplicates are far harder to detect and remove with standard decontamination filters. Because modern training corpora are vast and heterogeneous, semantic duplicates of benchmark items can enter training data through independent parallel creation of similar content (e.g., the same math problem posed in different words) or through aggregation pipelines.
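A minimal sketch of the failure mode, using only the Python standard library. The 8-gram window and Jaccard overlap are illustrative stand-ins for typical n-gram decontamination filters; they are not parameters taken from the paper, and the example items are fabricated.

```python
# Why n-gram decontamination catches exact duplicates but misses semantic ones.
# The window size (n=8) and overlap metric (Jaccard) are illustrative choices.

def word_ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(a: str, b: str, n: int = 8) -> float:
    """Jaccard overlap of word n-grams, the usual basis for decontamination."""
    ga, gb = word_ngrams(a, n), word_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

benchmark_item = (
    "A train travels 60 miles in 1.5 hours. "
    "What is its average speed in miles per hour?"
)
exact_duplicate = benchmark_item
semantic_duplicate = (
    "If a locomotive covers a distance of sixty miles over ninety minutes, "
    "compute its mean velocity in mph."
)

print(ngram_overlap(benchmark_item, exact_duplicate))    # 1.0 -> filtered out
print(ngram_overlap(benchmark_item, semantic_duplicate)) # 0.0 -> passes filter
```

The semantic duplicate poses the identical problem yet shares no 8-gram with the benchmark item, so it survives any syntactic filter; catching it would require a meaning-level comparison such as embedding similarity.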

While prior work has documented the existence of semantic duplicates and their robustness to n-gram-based decontamination, the authors note that the broader question of how widespread the phenomenon is and how much it matters for benchmark integrity has not been comprehensively resolved. Their empirical study on Olmo 3 provides partial evidence but explicitly highlights the need to establish both prevalence and practical impact more generally.
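One hedged sketch of how "significance" could be quantified: split a benchmark by a per-item contamination flag (however obtained, e.g., via embedding similarity against the training corpus) and compare accuracy across the two subsets. All names and numbers below are illustrative toy data, not results from the paper.

```python
# Toy estimate of contamination impact: compare accuracy on benchmark items
# flagged as semantically duplicated in training vs. unflagged (clean) items.

from dataclasses import dataclass

@dataclass
class EvalRecord:
    correct: bool       # did the model answer this benchmark item correctly?
    contaminated: bool  # was a semantic duplicate found in the training data?

def subset_accuracy(records: list[EvalRecord], flag: bool) -> float:
    """Accuracy restricted to items whose contamination flag equals `flag`."""
    subset = [r for r in records if r.contaminated is flag]
    return sum(r.correct for r in subset) / len(subset) if subset else float("nan")

# Fabricated results: 100 contaminated items, 100 clean items.
records = (
    [EvalRecord(correct=True, contaminated=True)] * 90
    + [EvalRecord(correct=False, contaminated=True)] * 10
    + [EvalRecord(correct=True, contaminated=False)] * 60
    + [EvalRecord(correct=False, contaminated=False)] * 40
)

acc_contaminated = subset_accuracy(records, True)
acc_clean = subset_accuracy(records, False)
print(f"contaminated: {acc_contaminated:.2f}")               # 0.90
print(f"clean:        {acc_clean:.2f}")                      # 0.60
print(f"gap:          {acc_contaminated - acc_clean:+.2f}")  # +0.30
```

A large positive gap on the contaminated subset would suggest the headline score partly reflects memorized duplicates rather than generalization; prevalence then determines how much that gap distorts the aggregate number.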

References

Research on these semantic duplicates has often focused on their robustness to standard, n-gram based ‘decontamination’ methods, but the scale and significance of the phenomenon remains a mostly open question.

Soft Contamination Means Benchmarks Test Shallow Generalization (arXiv:2602.12413, Spiesberger et al., 12 Feb 2026), in Related Work