From Data Leak to Secret Misses: The Impact of Data Leakage on Secret Detection Models

Published 30 Jan 2026 in cs.CR, cs.AI, and cs.LG | (2601.22946v1)

Abstract: Machine learning models are increasingly used for software security tasks. These models are commonly trained and evaluated on large Internet-derived datasets, which often contain duplicated or highly similar samples. When such samples are split across training and test sets, data leakage may occur, allowing models to memorize patterns instead of learning to generalize. We investigate duplication in a widely used benchmark dataset of hard coded secrets and show how data leakage can substantially inflate the reported performance of AI-based secret detectors, resulting in a misleading picture of their real-world effectiveness.