Topology-Aware Subset Repair via Entropy-Guided Density and Graph Decomposition

Published 27 Jan 2026 in cs.DB | (2601.19671v1)

Abstract: Subset repair is an important data cleaning technique that enforces integrity constraints by deleting a minimal number of conflicting tuples, yet multiple minimal repairs often exist. Density-based methods address this ambiguity by favoring repairs that preserve dense, high-quality data regions; however, their effectiveness is limited by density bias from dirty clusters, high computational cost, and uniform attribute weighting. We propose a topology-aware approximate subset repair framework based on a joint density-conflict penalty model. The framework integrates three key components. First, a two-layer conflict detection strategy combines attribute inverted indexes with CFD rule grouping to efficiently identify violations. Second, we introduce EntroCFDensity, a density metric that incorporates information entropy and CFD weights to dynamically adjust attribute importance and reduce homogeneity bias. Third, a conflict degree measure is defined to complement local density, enabling a topology-adaptive penalty mechanism with dynamic weight allocation guided by the coefficient of variation. The conflict graph is further decomposed into independent subgraphs, transforming global repair into tractable local subproblems. Based on this framework, we develop two algorithms: PPIS, a scalable heuristic, and MICO, a mixed-integer programming method with theoretical guarantees. Experimental results show that our approach improves repair accuracy and robustness while effectively preserving high-quality data.