Cleaning data with Swipe
Abstract: The repair problem for functional dependencies asks how to modify an input database so that all functional dependencies are satisfied while the difference with the original database is minimal. The resulting database is called an optimal repair. If the allowed modifications are value updates, finding an optimal repair is NP-hard. A well-known approach to approximating optimal repairs builds a Chase tree in which each internal node resolves violations of one functional dependency and each leaf node represents a repair. A key property of this approach is that controlling the branching factor of the Chase tree makes it possible to control the trade-off between repair quality and computational efficiency. In this paper, we explore an extreme variant of this idea in which the Chase tree has only one path. To construct this path, we first partition the attributes such that the classes of the partition can be repaired sequentially. We repair each class only once and do so by fixing the order in which dependencies are repaired. This principle is called priority repairing, and we provide a simple heuristic to determine priority. The techniques for attribute partitioning and priority repair are combined in the Swipe algorithm. An empirical study on four real-life data sets shows that Swipe is one to three orders of magnitude faster than multi-sequence Chase-based approaches, while the quality of its repairs is comparable or better. Moreover, a scalability analysis shows that Swipe scales well as the number of tuples increases.
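To make the single-path idea concrete, the following is a minimal Python sketch. It is not the Swipe algorithm: the abstract does not specify how attributes are partitioned or how priorities are chosen, so this example assumes a hand-picked repair order and a simple majority-vote rule for choosing the repaired value. The function names `violations` and `repair_in_order`, as well as the toy table, are hypothetical and serve only as illustration.

```python
from collections import Counter, defaultdict

def violations(rows, lhs, rhs):
    """Return groups of row indices that agree on `lhs` but disagree on `rhs`."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[a] for a in lhs)].append(i)
    return [idx for idx in groups.values()
            if len({rows[i][rhs] for i in idx}) > 1]

def repair_in_order(rows, fds):
    """Visit each dependency once, in the given order (a single path).

    Assumption: each violating group is repaired by overwriting the
    right-hand side with its most frequent value (majority vote).
    Returns the repaired copy and the number of changed cells.
    """
    repaired = [dict(r) for r in rows]
    changed = 0
    for lhs, rhs in fds:
        for idx in violations(repaired, lhs, rhs):
            winner, _ = Counter(repaired[i][rhs] for i in idx).most_common(1)[0]
            for i in idx:
                if repaired[i][rhs] != winner:
                    repaired[i][rhs] = winner
                    changed += 1
    return repaired, changed

# Toy table (hypothetical data) with dependencies zip -> city and city -> state.
rows = [
    {"zip": "9000", "city": "Ghent",  "state": "OVL"},
    {"zip": "9000", "city": "Ghent",  "state": "OVL"},
    {"zip": "9000", "city": "Gent",   "state": "OVL"},
    {"zip": "8000", "city": "Bruges", "state": "WVL"},
    {"zip": "8000", "city": "Bruges", "state": "OVL"},
    {"zip": "8000", "city": "Bruges", "state": "WVL"},
]
fds = [(("zip",), "city"), (("city",), "state")]

repaired, changed = repair_in_order(rows, fds)
print(changed)  # 2
```

In this sketch the order matters: `city` is repaired before it is used as the left-hand side of the second dependency, so a later step never reintroduces a violation of an earlier one. For an arbitrary set of dependencies a fixed order does not guarantee this, which is the situation the attribute partitioning described in the abstract is meant to address.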