Papers
Topics
Authors
Recent
Search
2000 character limit reached

Step-by-Step Data Cleaning Recommendations to Improve ML Prediction Accuracy

Published 14 Mar 2025 in cs.DB | (2503.11366v1)

Abstract: Data quality is crucial in ML applications, as errors in the data can significantly impact the prediction accuracy of the underlying ML model. Therefore, data cleaning is an integral component of any ML pipeline. However, in practical scenarios, data cleaning incurs significant costs, as it often involves domain experts for configuring and executing the cleaning process. Thus, efficient resource allocation during data cleaning can enhance ML prediction accuracy while controlling expenses. This paper presents COMET, a system designed to optimize data cleaning efforts for ML tasks. COMET gives step-by-step recommendations on which feature to clean next, maximizing the efficiency of data cleaning under resource constraints. We evaluated COMET across various datasets, ML algorithms, and data error types, demonstrating its robustness and adaptability. Our results show that COMET consistently outperforms feature importance-based, random, and another well-known cleaning method, achieving up to 52 and on average 5 percentage points higher ML prediction accuracy than the proposed baselines.

Summary

No one has generated a summary of this paper yet.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.