DataLens: ML-Oriented Interactive Tabular Data Quality Dashboard

Published 28 Jan 2025 in cs.DB | (2501.17074v1)

Abstract: Maintaining high data quality is crucial for reliable data analysis and ML. However, existing data quality management tools often lack automation, interactivity, and integration with ML workflows. This demonstration paper introduces DataLens, a novel interactive dashboard designed to streamline and automate the data quality management process for tabular data. DataLens integrates a suite of data profiling, error detection, and repair tools, including statistical, rule-based, and ML-based methods. It features a user-in-the-loop module for interactive rule validation, data labeling, and custom rule definition, enabling domain experts to guide the cleaning process. Furthermore, DataLens implements an iterative cleaning module that automatically selects optimal cleaning tools based on downstream ML model performance. To ensure reproducibility, DataLens generates DataSheets capturing essential metadata and integrates with MLflow and Delta Lake for experiment tracking and data version control. This demonstration showcases DataLens's capabilities in effectively identifying and correcting data errors, improving data quality for downstream tasks, and promoting reproducibility in data cleaning pipelines.
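The abstract does not say how the iterative cleaning module chooses among tools, so the following is only a minimal sketch of the general idea: apply each candidate repair to the training data, fit a downstream model, and keep the repair whose model scores best on held-out data. Everything here is an assumption made for illustration and none of it is DataLens's API: the three repair functions, the one-feature least-squares model, the mean-squared-error metric, the clean validation split, and the toy data.

```python
from statistics import mean, median

# Hypothetical repair tools. Each takes rows of (x, y), where x may be None
# (a missing value), and returns rows with no missing x.
def drop_missing(rows):
    return [(x, y) for x, y in rows if x is not None]

def impute_mean(rows):
    fill = mean(x for x, _ in rows if x is not None)
    return [(fill if x is None else x, y) for x, y in rows]

def impute_median(rows):
    fill = median(x for x, _ in rows if x is not None)
    return [(fill if x is None else x, y) for x, y in rows]

def fit_line(rows):
    """Stand-in downstream model: least-squares fit of y = a*x + b."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in rows) / var if var else 0.0
    return a, my - a * mx

def mse(model, rows):
    """Mean squared error of the fitted line on the given rows."""
    a, b = model
    return mean((a * x + b - y) ** 2 for x, y in rows)

def select_cleaner(train, valid, cleaners):
    """Score every cleaner by downstream validation error; lower is better."""
    scores = {}
    for name, clean in cleaners.items():
        model = fit_line(clean(train))
        scores[name] = mse(model, valid)
    best = min(scores, key=scores.get)
    return best, scores

# Toy data (illustrative only): two training rows are missing the feature.
train = [(1, 2), (2, 4), (None, 6), (4, 8), (5, 10), (None, 30)]
valid = [(1.5, 3), (3, 6), (6, 12)]  # assumed to be clean
cleaners = {
    "drop_missing": drop_missing,
    "impute_mean": impute_mean,
    "impute_median": impute_median,
}
best, scores = select_cleaner(train, valid, cleaners)
print(best, scores)
```

An iterative variant would repeat this selection after each round of user feedback or rule updates, re-scoring the candidates against the same held-out data so that rounds stay comparable.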
