- The paper’s main contribution is its framework for data-centric AI, emphasizing systematic dataset engineering over traditional model-centric approaches.
- It details two dimensions, data refinement and data extension, and provides methodologies to enhance data quality and address informational gaps in ML systems.
- The study explores implications for business and systems engineering, advocating continuous data governance and collaborative data sharing.
Data-Centric Artificial Intelligence
Introduction to Data-Centric AI
Data-centric AI has emerged as a paradigm complementary to the traditional model-centric approach in machine learning (ML). This paper delineates data-centric AI, which treats the systematic design and engineering of datasets as fundamental to optimizing AI systems. Whereas model-centric AI focuses primarily on developing sophisticated algorithms and architectures, data-centric AI treats improvements in the quality and quantity of data as the primary driver of ML performance. Promoted by Andrew Ng and supported through various workshops, the paradigm acknowledges how critical data is to deploying AI systems effectively, particularly in real-world applications where public datasets and pre-trained models may not be available.
Figure 1: Data-centric AI as an emerging, complementary paradigm for the development of AI-based systems.
Dimensions of Data-Centric AI
This paper introduces a framework for data-centric AI, detailing two primary dimensions: data refinement and data extension.
- Data Refinement: This involves enhancing existing data quality. Key activities encompass improving feature and label quality, increasing the representation of high-relevance instances, and identifying and excluding low-quality data entries. Data refinement is increasingly supported by semi-automated tools, which assist in differentiating between outliers (to be removed) and edge cases (to be retained and augmented).
- Data Extension: This dimension focuses on acquiring additional data to fill informational "blind spots" that limit the model’s initial performance or cause distributional shifts over time. Methods for data extension include acquiring additional instances, collecting new features, and obtaining target labels for unlabeled instances.
These dimensions are visualized in the proposed framework for data-centric AI.
Figure 2: Framework for the systematic design and engineering of data for data-centric AI.
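The two dimensions can be illustrated with a minimal Python sketch. It is not the paper's method: the function names `flag_for_review` and `select_for_labeling` are hypothetical, the refinement step uses Tukey's interquartile-range fences as a stand-in for a semi-automated tool, and the extension step uses simple uncertainty sampling on assumed binary-classifier probabilities. Flagged entries are only candidates; deciding whether one is an outlier to remove or an edge case to retain still requires human review.

```python
from statistics import quantiles


def flag_for_review(values, k=1.5):
    """Data refinement (illustrative): return indices of values outside
    the Tukey fences [Q1 - k*IQR, Q3 + k*IQR].

    A flagged entry may be an outlier or a valuable edge case, so it is
    queued for human review rather than dropped automatically.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


def select_for_labeling(probabilities, budget):
    """Data extension (illustrative): pick the unlabeled instances whose
    predicted positive-class probability is closest to 0.5, i.e. where
    an assumed binary classifier is least certain.
    """
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:budget]


if __name__ == "__main__":
    readings = [10, 11, 12, 11, 10, 12, 11, 95]    # made-up sensor values
    print(flag_for_review(readings))               # [7]

    model_probs = [0.95, 0.52, 0.10, 0.45, 0.70]   # made-up model outputs
    print(select_for_labeling(model_probs, budget=2))  # [1, 3]
```

Keeping the refinement step as a flagging function rather than a filter mirrors the distinction drawn above: the tool narrows the search, and a domain expert makes the final call.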
Data-centric AI is distinct yet interconnected with several established concepts such as Big Data, MLOps, and data-driven approaches:
- Big Data: While both paradigms emphasize data accumulation to improve analytics, data-centric AI specifically focuses on the appropriateness and systematic curation of data rather than mere volume.
- MLOps: This practice encompasses the deployment of AI systems but has traditionally underemphasized the monitoring and management of datasets. Data-centric AI thus enhances MLOps by introducing the data management practices needed for iterative improvement of AI models.
- Data-Driven Methods: Data-centric AI does not replace model-driven approaches. It operates alongside them, emphasizing a data lifecycle of refinement and extension to create value.
The integration of data-centric AI has several implications for the Business & Information Systems Engineering (BISE) community at the individual, organizational, and cross-organizational levels:
- Individual Level: Practitioners need advanced data visualization and exploration tools that leverage domain knowledge to improve data understanding and refinement. Human-in-the-loop systems are crucial to align data work with domain-specific insights.
- Organizational Level: Organizations must adopt continuous monitoring processes to ensure data quality and relevance, which are central to sustaining AI model performance. Effective data governance frameworks, including an extension of CRISP-DM, delineate pathways for iterative data work and AI system development.
- Cross-Organizational Level: Data sharing initiatives require development of infrastructures supporting data exchange, underpinned by standardized practices and governance. Ensuring fairness and collaboration across entities enhances the value derived from federated learning approaches.
Figure 3: Proposed areas of BISE research for the advancement of data-centric AI.
Figure 4: Extending the Cross-Industry Standard Process for Data Mining (CRISP-DM) based on considerations from data-centric AI.
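The continuous monitoring described at the organizational level can be sketched as a periodic drift check. The example below is an assumption, not something the paper prescribes: it uses the Population Stability Index (PSI) over equal-width bins, the function name `population_stability_index` is hypothetical, and the 0.25 alert threshold is a common rule of thumb rather than a fixed standard.

```python
import math


def population_stability_index(reference, current, bins=10, eps=1e-6):
    """Compare the distribution of one numeric feature in newly collected
    data (`current`) against the training-time data (`reference`).

    PSI = sum over bins of (cur - ref) * ln(cur / ref); 0 means the
    binned distributions are identical, larger values mean more drift.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back if the feature is constant

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(sample), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


if __name__ == "__main__":
    training = list(range(100))                 # made-up reference feature
    production = [x + 50 for x in range(100)]   # made-up shifted feature

    score = population_stability_index(training, production)
    if score > 0.25:  # rule-of-thumb threshold, an assumption
        print(f"PSI={score:.2f}: drift suspected, review or extend the data")
```

A check like this would typically run per feature on a schedule, with alerts feeding back into the refinement and extension activities rather than triggering automatic retraining.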
Conclusion
Data-centric AI introduces a paradigm shift essential for advancing the efficacy of AI systems across domains. The approach stresses careful data work and proactive governance as the means of obtaining high-quality datasets, which in turn improve the performance and applicability of AI technologies. It also points to evolved methodologies and to synergies between data-centric strategies and traditional model-centric approaches, opening opportunities for both academic inquiry and practical implementation in AI development.