
Domain-Driven Data Curation

Updated 21 January 2026
  • Domain-Driven Data Curation is a specialized approach that embeds expert knowledge into data selection, cleaning, and enrichment processes.
  • It employs human-in-the-loop methods, semantic techniques, and agent-driven pipelines to meet context-specific quality and relevance needs.
  • Optimization frameworks and hybrid workflows balance automation with expert feedback, enhancing data utility across diverse sectors.

Domain-driven data curation refers to the set of data selection, cleaning, enrichment, annotation, and integration practices that are contextually embedded within a specific professional or scientific domain, leveraging the tacit and explicit knowledge of domain experts to increase the relevance and usefulness of curated data. Unlike general-purpose curation approaches, which treat data as generic objects to be processed according to standardized workflows or FAIR (Findable, Accessible, Interoperable, Reusable) principles, domain-driven strategies recognize and operationalize local practices, needs, error modalities, and data-use scenarios across diverse application domains. This paradigm manifests in settings ranging from facility management and digital humanities to biomedical informatics, web-scale language modeling, and speech enhancement, each harnessing domain-specific expertise to tailor curation logic, infrastructure, and evaluation metrics (Sporsem et al., 2021, Leavy et al., 2023, Caufield et al., 2024, Wettig et al., 14 Feb 2025, Chen et al., 2023, Li et al., 30 Jun 2025).

1. Distinction Between General and Domain-Driven Data Curation

General data curation workflows are lifecycle-centered, covering data collection, cleaning, storage, dissemination, and preservation. These workflows often enforce technology-agnostic standards (e.g., schema templates, anonymization protocols, generic quality metrics) and prioritize broad interoperability, sometimes at the expense of local utility.

Domain-driven data curation, by contrast, internalizes domain- and task-specific requirements into every stage of the workflow. Frontline experts—such as janitors (facility management), humanities scholars (literary curation), biomedical curators (biocuration), or data scientists—shape curation decisions based on local operational context, error typologies, and reuse scenarios. This yields the following contrasts:

Aspect            | General Curation               | Domain-Driven Curation
Context awareness | Minimal, template-driven       | High, scenario-explicit
Task allocation   | Centralized data team          | Embedded in domain practice
Quality criteria  | Generic (e.g., missing values) | Domain-specific, nuanced
Tooling           | One-size-fits-all              | Locally adapted, flexible

This contextual embedding enables curation choices that maximize domain value but may diverge from standard protocol, for example by selectively ignoring data considered "noise" in a particular facility management context (Sporsem et al., 2021).

2. Domain-Driven Methods and Architectures

Across application areas, domain-driven data curation is realized through architecture and workflow designs that integrate domain ontologies, human-in-the-loop mechanisms, agentic reasoning, and hybrid automation.

  • Human-centered curation in facility management: Janitors curate building data by correcting, filtering, or enriching records based on routine observations, often working outside formal templates and using ad-hoc tools (e.g., Excel lists, annotated photos) (Sporsem et al., 2021).
  • Semantic curation in digital humanities: Platforms such as Curatr employ neural word embeddings (CBOW, Word2Vec) combined with domain-expert-guided lexicon expansion to extract meaningfully themed sub-corpora from large digitized text collections (Leavy et al., 2023).
  • Agent-driven biocuration: CurateGPT orchestrates modular LLM-driven agents that search, extract, annotate, and validate biomedical knowledge in close integration with domain ontologies, standard schemas (LinkML), and knowledge bases. Retrieval-augmented generation (RAG) and vector indexing enable semantic reasoning and evidence linking (Caufield et al., 2024).
  • Web-scale domain construction: WebOrganizer applies a two-layer taxonomy (topics and formats), with LLM-distilled classifiers labeling massive web corpora. Downstream mixture optimization techniques (RegMix) determine optimal domain combinations for pre-training and transfer (Wettig et al., 14 Feb 2025).
  • LLM-as-compiler paradigm: SEED compiles user-provided task descriptions into optimized hybrid data curation pipelines, combining LLM-generated code, cache reuse, fine-tuned small models, and direct LLM calls, adaptively minimizing cost while tuning for domain-specific accuracy (Chen et al., 2023).
  • Speech enhancement pipeline: Data curation leverages domain knowledge of label pathologies; neural quality metrics and multi-stage filtering isolate high-fidelity speech data, emphasizing that carefully curated subsets outperform indiscriminately scaled large datasets (Li et al., 30 Jun 2025).
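The expert-in-the-loop lexicon expansion used by platforms like Curatr can be sketched with toy embeddings: seed terms are expanded via cosine-nearest neighbors in embedding space, and a domain expert accepts or rejects each candidate. The vocabulary, vectors, and simulated "expert accept set" below are illustrative inventions, not Curatr's actual data or API.

```python
import numpy as np

# Toy word vectors (invented for illustration; real systems train
# CBOW/Word2Vec embeddings on the digitized corpus).
vocab = ["famine", "hunger", "starvation", "harvest", "cricket", "railway"]
rng = np.random.default_rng(0)
emb = rng.normal(size=(len(vocab), 8))
# Nudge thematically related words toward the seed for demonstration.
emb[1] = emb[0] + 0.1 * rng.normal(size=8)   # hunger ~ famine
emb[2] = emb[0] + 0.1 * rng.normal(size=8)   # starvation ~ famine
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

def expand_lexicon(seeds, k=3):
    """Propose the top-k cosine-nearest neighbors of the seed centroid."""
    idx = [vocab.index(w) for w in seeds]
    centroid = emb[idx].mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    sims = emb @ centroid
    order = np.argsort(-sims)
    return [vocab[i] for i in order if vocab[i] not in seeds][:k]

candidates = expand_lexicon(["famine"])
# Expert-in-the-loop step: a scholar reviews each proposal (simulated here).
accepted = [w for w in candidates if w in {"hunger", "starvation"}]
print(candidates, accepted)
```

In a real workflow the accepted terms feed back into the seed set, and the expand/review cycle repeats, which is what prevents semantic drift.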

3. Analytical Models and Optimization Frameworks

To formalize data curation activities and optimize curation pipelines, several analytical and optimization frameworks have been proposed:

  • Curation activity taxonomy: Following Parmiggiani & Grisot's (2020) model, domain curation practices can be decomposed into $\mathcal{P}_{\mathrm{DC}} = \{P_1, P_2, P_3\}$, where $P_1$ = Achieve Quality, $P_2$ = Filter Relevance, and $P_3$ = Ensure Protection. Domain studies may realize only a subset, e.g., facility management data practices mapped to $P_1$ and $P_2$ (Sporsem et al., 2021).
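The subset structure of this taxonomy can be made concrete with a small coverage check; the activity labels come from the text, while the biocuration mapping below is a hypothetical illustration of full coverage.

```python
# Parmiggiani & Grisot-style curation activity taxonomy (labels from the text).
P1, P2, P3 = "Achieve Quality", "Filter Relevance", "Ensure Protection"
taxonomy = {P1, P2, P3}

# Observed activity subsets per domain: facility management realizes only
# P1 and P2 (per Sporsem et al.); the biocuration entry is hypothetical.
observed = {
    "facility_management": {P1, P2},
    "biocuration": {P1, P2, P3},
}

for domain, activities in observed.items():
    missing = taxonomy - activities
    print(domain, "covers", sorted(activities), "missing", sorted(missing))
```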
  • Mixture model optimization: Web domain curation tasks are abstracted as a mixture model:

$$p_{\mathrm{mix}}(x) = \sum_{d=1}^{D} w_d \, p_d(x)$$

where $w_d$ are mixture weights for domain $d$. RegMix predicts the optimal $w$ by regressing LM validation loss on sampled mixtures, subject to upsampling/sparsity constraints (Wettig et al., 14 Feb 2025).
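The regress-then-select idea behind RegMix can be sketched as follows. The synthetic quadratic loss surface, the simple least-squares regressor, and the selection over sampled mixtures are simplifying assumptions for illustration, not the paper's exact procedure (which trains small proxy LMs and uses stronger regressors).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4                                    # number of web domains

# Sample candidate mixture weights w on the simplex (Dirichlet draws).
W = rng.dirichlet(np.ones(D), size=200)

# Synthetic stand-in for LM validation loss from small proxy runs:
# an invented quadratic with a known optimum, plus noise.
w_true = np.array([0.4, 0.3, 0.2, 0.1])
loss = ((W - w_true) ** 2).sum(axis=1) + 0.01 * rng.normal(size=len(W))

# Fit loss ~ f(w) with linear + quadratic features and least squares.
X = np.hstack([W, W ** 2, np.ones((len(W), 1))])
coef, *_ = np.linalg.lstsq(X, loss, rcond=None)

def predict(w):
    return np.concatenate([w, w ** 2, [1.0]]) @ coef

# Choose the sampled mixture with the lowest predicted loss.
best = W[np.argmin([predict(w) for w in W])]
print(best)
```

The selected mixture sums to one by construction and lands much closer to the (here known) optimum than an average sample, which is the mechanism RegMix exploits at scale.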

  • Hybrid pipeline search: SEED models pipeline cost $C(P)$ and accuracy $A(P)$ for curation tasks, minimizing $C(P)$ while constraining $A(P) \geq A(P^*) - G$ for a user-tolerated gap $G$, using dynamic programming and subplan skyline pruning (Chen et al., 2023).
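The constrained optimization SEED performs can be illustrated with a brute-force search over candidate pipelines. The pipeline names, per-record costs, and accuracies below are invented stand-ins, and exhaustive enumeration replaces SEED's dynamic programming and skyline pruning.

```python
# Candidate curation pipelines with (cost per record, accuracy).
# All values are illustrative, not SEED's measured numbers.
pipelines = {
    "llm_only":         (1.00, 0.95),  # direct LLM call on every record
    "code_then_llm":    (0.35, 0.93),  # LLM-generated code, LLM fallback
    "cache_code_llm":   (0.20, 0.91),  # + cache reuse before fallback
    "small_model_only": (0.05, 0.80),  # fine-tuned small model, no fallback
}

A_star = max(acc for _, acc in pipelines.values())  # best achievable accuracy
G = 0.05                                            # user-tolerated accuracy gap

# Minimize C(P) subject to A(P) >= A* - G.
feasible = {name: (c, a) for name, (c, a) in pipelines.items()
            if a >= A_star - G}
best = min(feasible, key=lambda name: feasible[name][0])
print(best, feasible[best])
```

With these toy numbers, the cheapest pipeline that stays within the accuracy gap is the hybrid with caching, which mirrors SEED's finding that mixed code/cache/LLM plans dominate pure LLM calls on cost.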
  • Speech data ranking: Quality measures $s_{i,m}$ across multiple neural MOS predictors are normalized and summed to produce a unified ranking score $S_i = \sum_{m} \tilde{s}_{i,m}$ that drives top-$X$ selection (Li et al., 30 Jun 2025).
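The ranking step in the last bullet can be sketched directly: per-metric scores are normalized across clips so that scales are comparable, summed into $S_i$, and the top-$X$ clips are kept. The random scores stand in for DNSMOS/NISQA-style predictor outputs, and z-score normalization is an assumption; the paper's exact normalization may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n_clips, n_metrics = 100, 5  # e.g. DNSMOS, NISQA, SIGMOS, UTMOS, SQUIM-SDR

# Random stand-ins for per-clip scores from each neural quality predictor;
# note the metrics live on different scales (e.g. MOS vs. SDR in dB).
s = rng.normal(loc=[3.0, 2.5, 3.5, 3.2, 10.0],
               scale=[0.5, 0.4, 0.6, 0.5, 4.0],
               size=(n_clips, n_metrics))

# Normalize each metric across clips, then sum: S_i = sum_m s~_{i,m}.
s_tilde = (s - s.mean(axis=0)) / s.std(axis=0)
S = s_tilde.sum(axis=1)

# Top-X selection of the highest-fidelity clips.
X = 20
top = np.argsort(-S)[:X]
print(top[:5])
```

Normalizing before summing prevents a wide-ranged metric (here the dB-scaled one) from dominating the fused score, which is the point of multi-metric fusion.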

4. Engagement of Domain Experts and Human-in-the-Loop Design

Domain-driven curation requires purposeful engagement of practitioners whose expertise is critical for data relevance and accuracy:

  • Invisible work and motivation: When practitioners' curation work is rendered invisible—treated as mere data acquisition rather than knowledge curation—engagement collapses, resulting in lower data quality and limited reuse. Feedback, purpose transparency, and participatory infrastructure design are essential interventions (Sporsem et al., 2021).
  • Interactive lexicon expansion: In digital humanities, expert-in-the-loop workflows enable iterative lexicon/embedding refinement and transparent review, preventing semantic drift and surfacing novelty (Leavy et al., 2023).
  • Hybrid agent orchestration: Biocuration architectures utilize agents for search, extraction, validation, and knowledge integration, combining external evidence and structured ontologies with LLM reasoning, while curators review outputs and resolve ambiguities (Caufield et al., 2024).
  • LLM-powered automation with user control: SEED’s compiler leverages LLM-generated code and models but always allows for fallback and user inspection, with module order and thresholds optimized per task (Chen et al., 2023).

5. Domain-Specific Quality Assessment and Evaluation

Unlike domain-agnostic pipelines, domain-driven curation employs specialized metrics and selection procedures aligned with characteristic error modalities and downstream requirements:

  • Contextual quality filters: In speech enhancement, neural estimators capturing speech-specific artifacts (e.g., DNSMOS, NISQA, SIGMOS, UTMOS, SQUIM-SDR) are used for systematic filtering and ranking, outperforming random or volume-maximizing approaches (Li et al., 30 Jun 2025).
  • Tailored relevance and novelty: In humanities curation, domain expert judgment and measures of content/novelty relevance guide sub-corpus selection, whose effectiveness is primarily reflected by increased diversity of retrieved authors and themes, rather than precision/recall (Leavy et al., 2023).
  • Evidence-linked annotation: Biocuration workflows report recall, precision, F1 metrics, and curator throughput (entries/minute), tracking both efficiency and accuracy gains relative to manual and direct LLM-only approaches (Caufield et al., 2024).
  • Hybrid accuracy/cost frontier: SEED documents experimental optimization over curation accuracy and LLM call budget, with reported improvements in both overall error reduction and resource consumption through domain-calibrated pipelines (Chen et al., 2023).
  • Domain-compliant class mixtures: Web curation evaluates marginal and joint effects of topic and format balancing, with downstream task metrics (e.g., average accuracy on MMLU, HellaSwag) revealing task-dependent optimal mixtures (Wettig et al., 14 Feb 2025).
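The evidence-linked annotation metrics above (precision, recall, F1 against a gold standard, plus curator throughput in entries/minute) can be computed with a minimal scorer. The annotation identifiers and counts below are toy values, not results from Caufield et al.

```python
# Toy gold-standard vs. predicted annotation sets (illustrative ontology IDs).
gold = {"GO:0008150", "GO:0003674", "GO:0005575", "HP:0000118"}
pred = {"GO:0008150", "GO:0003674", "HP:0001250"}

tp = len(gold & pred)                 # true positives: shared annotations
precision = tp / len(pred)            # fraction of predictions that are correct
recall = tp / len(gold)               # fraction of gold annotations recovered
f1 = 2 * precision * recall / (precision + recall)

# Curator throughput: entries reviewed per minute (toy session numbers).
entries, minutes = 120, 90
throughput = entries / minutes

print(round(precision, 3), round(recall, 3), round(f1, 3), round(throughput, 2))
```

Tracking throughput alongside F1 is what lets these studies claim efficiency gains over manual curation without hiding accuracy regressions.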

6. Best Practices, Challenges, and Generalization

Several cross-domain insights and best practices have emerged:

  • Make curation visible: Recognize and reward expert involvement; integrate feedback loops and participatory design throughout the data infrastructure (Sporsem et al., 2021).
  • Iterate curation and retrieval: Human refinement of lexicons and result sets should be cyclic, facilitating continual adaptation as new information or user needs arise (Leavy et al., 2023).
  • Architect for flexibility and transparency: Expose all relevant parameters, allow ad-hoc tool integration, and document provenance and agent operations (Caufield et al., 2024, Chen et al., 2023).
  • Leverage multi-metric quality fusion: Fuse and normalize contextually relevant metrics for filtering and ranking, rather than relying on any single indicator (Li et al., 30 Jun 2025).
  • Optimize for task and domain: Employ mixture modeling, agent orchestration, and hybrid code/LLM generation calibrated for each use case; regular re-optimization enhances both efficiency and domain fit (Wettig et al., 14 Feb 2025, Chen et al., 2023).
  • Plan for evolution: Domains, error modalities, and data-use scenarios evolve rapidly, necessitating flexible curation workflows and infrastructure design.

A plausible implication is that future advances in domain-driven data curation will increasingly couple automated machine learning models, generative agents, and expert-in-the-loop processes, sustaining both scale and accuracy while preserving the interpretability and trust required for high-impact applications. The domain-driven paradigm has proven effective and adaptable across a spectrum of sectors, suggesting broad transferability given sufficient contextualization and user engagement.
