
Domain-Informed Synthetic Data

Updated 10 February 2026
  • Domain-informed synthetic data are synthetic datasets generated by integrating explicit expert knowledge, structured constraints, and causal models to mirror real-world phenomena.
  • They enable practical applications in privacy-critical analytics, scientific modeling, and high-stakes decision-making through methods like rule-based encoding, indicator prompts, and domain randomization.
  • Real-world implementations, such as blockchain cyberattack detection, NLI domain generalization, and medical imaging segmentation, showcase significant gains in constraint satisfaction and performance.

Domain-informed synthetic data refers to synthetic datasets whose generation is guided or constrained by explicit expert knowledge, structured domain rules, or inductive biases reflective of salient properties in a target application domain. This class of synthetic data generation methods contrasts with generic, purely data-driven or unsupervised approaches by systematically encoding, extracting, or enforcing semantic signals, regulatory requirements, causality, or physical laws into the generative pipeline. Domain-informed approaches have been demonstrated to be essential for realism, utility, fairness, privacy, and out-of-domain generalization in a wide range of settings, including scientific modeling, high-stakes decision-making, and privacy-critical applications.

1. Formal Problem Setting and Taxonomy

Domain-informed synthetic data generation seeks to produce data distributions $P_\text{synth}$ that, beyond matching superficial statistics of a real dataset, also adhere to domain constraints, capture rare or structured phenomena, permit safe sharing, and maintain or improve downstream task performance. The general objective can be formalized as:

$$\theta^* = \arg\min_\theta \mathbb{E}_{d \sim p(d),\, z \sim Z} \left[ \ell\big(G_\theta(z, d),\, y\big) \right]$$

where $G_\theta$ is the parametric synthetic-data generator, $d$ encodes domain indicators, $z$ is a source of randomness, $y$ is a label, $\ell$ is a downstream loss, and $p(d)$ models the distribution of domain knowledge (Razmyslovich et al., 19 Mar 2025). Different instantiations enforce domain conformity via:

  • Discrete domain signals (indicator tokens, metadata prompts)
  • Structured rules (hard or soft constraints, knowledge graphs)
  • Physical/causal models (differentiable laws, causal graphs)
  • Stratified attribute distributions (label, length, or class balancing)
  • Multi-stage pipelines integrating extraction, filtering, and curation
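The objective above can be made concrete with a toy Monte Carlo estimate. This is a minimal sketch, not any cited system: the linear "generator," the uniform domain distribution, and the squared-error surrogate for $\ell$ are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "generator": G_theta(z, d) = theta[d] * z, one parameter per domain.
# All names here are illustrative; the paper's G_theta is an arbitrary model.
def generate(theta, z, d):
    return theta[d] * z

def downstream_loss(x, y):
    # Placeholder squared-error surrogate for the downstream loss ell.
    return (x - y) ** 2

# Monte Carlo estimate of E_{d ~ p(d), z ~ Z}[ ell(G_theta(z, d), y) ].
def objective(theta, n=10_000):
    d = rng.integers(0, len(theta), size=n)  # d ~ p(d), uniform here
    z = rng.normal(size=n)                   # z ~ Z
    y = 2.0 * z                              # toy target tied to z
    return downstream_loss(generate(theta, z, d), y).mean()

# theta matching the target (theta[d] = 2 for every domain) minimizes the loss.
print(objective(np.array([2.0, 2.0])) < objective(np.array([0.5, 1.0])))
```

A gradient-based optimizer would minimize `objective` over `theta`; the point of the sketch is only that the expectation ranges jointly over domain indicators and noise.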

This leads to a taxonomy organized by the source and function of domain information:

| Family | Mechanism | Example Application |
| --- | --- | --- |
| Rule-/Constraint-aware | Boolean rules, knowledge graphs, regularizers | Tabular data, privacy |
| Causal/model-based | Structural equations, physical laws | Causal fairness, physics |
| Indicator-/Prompt-based | Domain prompts, token indicators | LLM tasks, event detection |
| Statistics-guided filtering | Domain-specific performance metrics | Tabular/image augmentation |
| Domain randomization | Priors on anatomical/pathological variation | Medical imaging OOD |

2. Domain Knowledge Extraction and Encoding

The fidelity and effectiveness of domain-informed generation critically depend on the systematic extraction, selection, and encoding of relevant domain signals.

Domain Indicator Extraction

A principled instance is ELTEX (Razmyslovich et al., 19 Mar 2025), which uses multi-LLM querying, iterative summarization, and deduplication to extract domain indicators $d \in D$ from a seed corpus. Indicators, i.e., semantic tokens or feature vectors such as "wallet compromise", anchor prompt-based LLM generation:

  1. Compile candidate domain indicators across multiple expert LLMs.
  2. Iteratively summarize and deduplicate with low-temperature LLM prompting until semantic convergence.
  3. Optionally validate with domain expertise for stabilization.
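The deduplication in step 2 can be sketched with a simple token-overlap similarity standing in for the paper's low-temperature LLM summarization; the threshold and the Jaccard measure are assumptions for illustration, not ELTEX's actual mechanism.

```python
# Sketch of indicator deduplication: keep a candidate only if it is
# sufficiently dissimilar from everything already kept. Jaccard overlap
# over lowercased tokens is a stand-in for LLM-judged semantic equivalence.
def jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def dedupe_indicators(candidates, threshold=0.5):
    kept = []
    for cand in candidates:
        if all(jaccard(cand, k) < threshold for k in kept):
            kept.append(cand)
    return kept

pool = ["wallet compromise", "compromised wallet keys",
        "Wallet compromise", "phishing airdrop"]
print(dedupe_indicators(pool))
```

Iterating this pass until the kept set stops shrinking approximates the "until semantic convergence" criterion.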

Rule and Constraint Encoding

Rule-adhering systems express domain constraints as Boolean indicator functions $g_j(x)$ over records. These are incorporated either at training time (as a penalty in the likelihood or objective) or at sampling time (via rejection or filtering). The constraint matrix $C$ approach unifies explicit and implicit rules, encoding both stated policies and inferred domain knowledge (Kotal et al., 9 Dec 2025, Platzer et al., 2022).
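The indicator functions $g_j(x)$ can be encoded directly as predicates over records. A minimal sketch, with rule contents invented for illustration (one implicit-knowledge rule, one stated-policy rule):

```python
# Domain rules as Boolean indicator functions g_j(x) over records,
# returning True on violation (the convention used for training penalties).
RULES = [
    lambda x: x["age"] < 18 and x["has_license"],  # implicit-knowledge rule
    lambda x: x["discharge"] < x["admission"],     # stated-policy rule
]

def violations(record):
    """Total number of violated rules, i.e. sum_j g_j(x)."""
    return sum(int(g(record)) for g in RULES)

rec = {"age": 16, "has_license": True, "admission": 3, "discharge": 7}
print(violations(rec))  # the age/license rule fires
```

The same `violations` count serves both roles described above: averaged over generator samples it becomes a training penalty, and compared against zero it becomes a sampling-time filter.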

Causal Graphs and Knowledge Graphs

Causal generative models (CGMs) and knowledge-infused GANs define structural equations consistent with domain causal graphs or knowledge ontologies (Iommi et al., 20 Nov 2025, Kotal et al., 2024). Nodes/edges reflect domain attributes and their social, physical, or regulatory relationships. Knowledge graphs further enable categorical aggregation, property masking, and inferential flagging used as generator inputs or conditional vectors.
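A structural causal model faithful to a domain DAG can be sampled directly, with sensitive edges exposed as tunable parameters. This is a hedged sketch: the three-node graph, variable names, and noise scales are illustrative, not drawn from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(1)

# Minimal SCM over the DAG  A -> X -> Y, with the A -> X edge weight
# exposed as a traceable `bias` parameter for fairness auditing.
def sample_scm(n, bias=0.0):
    a = rng.binomial(1, 0.5, size=n)       # protected attribute
    x = a * bias + rng.normal(size=n)      # mediator; `bias` is the sensitive edge
    y = x + rng.normal(scale=0.1, size=n)  # outcome
    return a, x, y

# Setting bias=0 severs the sensitive edge, giving a fairness-controlled dataset.
a, x, y = sample_scm(5000, bias=0.0)
print(abs(y[a == 1].mean() - y[a == 0].mean()) < 0.2)
```

Sweeping `bias` produces datasets with a known, controllable level of outcome disparity, which is the mechanism that makes fairness auditing traceable.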

3. Synthesis Methodologies and Constraint Enforcement

Domain-informed synthesis frameworks implement a diverse range of generative models, unified by explicit infusions of domain knowledge.

Prompt-based LLM Pipelines

LLM-guided pipelines, exemplified by ELTEX, generate text conditioned on extracted domain indicators by dynamically constructing prompts that foreground domain tokens, include real-seed examples, and—in the case of classification tasks—yield synthetic corpora tailored for specialized downstream tasks (e.g., blockchain cyberattack detection). Indicator weighting heuristics (e.g., static prompt ordering, heuristic α in token logit summation) guide sample diversity and coherence (Razmyslovich et al., 19 Mar 2025).
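Indicator-conditioned prompt assembly can be sketched as follows; the template wording and field names are assumptions for illustration, not ELTEX's exact prompt format.

```python
# Sketch: construct a generation prompt that foregrounds domain indicators
# and includes a real seed example for style grounding.
def build_prompt(indicators, seed_example,
                 task="blockchain cyberattack report"):
    lines = [f"Generate a realistic {task}.",
             "Ground the text in these domain indicators:"]
    lines += [f"- {ind}" for ind in indicators]
    lines += ["Style reference (real seed example):", seed_example]
    return "\n".join(lines)

prompt = build_prompt(
    ["wallet compromise", "bridge exploit"],
    "Attackers drained funds after a validator key leak.",
)
print(prompt)
```

Varying which indicators are foregrounded, and in what order, is one simple lever for the diversity/coherence trade-off the indicator-weighting heuristics address.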

Rule/Constraint-aware Neural Density Estimation

For mixed-type tabular data, rule-adherence can be enforced via:

  • Loss regularization: add $\lambda\,\mathbb{E}_{x \sim P_\theta}\left[\sum_{j=1}^{R} g_j(x)\right]$ to the negative log-likelihood (Platzer et al., 2022).
  • Sampling-time rejection: draw from $P_\theta$ and accept only samples passing all rules.
  • GAN discriminator/auxiliary loss: penalize constraint violations in the generator loss (cross-entropy to expected property masks or constraint satisfaction) (Kotal et al., 2024, Kotal et al., 9 Dec 2025).
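The sampling-time rejection option is the simplest to implement. A minimal sketch, with a Gaussian sampler standing in for the trained model $P_\theta$ and a toy non-negativity rule:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy hard constraint: only non-negative values are admissible.
def passes_rules(x):
    return x >= 0.0

# Draw batches from the model (stand-in: a Gaussian) and keep only
# rule-satisfying samples until n have been accepted.
def rejection_sample(n):
    kept = []
    while len(kept) < n:
        batch = rng.normal(loc=1.0, size=256)  # draw from P_theta (stand-in)
        kept.extend(v for v in batch if passes_rules(v))
    return np.array(kept[:n])

samples = rejection_sample(1000)
print((samples >= 0).all())
```

Rejection guarantees zero violations in the output but wastes compute when the model assigns substantial mass to the infeasible region, which is why training-time penalties are usually combined with it.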

Physics and Causal Models

Domain-constrained diffusion models for scientific or engineering data incorporate differentiable constraints (e.g., Kirchhoff laws) as manifold guidance steps during sampling, while causal SCM-based CGMs encode generative processes faithful to domain DAGs, including mechanisms for bias modification and group-fairness auditing (Hoseinpour et al., 12 Jun 2025, Iommi et al., 20 Nov 2025).
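The manifold-guidance idea can be sketched as a gradient step on a differentiable constraint penalty, applied after each denoising step. A toy Kirchhoff-like constraint (currents at a node summing to zero) stands in for the real physics; the step size and iteration count are illustrative.

```python
import numpy as np

# Differentiable penalty r(x)^2 for the constraint sum(x) = 0, and its gradient.
def penalty_grad(x):
    r = x.sum()                       # residual of the constraint
    return 2.0 * r * np.ones_like(x)  # gradient of r^2 w.r.t. each component

# Guidance: nudge a raw model sample down the penalty gradient.
# In a diffusion sampler this would run after each denoising step.
def guide(x, step=0.1, iters=50):
    for _ in range(iters):
        x = x - step * penalty_grad(x)
    return x

x = np.array([1.0, 2.0, -0.5])  # raw sample, violates the constraint
print(abs(guide(x).sum()) < 1e-6)
```

Each iteration shrinks the residual geometrically (here by a factor of 0.4), so a handful of guidance steps suffices to land near the constraint manifold without discarding the sample.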

Domain Randomization and Augmentation

Domain-specific randomization strategies (e.g., pathological anatomy priors, intensity clustering in imaging) combine expert-defined transformation ranges, label-space augmentation, and statistical models. These enable robust out-of-domain generalization for tasks such as rare disease segmentation or cross-scanner adaptation (Plana et al., 28 Aug 2025, Zalevskyi et al., 2024).
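Such randomization amounts to sampling augmentation parameters from narrow, expert-verified intervals rather than arbitrary wide priors. A minimal sketch; the parameter names and ranges below are made up for illustration, not taken from the cited imaging pipelines.

```python
import numpy as np

rng = np.random.default_rng(3)

# Expert-defined transformation ranges (all values illustrative).
RANGES = {
    "rotation_deg": (-5.0, 5.0),     # scanner tilt observed in practice
    "intensity_scale": (0.9, 1.1),   # cross-scanner intensity drift
    "lesion_radius_mm": (1.0, 4.0),  # pathology prior from expert review
}

def sample_augmentation():
    """Draw one augmentation configuration from the expert-verified priors."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in RANGES.items()}

params = sample_augmentation()
print(all(lo <= params[k] <= hi for k, (lo, hi) in RANGES.items()))
```

Keeping the intervals empirically grounded is what separates domain-informed randomization from generic augmentation: implausible parameter draws never enter the training distribution.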

Data Filtering, Curation, and Integration

Statistical learning pipelines integrate large-scale synthetic data with real data by domain-specific filtering (Wasserstein/MMD distances on VAE embeddings; out-of-distribution generalization/boostability tests), selection of optimal mixing ratios, and automated multi-modal curation (object detection, vision-language alignment, aesthetic scoring, and user-preference classifiers) (Jiang et al., 8 May 2025, Yoon et al., 13 Jan 2026).
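A distance-based filter of this kind can be sketched with an RBF-kernel MMD between batches. This is a generic sketch: the cited pipelines compute the distance on VAE embeddings, whereas raw 2-D vectors are used here, and the kernel bandwidth is an assumption.

```python
import numpy as np

# Biased empirical estimate of squared MMD with an RBF kernel.
def mmd_rbf(x, y, gamma=1.0):
    def k(a, b):
        d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d)
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

rng = np.random.default_rng(4)
real = rng.normal(size=(200, 2))
close = rng.normal(size=(200, 2))           # synthetic batch near real data
far = rng.normal(loc=3.0, size=(200, 2))    # off-distribution batch
print(mmd_rbf(real, close) < mmd_rbf(real, far))
```

A pipeline would keep synthetic batches whose MMD to the real (embedded) data falls under a tuned threshold, then search over real/synthetic mixing ratios for the downstream task.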

4. Case Studies and Empirical Evaluation

Prominent domain-informed synthetic data generation frameworks have demonstrated tangible gains on real-world tasks:

  • Blockchain Cyberattack Detection: ELTEX-generated data improved Gemma-2B F1 from 0.30 (without fine-tuning) to 0.76 (synthetic data only), and to 0.81 in hybrid settings, rivaling larger commercial models (Razmyslovich et al., 19 Mar 2025).
  • NLI Domain Generalization: A domain- and length-stratified synthetic NLI corpus (38 domains) yielded a +7 ROC-AUC gain on the TRUE benchmark over MNLI/WANLI baselines (Hosseini et al., 2024).
  • Tabular Data with Power-system Constraints: Guided diffusion achieved >95% constraint satisfaction with distributional fidelity, outperforming unguided approaches (Hoseinpour et al., 12 Jun 2025).
  • Privacy- and Regulation-Driven Synthesis: KIPPS and ContextGAN combined knowledge-graph or constraint-matrix mechanisms with DP-SGD in the discriminator, achieving high fidelity (e.g., PMSE = 0.24) and strong privacy resilience (membership inference ≈0.47) while preserving utility (<3% drop in ML accuracy) (Kotal et al., 2024, Kotal et al., 9 Dec 2025).
  • Cross-domain Retrieval: SynCDR used text-to-image diffusion bridge synthesis, filling category gaps to yield up to 15% Prec@1 improvement in zero-overlap benchmarks (Mishra et al., 2023).
  • Medical Imaging OOD Segmentation: Pathology-informed domain randomization reduced length error in corpus callosum estimation from 1.89 mm to 0.80 mm (healthy), 10.9 mm to 0.7 mm (pathological), and achieved improved topological consistency (Plana et al., 28 Aug 2025).
  • Causal Fairness Auditing: In recruitment, causal SDGs enabled audit/control of ranking unfairness via traceable bias parameters (e.g., Demographic Parity and rND) (Iommi et al., 20 Nov 2025).

5. Practical Guidelines, Limitations, and Recommendations

Domain-informed synthetic data requires careful balancing of realism, diversity, utility, and tractability. Key technical and practical recommendations include:

  • Start from high-quality, representative seed corpora; indicator or rule quality upper-bounds synthetic data diversity.
  • Multi-model or expert-guided extraction mitigates single-model blind spots.
  • Relax constraints during pre-training, enforcing strict adherence at sampling/inference if zero violation is essential.
  • For tabular and mixed-type data, express constraints as Boolean or functional indicators and automate their integration in both loss and sampling.
  • For privacy or fairness, adopt differential privacy in discriminators to ensure provable guarantees; parameterize sensitive edges in causal/graph models for auditing and controllable bias introduction (Kotal et al., 2024, Kotal et al., 9 Dec 2025, Iommi et al., 20 Nov 2025).
  • Evaluate downstream task performance (classification, calibration, segmentation) in addition to constraint/statistical fidelity; hybridization with real data often maximizes utility (Razmyslovich et al., 19 Mar 2025, Jiang et al., 8 May 2025).
  • For domain randomization, sample augmentation parameters from empirically grounded, narrow intervals reflecting observed or expert-verified ranges (Plana et al., 28 Aug 2025).
  • When scaling to high dimensions or large rule sets, precompute property mappings and vectorize constraint checks to maintain tractability.
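The last recommendation can be sketched concretely: encode the $R$ rules as columns of a Boolean violation matrix and evaluate all records at once. Column meanings and the example rules are illustrative assumptions.

```python
import numpy as np

# Vectorized constraint checking: evaluate every rule over every record
# in one pass, instead of looping over records in Python.
def violation_matrix(records):
    # Illustrative column layout: age, licensed flag, income.
    age, licensed, income = records[:, 0], records[:, 1], records[:, 2]
    return np.stack([
        (age < 18) & (licensed == 1),  # rule 1: minors cannot hold licenses
        income < 0,                    # rule 2: income must be non-negative
    ], axis=1)

data = np.array([[16, 1, 30000],
                 [40, 1, -5],
                 [25, 0, 50000]], dtype=float)

mask = ~violation_matrix(data).any(axis=1)  # records passing every rule
print(mask.tolist())
```

The violation matrix scales as one boolean array of shape (records, rules), so both the training penalty (its mean) and sampling-time filtering (its row-wise `any`) stay tractable as rule sets grow.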

Limitations cited include scalability of constraint checking in high-dimensional or graph-rich data, challenges in manually encoding large or implicit rule sets, the necessity of domain expertise for DAG/ontology construction, and the diminishing returns of additional synthetic data when domain drift or overfitting dominates.

6. Future Directions and Open Challenges

Emerging directions for domain-informed synthesis include:

  • Automated extraction and maintenance of constraint matrices from dynamic knowledge ontologies (Kotal et al., 9 Dec 2025).
  • Extension to federated and decentralized frameworks where domain knowledge is distributed and privacy requirements are stringent (Kotal et al., 2024).
  • Hybrid generative models (e.g., VAE-GAN, diffusion-GAN hybrids) with integrated domain penalties for higher-order, multi-modal, or temporal data (Kotal et al., 2024, Hoseinpour et al., 12 Jun 2025).
  • Automated or reinforcement-based tuning of penalty/constraint hyperparameters to optimally balance diversity and adherence.
  • Unified pipelines for scaling to rare phenomena (rare pathologies, long-tail attacks) and integrating human-in-the-loop validation at high throughput (Plana et al., 28 Aug 2025, Yoon et al., 13 Jan 2026).

7. Impact Across Domains

Domain-informed synthetic data serves as a critical enabler across sensitive, high-value, and data-scarce domains—including privacy-preserving medical and financial analytics, safety-critical systems, domain-specific retrieval and search, and clinical, legal, or regulatory text processing. Its dependence on codified domain signals, rules, and causal structures ensures both practical and principled alignment with real-world constraints, bridging the gap between raw generativity and deployable, trustworthy artificial datasets (Razmyslovich et al., 19 Mar 2025, Hoseinpour et al., 12 Jun 2025, Kotal et al., 2024, Iommi et al., 20 Nov 2025, Kotal et al., 9 Dec 2025).
