- The paper introduces a scalable workflow integrating modular design, containerization, and computational governance to enable privacy-preserving synthetic data generation.
- It employs a Data Mesh-inspired architecture that supports on-premises, cloud, and hybrid deployments while incorporating rigorous privacy and utility evaluations.
- Empirical results demonstrate sublinear runtime scaling (1.6 min at 1K rows to 16 min at 100K rows), and validation across six use cases shows that the framework meets its compliance requirements, underscoring its practical utility in regulated sectors.
SynthGuard: A Scalable, Privacy-Preserving Framework for Synthetic Data Generation
SynthGuard introduces a workflow-centric framework for synthetic data generation (SDG) that addresses the operational, legal, and technical challenges of privacy-preserving data sharing in sensitive domains. The framework is motivated by the increasing demand for secure, auditable, and scalable SDG, particularly in sectors such as healthcare, finance, and law enforcement, where regulatory compliance and data sovereignty are paramount.
Motivation and Problem Statement
Existing SDG solutions predominantly focus on improving generative models or evaluating the privacy and utility of synthetic data. However, they often neglect the architectural and operational requirements necessary for real-world deployment, such as data sovereignty, computational governance, and compliance with evolving legal standards. Many current approaches require data transfer or external processing, introducing privacy risks and undermining data owner control. SynthGuard is designed to overcome these limitations by enabling data owners to retain full control over SDG workflows, ensuring that sensitive data and computations remain within their jurisdiction.
Requirements Elicitation and Architectural Principles
SynthGuard’s design is grounded in requirements elicited from two large-scale EU research projects (LAGO and TEADAL), encompassing diverse use cases: law enforcement, evidence-based medicine, financial governance, smart viticulture, mobility, and regional planning. The requirements span anonymization, data sovereignty, compliance validation, interoperability, modularity, and scalability.
The architectural model is inspired by Data Mesh principles, emphasizing:
- Domain ownership: Data and computation remain under the control of the data owner.
- Computational governance: Workflows are executed within environments governed by regulatory and operational constraints.
- Data sovereignty: Sensitive data never leaves the environment controlled by the data owner during SDG.
This model supports modular, auditable, and reproducible workflows, facilitating compliance and trust in cross-organizational data sharing.
Framework Implementation
SynthGuard operationalizes these principles through a modular, containerized workflow system built on Nix, Kubernetes, and Kubeflow Pipelines, with Argo Workflows as the pipeline specification standard. The core components include:
- Pipeline Specification: SDG workflows are defined as portable, declarative artifacts (YAML/Argo Workflows), constructed from modular Python components. These specifications encode the full SDG process, including data loading, preprocessing, generation, and evaluation.
- Deployment Models: SynthGuard supports on-premises, cloud, and hybrid deployments, with options for container- or VM-based isolation. This flexibility allows data owners to tailor security and scalability to their operational context.
- Evaluation Mechanisms: Privacy and utility assessments are integrated into the workflow, using metrics such as observed and null propensity score mean squared error (pMSE), Kolmogorov-Smirnov distance, CategoricalCAP, NewRowSynthesis, and TCAP. These evaluations are executed in parallel, leveraging workflow orchestration for scalability.
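As a rough illustration of one of the listed metrics and of parallel evaluation, the sketch below computes the two-sample Kolmogorov-Smirnov distance per column and runs the columns concurrently. It uses only the Python standard library. The function names (`ks_distance`, `evaluate_columns`) and the toy tables are assumptions made for this example and are not SynthGuard's actual components, which run as orchestrated workflow steps rather than threads.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor


def ks_distance(real, synth):
    """Two-sample KS distance: the largest gap between the two empirical CDFs."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synth)
    # The gap between two step functions can only peak at an observed value.
    points = sorted(set(real_sorted) | set(synth_sorted))
    return max(
        abs(
            bisect_right(real_sorted, x) / len(real_sorted)
            - bisect_right(synth_sorted, x) / len(synth_sorted)
        )
        for x in points
    )


def evaluate_columns(real_table, synth_table):
    """Score every numeric column concurrently (hypothetical helper)."""
    columns = list(real_table)
    with ThreadPoolExecutor() as pool:
        scores = pool.map(
            lambda col: ks_distance(real_table[col], synth_table[col]), columns
        )
        return dict(zip(columns, scores))


# Toy data, invented for illustration only.
real = {"age": [20, 30, 40, 50], "income": [1.0, 2.0, 3.0, 4.0]}
synth = {"age": [20, 30, 40, 50], "income": [10.0, 20.0, 30.0, 40.0]}
report = evaluate_columns(real, synth)
```

A distance of 0 means the two empirical distributions coincide, while 1 means they do not overlap at all. In this toy input the `age` column matches exactly and the `income` column is fully disjoint.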
The framework does not introduce new generative models but provides a standardized, auditable environment for integrating existing SDG methods (e.g., CTGAN, rule-based generators) into compliant workflows.
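The idea of a declarative specification that wires existing components together can be sketched in plain Python. This is a minimal stand-in and not the Argo Workflows or Kubeflow Pipelines format: the `SPEC` layout, the component functions, and `run_pipeline` are all hypothetical, and the `generate` step is a toy random sampler standing in for a real generator such as CTGAN.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical declarative spec: step name -> steps it depends on.
SPEC = {
    "load": [],
    "preprocess": ["load"],
    "generate": ["preprocess"],
    "evaluate": ["preprocess", "generate"],
}


def load(_inputs):
    # Toy records, invented for illustration.
    return [{"age": 34}, {"age": 51}, {"age": None}, {"age": 27}]


def preprocess(inputs):
    # Drop rows with missing values.
    return [row for row in inputs["load"] if row["age"] is not None]


def generate(inputs):
    # Placeholder generator: uniform sampling within the observed range.
    # It offers no privacy guarantee and only marks where a real model plugs in.
    rng = random.Random(0)
    ages = [row["age"] for row in inputs["preprocess"]]
    return [{"age": rng.randint(min(ages), max(ages))} for _ in range(5)]


def evaluate(inputs):
    # Share of synthetic values that do not appear in the real data.
    real_ages = {row["age"] for row in inputs["preprocess"]}
    synth = inputs["generate"]
    new = sum(row["age"] not in real_ages for row in synth)
    return {"synthetic_rows": len(synth), "share_new_values": new / len(synth)}


COMPONENTS = {
    "load": load,
    "preprocess": preprocess,
    "generate": generate,
    "evaluate": evaluate,
}


def run_pipeline(spec, components):
    """Run each step after its dependencies, passing their outputs along."""
    outputs = {}
    for step in TopologicalSorter(spec).static_order():
        inputs = {dep: outputs[dep] for dep in spec[step]}
        outputs[step] = components[step](inputs)
    return outputs


outputs = run_pipeline(SPEC, COMPONENTS)
```

Because the specification only names steps and dependencies, swapping the generator means replacing one entry in `COMPONENTS` while the rest of the workflow, including evaluation, stays unchanged.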
Validation and Empirical Results
SynthGuard was validated across six use cases, with iterative feedback from domain experts. The framework was deployed in local, on-premises, and cloud environments, demonstrating adaptability and compliance with data sovereignty requirements. Empirical benchmarks on law enforcement datasets of 1K, 10K, and 100K rows show sublinear scaling of total runtime (1.6 min, 5.1 min, and 16 min, respectively): each tenfold increase in rows raises runtime by only about 3.2 times. At scale, over 90% of runtime is attributed to privacy and quality evaluations. Because these evaluations run as modular, parallel steps, the execution model mitigates this bottleneck, which supports the framework’s scalability claim.
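The sublinear claim can be checked with a back-of-envelope calculation on the three reported data points. The sketch below assumes a power-law model, runtime proportional to rows raised to some exponent, which is an assumption made here for illustration and not a model stated by the authors. An exponent below 1 indicates sublinear scaling.

```python
import math

# Reported benchmark points: dataset size (rows) and total runtime (minutes).
rows = [1_000, 10_000, 100_000]
minutes = [1.6, 5.1, 16.0]


def scaling_exponent(n1, t1, n2, t2):
    """Exponent k in t ~ n**k implied by two (rows, runtime) measurements."""
    return math.log(t2 / t1) / math.log(n2 / n1)


# Exponent between each pair of consecutive dataset sizes.
pairwise = [
    scaling_exponent(rows[i], minutes[i], rows[i + 1], minutes[i + 1])
    for i in range(len(rows) - 1)
]

# Exponent across the full range, from 1K to 100K rows.
overall = scaling_exponent(rows[0], minutes[0], rows[-1], minutes[-1])

print([round(k, 2) for k in pairwise], round(overall, 2))
```

Both steps give an exponent close to 0.5, so over this range total runtime grows roughly with the square root of the row count. With only three measurements this is an indication, not a fitted law.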
SynthGuard’s features were mapped to project requirements, demonstrating fulfillment of anonymization, sovereignty, compliance, interoperability, modularity, and scalability objectives. The framework enables data owners to validate and approve SDG pipelines before execution, ensuring that only compliant synthetic data is shared externally.
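One way to picture the approve-before-execute step is a gate that runs only pipeline specifications whose content hash the data owner has recorded. The sketch below is an assumption about how such a gate could work, not a description of SynthGuard's mechanism. `spec_digest`, `ApprovalRegistry`, and `run_if_approved` are hypothetical names, and the runner is a placeholder callable.

```python
import hashlib
import json


def spec_digest(spec):
    """Stable SHA-256 digest of a pipeline spec, independent of key order."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ApprovalRegistry:
    """Hypothetical record of the specs a data owner has reviewed and approved."""

    def __init__(self):
        self._approved = set()

    def approve(self, spec):
        digest = spec_digest(spec)
        self._approved.add(digest)
        return digest

    def is_approved(self, spec):
        return spec_digest(spec) in self._approved


def run_if_approved(spec, registry, runner):
    """Execute the spec only if this exact content was approved."""
    if not registry.is_approved(spec):
        raise PermissionError("pipeline spec has not been approved by the data owner")
    return runner(spec)


registry = ApprovalRegistry()
approved_spec = {"steps": ["load", "preprocess", "generate", "evaluate"]}
registry.approve(approved_spec)
result = run_if_approved(approved_spec, registry, lambda spec: "executed")
```

Hashing the canonical form means any change to the specification, such as an added export step, yields a different digest and is rejected until the owner reviews it again.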
Implications and Future Directions
SynthGuard advances the operationalization of SDG by providing a practical, governance-oriented framework that aligns with regulatory and organizational requirements. Its modular, auditable design supports the emergence of standardized, reusable SDG pipelines, facilitating responsible data sharing across domains and jurisdictions.
Practical implications include:
- Enabling organizations to deploy SDG workflows without relinquishing control over sensitive data.
- Supporting automated compliance validation and auditability, critical for regulated sectors.
- Providing a foundation for future synthetic data marketplaces, where pipeline specifications become shareable, trustable artifacts.
Theoretical implications include:
- SynthGuard demonstrates the feasibility of integrating Data Mesh principles into SDG, bridging the gap between generative model research and real-world deployment.
- The framework highlights the importance of workflow standardization, modularity, and computational governance in privacy-preserving data ecosystems.
Future work will extend SynthGuard to multi-table relational data, integrate secure multi-party computation and trusted execution environments, and adopt dataspace protocols for broader interoperability. Further automation of compliance checks and performance optimization (e.g., GPU/TPU acceleration) are also planned.
Conclusion
SynthGuard is a step toward scalable, privacy-preserving, and compliant synthetic data generation. By prioritizing data owner control, modularity, and auditability, it addresses the critical challenges of operationalizing SDG in sensitive and regulated environments. The framework’s validation across six use cases and its sublinear runtime scaling on datasets of up to 100K rows underscore its practical utility and potential as a foundation for future developments in privacy-enhancing data sharing.