
Controllable Synthetic Data Pipeline

Updated 10 February 2026
  • Controllable Synthetic Data Pipeline is a modular framework that enables precise generation, manipulation, and validation of synthetic datasets for diverse machine learning applications.
  • It leverages advanced techniques such as latent variable models, diffusion processes, and promptable LLMs to offer granular control over data attributes and privacy-utility trade-offs.
  • Empirical evaluations demonstrate improved model accuracy, fairness, and robustness through simulation, privacy-preserving modules, and comprehensive downstream validations.

A controllable synthetic data pipeline is a modular, parameter-driven framework enabling precise generation, manipulation, and validation of synthetic datasets for a broad array of machine learning tasks. These pipelines are designed to maximize data utility for downstream learning, empirical evaluation, and regulatory compliance, while giving users explicit control over the generation process, structural properties, and privacy/utility trade-offs. Representative systems span latent variable models (e.g., VAE + diffusion), promptable LLMs, controllable image-generation workflows, and hybrid simulation–annotation loops. Below, the principal design principles are illustrated with reference to state-of-the-art research.

1. Pipeline Architectures and Modular Components

Controllable synthetic data pipelines share a staged architecture, typically encompassing: data preprocessing, flexible model-based generation, controllable sampling or manipulation, and standard downstream task validation.

For structured data, DiffLM exemplifies a VAE–diffusion–LLM pipeline. Real structured data (tables, code, JSON APIs) are canonicalized into text/JSON. A trainable transformer encoder produces mean and variance vectors per sample, and latent variables z representing an input x are drawn via the reparameterization z = μ(x) + σ(x) ⊙ ε. The decoder is a frozen LLM (e.g., Mistral-7B) conditioned on "soft prompts" derived from z through a learned MLP. A diffusion model further regularizes the latent space, ensuring that generations sample the true posterior over latent variables rather than a simplistic Gaussian prior. Generation is achieved by decoding from soft prompts representing sampled or interpolated latents, decoupling distribution learning from LLM generative objectives and sidestepping ad hoc prompt engineering (Zhou et al., 2024).
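The latent sampling and soft-prompt conditioning described above can be sketched as follows. This is a minimal NumPy illustration only: the shapes, the single affine "MLP", and all parameter values are hypothetical, not DiffLM's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (the VAE reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def soft_prompt(z, W, b):
    """Map a latent vector to soft-prompt embeddings; a real pipeline would use a
    learned MLP, here reduced to one affine layer with tanh for illustration."""
    return np.tanh(z @ W + b)

latent_dim, prompt_dim = 16, 32          # hypothetical sizes
mu = np.zeros(latent_dim)                # per-sample encoder outputs (stand-ins)
log_var = np.zeros(latent_dim)
W = rng.standard_normal((latent_dim, prompt_dim)) * 0.1
b = np.zeros(prompt_dim)

z = reparameterize(mu, log_var, rng)     # latent representing one input x
p = soft_prompt(z, W, b)                 # embeddings the frozen LLM would decode from
```

In the full pipeline, `p` would be prepended to the frozen LLM's input embeddings; only the encoder, diffusion model, and this mapping are trained.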

Image- and vision-oriented workflows, such as PUG and CtrlSynth, rely on high-level asset repositories, parameter-controlled simulation environments (e.g., Unreal Engine), programmatic scene configuration via APIs, and annotation modules to capture all control variables (object, pose, lighting, background) as ground truth. In CtrlSynth, images and captions are decomposed into atomic tags (objects, attributes, relations), edit operators manipulate content at the tag level, and composition engines (LLMs, diffusion models) synthesize new captions or images accordingly (Bordes et al., 2023, Cao et al., 2024).
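The programmatic scene-configuration step can be sketched as a factor sampler that emits a JSON spec for the rendering engine. The factor names and value lists below are hypothetical placeholders, not the actual PUG or CtrlSynth asset catalogs.

```python
import json
import random

FACTORS = {  # hypothetical factor spaces; real pipelines enumerate engine assets
    "object":     ["chair", "mug", "lamp"],
    "pose_deg":   range(0, 360, 45),
    "lighting":   ["day", "dusk", "studio"],
    "background": ["indoor", "outdoor", "plain"],
}

def sample_scene(rng):
    """Draw one scene configuration; every sampled factor doubles as a
    ground-truth annotation because the renderer is driven by this dict."""
    return {name: rng.choice(list(space)) for name, space in FACTORS.items()}

rng = random.Random(0)
scene = sample_scene(rng)
spec = json.dumps(scene)  # would be sent to the rendering engine's API
```

Because the renderer consumes exactly this spec, the annotation module can log it verbatim as perfectly aligned ground truth.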

Medical data pipelines such as RoentGen-v2 employ latent diffusion models conditioned on both text (clinical impression) and structured patient meta-data (age, sex, race/ethnicity), using fine-grained cross-attention-based conditioning to maintain clinical plausibility and demographic fairness (Moroianu et al., 22 Aug 2025).

For privacy-centric domains, SynthGuard integrates containerized pipelines with explicit governance, policy hooks, and modular privacy-preserving techniques. Each module is a composable container, supporting isolation, RBAC, policy validation, and audit logging (Brito et al., 14 Jul 2025).

2. Control Mechanisms and Parameterization

Achieving controllability requires exposure of application-relevant knobs at multiple levels:

  • Latent variable manipulation: In latent-diffusion pipelines, specific dimensions of z can be fixed, interpolated, or randomized to effect attribute control (e.g., enforce class labels, interpolate styles, adjust diversity/fidelity). For instance, fixing part of the latent vector in DiffLM steers synthetic generation toward a target class or subgroup (Zhou et al., 2024).
  • Structured prompt engineering: Attribute pipes such as SIG build prompts from pre-sampled attribute tuples (race, gender, age, pose). ControlNets are used to lock pose or line-art features during image synthesis (Nzalasse et al., 2024).
  • Scene and factor sampling: In simulation-based pipelines, all factors (object identity, pose, lighting, camera intrinsics/extrinsics, background) are sampled or set according to user-defined policies or distributions. PUG uses uniform or grid sampling across factor space for rigorous OOD control (Bordes et al., 2023).
  • Policy-defined edit operators: Systems like CtrlSynth allow explicit removal, addition, or replacement of semantic elements (tags) in image/text, supporting compositional reasoning tasks (Cao et al., 2024).
  • Demographic and semantic disentanglement: RoentGen-v2 injects independent embeddings for each demographic feature into its cross-attention layers, ensuring controlled synthesis without latent leakage between demographic axes (Moroianu et al., 22 Aug 2025).
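As a concrete sketch of latent variable manipulation (the first bullet above), one can pin a subset of latent dimensions to a class code while sampling the rest, and interpolate between latents for style blending. The dimension indices and class-code convention here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def steered_latents(n, dim, class_code, fixed_idx, rng):
    """Sample n latents but pin the dimensions in fixed_idx to class_code,
    steering generation toward a target class (hypothetical index convention)."""
    z = rng.standard_normal((n, dim))
    z[:, fixed_idx] = class_code
    return z

def interpolate(z_a, z_b, alpha):
    """Linear interpolation between two latents, e.g. for style blending."""
    return (1 - alpha) * z_a + alpha * z_b

Z = steered_latents(4, 8, class_code=2.0, fixed_idx=[0, 1], rng=rng)
```

Decoding each row of `Z` through the frozen decoder would yield samples that vary freely in the unpinned dimensions while sharing the target attribute.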

This parameterization enables not only data balancing (e.g., uniform sampling over demographic axes) but precise measurement of robustness and OOD generalization for downstream models.
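Uniform balancing over demographic axes can be implemented by enumerating the full factorial grid of attribute tuples and emitting the same number of prompts per cell. The axis names and values below are illustrative, not taken from any specific pipeline.

```python
import itertools
import random

AXES = {  # hypothetical demographic axes, as in attribute-tuple prompt pipelines
    "age":  ["20-39", "40-59", "60+"],
    "sex":  ["female", "male"],
    "view": ["PA", "AP"],
}

def balanced_tuples(axes, per_cell, rng):
    """Enumerate the full factorial grid and emit per_cell prompt tuples per cell,
    guaranteeing exact uniformity across every axis combination."""
    cells = list(itertools.product(*axes.values()))
    tuples = [dict(zip(axes, cell)) for cell in cells for _ in range(per_cell)]
    rng.shuffle(tuples)
    return tuples

prompts = balanced_tuples(AXES, per_cell=2, rng=random.Random(0))
# 3 * 2 * 2 = 12 cells, 2 prompts each -> 24 balanced attribute tuples
```

Each tuple would then be formatted into a text prompt (or conditioning vector), so the resulting dataset is exactly uniform along every controlled axis.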

3. Generative Models and Training Procedures

Controllable pipelines deploy a range of generative models, often chained or hybridized:

| Modality | Example Pipeline | Encoder | Latent/Controllable Layer | Decoder / Synthesis |
|---|---|---|---|---|
| Structured | DiffLM (Zhou et al., 2024) | BERT-like | VAE + latent diffusion | Frozen LLM + soft prompt injection |
| Image/Visual | PUG (Bordes et al., 2023) | Direct config | Scene-parameter vector | Unreal Engine / BlenderProc renderer |
| Multimodal | CtrlSynth (Cao et al., 2024) | ML classifier, tagger | Tag set + edit policies | LLM + diffusion model |
| Medical Img | RoentGen-v2 (Moroianu et al., 22 Aug 2025) | CLIP ViT-L | Diffusion with prompt conditioning | U-Net diffusion, demographic cross-attn |
| Identity Face | SIG (Nzalasse et al., 2024) | Prompt + ControlNet | Text, pose mask | Stable Diffusion + ControlNets |

Training typically proceeds in multiple decoupled stages. For example, DiffLM first trains the VAE encoder/decoder with an annealed β-VAE loss, then fits a diffusion model in the learned latent space, followed by learning the soft prompt mapping MLP while keeping the LLM frozen. No RLHF, prompt-tuning, or modification of pretrained model weights is needed outside of the injection modules.
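The annealed β-VAE stage can be sketched with a simple linear KL-warmup schedule. Linear warmup is a common choice for KL annealing; the exact schedule DiffLM uses may differ.

```python
def beta_schedule(step, warmup_steps, beta_max=1.0):
    """Linearly anneal the KL weight from 0 to beta_max over warmup_steps,
    then hold it constant (a standard beta-VAE warmup)."""
    return beta_max * min(1.0, step / warmup_steps)

def vae_loss(recon_nll, kl, step, warmup_steps):
    """Stage-1 objective: reconstruction NLL plus the annealed KL term."""
    return recon_nll + beta_schedule(step, warmup_steps) * kl
```

Annealing keeps the KL term from dominating early training, which would otherwise collapse the latent space before the decoder learns to use it.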

For simulation pipelines, annotation is directly output by the rendering engine, enabling perfectly-aligned ground truth for supervised models.

4. Empirical Gains and Applications

Empirical evaluation across domains shows the effect of controllable pipelines:

  • Structured data: DiffLM synthetic data yields classifier/regressor accuracy improvements of 2–7% on tabular tasks compared to real data, with distribution statistics (KS, TV) matching real- or state-of-the-art synthetic benchmarks. For code, pass@1 metrics improved by ~7% over base models and outperformed strong open-source code models such as CodeLLaMA 7B (Zhou et al., 2024).
  • Radiographic imaging: RoentGen-v2’s synthetic pretraining increases downstream AUROC by +6.5% (vs. +2.7% for naive synthetic mixing), while simultaneously reducing fairness gaps (underdiagnosis gap -19.3%) (Moroianu et al., 22 Aug 2025).
  • Representation learning: PUG-controlled datasets enable rigorous quantification of distributional shift and OOD accuracy drops; e.g., ResNet50 test accuracy falls from 99% to ~80% on held-out backgrounds (Bordes et al., 2023).
  • Face recognition: SIG pipeline for ControlFace10k produces embedding distributions matching real-world BUPT datasets, with stratified mated/non-mated scores and demographic balance for bias assessment (Nzalasse et al., 2024).
  • Task-specific domains: CCUP for cloth-changing person ReID achieves mAP/R1 performance exceeding non-synthetic or baseline-synthetic pretraining by substantial margins (+10 pts mAP on PRCC for TransReID; +5.6 pts R1 for FIRe²), attributed to granularity of cloth control and camera diversity (Zhao et al., 2024).

These pipelines enable domain transfer, compositional augmentation, and robust evaluation previously unattainable with static, real-world datasets.

5. Privacy, Auditability, and Compliance

Pipelines intended for regulated applications implement modular privacy-preserving layers (e.g., SynthGuard). Synthetic data generation (SDG) workflows are decomposed into containerized pipeline steps, enabling the plug-in of DP mechanisms, TEEs, or MPC protocols. Computational governance is enforced via container isolation, policy enforcement points, signed artifacts, and execution audit logs. Data never leaves jurisdiction, and reports include automated utility and risk metrics (e.g., k-anonymity, pMSE, Kolmogorov–Smirnov distance). Compliance modules for regional data protection standards (GDPR, Data Act) can be attached as final validation gates (Brito et al., 14 Jul 2025).
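One of the cited utility metrics, the Kolmogorov–Smirnov distance between a real and a synthetic marginal, can be computed directly as the maximum gap between the two empirical CDFs:

```python
import bisect

def ks_distance(real, synth):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute gap
    between the empirical CDFs of the two samples."""
    real_s, synth_s = sorted(real), sorted(synth)
    d = 0.0
    for x in real_s + synth_s:  # the gap can only be maximal at a sample point
        f_r = bisect.bisect_right(real_s, x) / len(real_s)
        f_s = bisect.bisect_right(synth_s, x) / len(synth_s)
        d = max(d, abs(f_r - f_s))
    return d
```

A value near 0 indicates closely matched marginals; production pipelines would typically use a library implementation such as `scipy.stats.ks_2samp`, which also returns a p-value.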

Frameworks such as the Select–Generate–Audit approach allow explicit choice of which statistics may be reproduced (supporting task-specific utility), and black-box auditing is implemented as a two-sample test over excluded statistics to empirically confirm lack of privilege escalation or privacy leakage (Houssiau et al., 2022).
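A black-box two-sample audit over an excluded statistic can be sketched as a permutation test: if the generator did not (over)use the excluded statistic, its value should not separate real from synthetic samples. This is a generic construction, not the exact procedure of Houssiau et al.

```python
import random

def permutation_test(stat_fn, a, b, n_perm, rng):
    """Black-box two-sample audit: permutation p-value for whether stat_fn
    differs between a reference sample a and synthetic output b."""
    observed = abs(stat_fn(a) - stat_fn(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(stat_fn(pa) - stat_fn(pb)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing keeps p > 0

mean = lambda xs: sum(xs) / len(xs)
# Clearly separated samples should yield a small p-value:
p_value = permutation_test(mean, [0.0] * 20, [10.0] * 20, n_perm=199, rng=random.Random(0))
```

A small p-value flags that the excluded statistic leaks into the synthetic data; a large one is consistent with the statistic having been withheld as intended.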

6. Limitations, Adaptation, and Future Directions

Common challenges and future opportunities include:

  • Sim-to-real gap: Although domain randomization and photorealistic rendering reduce distributional mismatch, a residual gap typically persists. For example, synthetic-to-real transfer in visual assembly control exhibited a ~0.2 mAP drop even with realistic simulation (Werheid et al., 16 Sep 2025).
  • Tagging-model dependencies: Multimodal control quality depends on the tagging or encoding ability available in the relevant domain (CtrlSynth). Weak classifiers or unreliable prompts can constrain attainable controllability (Cao et al., 2024).
  • Domain adaptation: Adapting to a new domain may require: (i) new encoders/decoders (e.g., fine-tuning a VAE on small gold sets), (ii) reconfiguring annotation/scene assets, or (iii) composing novel prompt templates. Proper modularity keeps the required code changes minimal (Zhou et al., 2024, Bordes et al., 2023).
  • Compute and data scaling: Diffusion and LLM-based generation is compute-intensive, especially on million-scale image/text pair synthesis; closed-loop filtering partially mitigates quality risks (Cao et al., 2024).
  • Evaluation complexity: For privacy and utility, the choice of utility/validation metrics and privacy criteria remains context- and regulator-dependent. Pipelines supporting statistical auditing and custom utility hooks are favored for regulated environments (Brito et al., 14 Jul 2025, Houssiau et al., 2022).

Continued development is underway to extend control to more modalities (audio, video), integrate learned critics for sample selection, and provide compositional model-based control far beyond single-attribute or prompt engineering paradigms.

7. Summary Table: Exemplars of Controllable Synthetic Data Pipelines

| Pipeline | Domain | Control Mechanisms | Key Model Types | Notable Features |
|---|---|---|---|---|
| DiffLM | Structured data, code | Latent diffusion + MLP | VAE + diffusion + LLM | Modular latent decoupling, soft prompts |
| RoentGen-v2 | Medical imaging | Text prompt, demographics | Latent diffusion, CLIP | Fine-grained demographic control, fairness supervision |
| PUG | Vision | Scene factors, JSON API | Unreal/BlenderProc | Photorealistic rendering, factorized sampling |
| CtrlSynth | Vision & multimodal | Tag edits, templates | LLM, CLIP, diffusion | Policy-driven tag composition, closed-loop filtering |
| SynthGuard | Regulatory data | Pipeline DAG, audit | Containerized modules | Native DP/privacy/compliance, owner governance |
| CCUP | Person ReID | Identity, cloth, camera | MakeHuman, Unreal, detector | High granularity, automatic labeling, scalability |
| SIG | Face evaluation | Demographic prompts, mask | Stable Diffusion/ControlNet | Balanced synthetic faces, fairness benchmarking |

These systems define the contemporary landscape in fully automatic, testable, and fine-grained synthetic data generation for model development and validation across highly diverse research domains.
