Differentially Private Synthetic Distillation

Updated 3 February 2026

DP synthetic distillation decouples private data access from student training by transferring knowledge from a DP-trained teacher to a student model.
Methodologies include DP-SGD, Gaussian mechanisms, and adversarial generator training to produce reliable, privatized synthetic data for various applications.
Empirical results across NLP and vision tasks highlight improved utility over direct DP approaches, achieving higher accuracy and reduced privacy risk.

Differentially private synthetic distillation is a family of algorithmic frameworks for compressing, transferring, and deploying models trained on sensitive data while preserving differential privacy (DP) through synthetic data generation and knowledge distillation. The core paradigm is to train a “teacher” model on private data using rigorous DP mechanisms and then distill its knowledge into a compact “student” model, either by generating and labeling synthetic data or by directly optimizing the student with privatized signals derived from the teacher. This approach decouples access to private data from student training, enabling strong privacy guarantees and improved utility relative to direct DP-SGD training or purely data-free approaches. Methods in this area range from data-free distillation with DP labeling, to dataset distillation using DP feature matching, to federated and decentralized protocols, and they target LLMs, vision backbones, and federated settings.

1. Formal Privacy Model and General Principles

Differentially private synthetic distillation methods universally adopt the $(\varepsilon, \delta)$-differential privacy definition: for any pair of neighboring datasets $D, D'$ differing in a single sample, and for any measurable set of outputs $S$,

$\Pr[\mathcal{M}(D)\in S] \leq e^\varepsilon \Pr[\mathcal{M}(D')\in S] + \delta.$

This formalism constrains the observable outputs of any mechanism $\mathcal{M}$ (e.g., a released model or synthetic dataset) to leak at most the specified privacy budget about individual data points. The architecture of differentially private synthetic distillation methods is built to ensure that all paths from private input to public output are protected:

Any operation over private data (e.g., gradient computation, labeling) is privatized, typically via DP-SGD or the Gaussian mechanism.
Synthetic data generation and subsequent student model training leverage the post-processing property, incurring no further privacy loss.
When multiple DP mechanisms are composed (e.g., generation, feature matching, expert guidance), advanced composition or Gaussian Differential Privacy (GDP) theorems are used to tightly account for the total budget (Shi et al., 13 Nov 2025).
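As a concrete instance of the Gaussian mechanism named in the list above, the following minimal sketch uses the classic calibration $\sigma = \Delta \sqrt{2\ln(1.25/\delta)}/\varepsilon$, which is valid only for $0 < \varepsilon < 1$. The function names and the example query are illustrative assumptions, not taken from the cited works, which generally rely on tighter accountants.

```python
import math
import random


def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise scale from the classic Gaussian-mechanism bound (needs 0 < epsilon < 1)."""
    if not (0.0 < epsilon < 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("classic calibration requires 0 < epsilon < 1 and 0 < delta < 1")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def gaussian_mechanism(values, sensitivity, epsilon, delta, rng):
    """Add i.i.d. Gaussian noise to a vector whose L2 sensitivity is `sensitivity`."""
    sigma = gaussian_sigma(sensitivity, epsilon, delta)
    return [v + rng.gauss(0.0, sigma) for v in values]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical query: a 3-dimensional statistic with L2 sensitivity 1.
    print(gaussian_mechanism([1.0, 2.0, 3.0], 1.0, 0.5, 1e-5, rng))
```

Anything computed from the noisy output afterwards, including student training, is post-processing and does not change the guarantee.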

In all practical deployments, privacy parameters are selected to balance utility and risk, with recommended ranges varying by modality and dataset complexity (Liu et al., 27 Jan 2026, Shi et al., 13 Nov 2025).

2. Algorithmic Frameworks for DP Synthetic Distillation

Several principal frameworks have been established, each with distinct mechanisms for privatizing the knowledge transfer.

a. Teacher-Privatized Synthetic Distillation

A pre-trained teacher $T$ is DP-fine-tuned on the private dataset $D$ using DP-SGD (per-example gradient clipping, Gaussian noise addition, and privacy accounting via the Moments Accountant). Once $T$ is certified as $(\varepsilon,\delta)$-DP with respect to $D$, synthetic data is generated by sampling from $T$; this does not consume additional privacy budget.
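The clipping and noise steps of DP-SGD described above can be sketched on plain Python lists. This is a simplified illustration: the names `dp_sgd_step`, `max_norm`, and `noise_multiplier` are assumptions, and privacy accounting (for example with the Moments Accountant) is deliberately left out.

```python
import math
import random


def clip_to_norm(grad, max_norm):
    """Scale a gradient vector down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0.0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD update: clip each example's gradient, sum, add noise, average."""
    batch_size = len(per_example_grads)
    clipped = [clip_to_norm(g, max_norm) for g in per_example_grads]
    summed = [sum(column) for column in zip(*clipped)]
    sigma = noise_multiplier * max_norm  # noise std on the summed gradient
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch_size for s in summed]
    return [p - lr * g for p, g in zip(params, noisy_mean)]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.1, -0.2]]  # hypothetical per-example gradients
    print(dp_sgd_step([0.0, 0.0], grads, lr=0.1, max_norm=1.0,
                      noise_multiplier=1.0, rng=rng))
```

The resulting $(\varepsilon,\delta)$ depends on the noise multiplier, the sampling rate, and the number of steps, all of which an accountant has to track separately.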

A student model $S$ is then trained on the synthetic data with:

Cross-entropy loss on hard (one-hot) labels,
KL-divergence on soft (probabilistic) teacher outputs,
Optional last-layer hidden state alignment.

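The total student loss is a weighted sum of these terms, $\mathcal{L}_S = \lambda_{\mathrm{CE}}\,\mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{KL}}\,\mathcal{L}_{\mathrm{KL}} + \lambda_{\mathrm{H}}\,\mathcal{L}_{\mathrm{H}}$, with weighting hyperparameters $\lambda_{\mathrm{CE}}, \lambda_{\mathrm{KL}}, \lambda_{\mathrm{H}}$. The entire pipeline is DP by the closure of DP under post-processing: the student is as private as the teacher (Flemings et al., 2024).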
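The three loss terms listed above (hard-label cross-entropy, KL divergence to the teacher's soft outputs, and optional hidden-state alignment) can be combined per example as in the sketch below. The weight names `w_ce`, `w_kl`, `w_hid`, the temperature parameter, and the use of mean squared error for the hidden-state term are assumptions made for illustration, not details confirmed by the source.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - lse for z in scaled]


def student_loss(student_logits, teacher_logits, hard_label,
                 student_hidden=None, teacher_hidden=None,
                 w_ce=1.0, w_kl=1.0, w_hid=0.0, temperature=1.0):
    """Weighted sum of cross-entropy, KL(teacher || student), and hidden-state MSE."""
    ce = -log_softmax(student_logits)[hard_label]
    log_t = log_softmax(teacher_logits, temperature)
    log_s = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
    hid = 0.0
    if student_hidden is not None and teacher_hidden is not None:
        hid = sum((a - b) ** 2 for a, b in zip(student_hidden, teacher_hidden))
        hid /= len(student_hidden)
    return w_ce * ce + w_kl * kl + w_hid * hid


if __name__ == "__main__":
    # Hypothetical logits for a 3-class synthetic example labeled as class 0.
    print(student_loss([2.0, 0.5, -1.0], [1.5, 0.7, -0.8], hard_label=0))
```

Because every input here comes from the DP teacher or from synthetic data, evaluating or minimizing this loss touches no private records.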

b. Data-Free Differentially Private Distillation

Here, access to the original training data is assumed to be completely absent. A generator $G$ is trained adversarially (with the teacher as a fixed discriminator) to synthesize inputs that elicit diverse, confident outputs from the teacher, using cross-entropy, entropy, and activation penalties. After generator optimization, synthetic samples are labeled by the teacher and passed through a (possibly pure) DP-labeling mechanism, such as selective randomized response.

The student is then trained with privatized labels, minimizing KL divergence or cross-entropy. Utility/accuracy remains surprisingly high at moderate privacy budgets, and the generator itself can be made publishable, enabling data augmentation or downstream DP learning (Liu et al., 2023, Liu et al., 2024, Liu et al., 27 Jan 2026).
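To make the label-privatization step concrete, the sketch below implements standard k-ary randomized response, which keeps the true label with probability $e^{\varepsilon}/(e^{\varepsilon}+k-1)$ and otherwise returns one of the other $k-1$ labels uniformly at random. This is a stand-in for illustration only: it is not the selective randomized response variant used in the cited works, and the function names are assumptions.

```python
import math
import random


def keep_probability(epsilon, num_classes):
    """Probability of reporting the true label under k-ary randomized response."""
    e = math.exp(epsilon)
    return e / (e + num_classes - 1)


def randomized_response(label, num_classes, epsilon, rng):
    """Return an epsilon-label-DP version of a single class label."""
    if rng.random() < keep_probability(epsilon, num_classes):
        return label
    # Draw uniformly from the other num_classes - 1 labels.
    other = rng.randrange(num_classes - 1)
    return other if other < label else other + 1


if __name__ == "__main__":
    rng = random.Random(0)
    teacher_labels = [3, 1, 4, 1, 5]  # hypothetical teacher predictions
    print([randomized_response(y, 10, 1.0, rng) for y in teacher_labels])
```

Each label is randomized once and then reused across training epochs, so no composition across epochs is incurred.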

c. Privacy-Preserving Dataset Distillation

In dataset distillation, a small synthetic set is optimized to mimic the training signal or feature statistics of the original data. Differential privacy is achieved by introducing noise at one or more DP-critical bottlenecks:

DP feature matching: compute, perturb, and match class-wise feature means between real and synthetic data, only privatizing signals extracted from the original dataset.
Subspace projection (SER): project signals to an informative subspace to increase signal-to-noise under DP noise constraints.
DP data generation: initialize and guide distillation using a DP-generated large surrogate dataset (Shi et al., 13 Nov 2025, Zheng et al., 3 Aug 2025).
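The DP feature matching item above can be illustrated for a single class as follows: private feature vectors are clipped, summed, and perturbed with Gaussian noise, and the synthetic set is then scored against the noisy mean. This is a simplified sketch; the names, the squared-distance objective, and the treatment of the class size as public are assumptions, and translating `noise_multiplier` into $(\varepsilon,\delta)$ requires an accountant that is not shown.

```python
import math
import random


def dp_class_mean(features, clip_norm, noise_multiplier, rng):
    """Noisy mean of per-example feature vectors for one class.

    Each vector is clipped to L2 norm clip_norm, so one example changes the
    sum by at most clip_norm; Gaussian noise is added to the sum.
    """
    dim = len(features[0])
    total = [0.0] * dim
    for f in features:
        norm = math.sqrt(sum(x * x for x in f))
        scale = min(1.0, clip_norm / norm) if norm > 0.0 else 1.0
        for i, x in enumerate(f):
            total[i] += x * scale
    sigma = noise_multiplier * clip_norm
    n = len(features)  # assumed public here
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]


def matching_loss(noisy_real_mean, synthetic_features):
    """Squared distance between the noisy real mean and the synthetic mean."""
    n = len(synthetic_features)
    syn_mean = [sum(column) / n for column in zip(*synthetic_features)]
    return sum((a - b) ** 2 for a, b in zip(noisy_real_mean, syn_mean))


if __name__ == "__main__":
    rng = random.Random(0)
    real = [[0.9, 0.1], [1.1, -0.1], [1.0, 0.0]]  # hypothetical private features
    target = dp_class_mean(real, clip_norm=2.0, noise_multiplier=0.5, rng=rng)
    print(matching_loss(target, [[1.0, 0.0]]))
```

Only `dp_class_mean` reads private data; optimizing the synthetic features against `matching_loss` is post-processing of its output.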

Privacy budget allocation and advanced privacy accounting are critical to optimize utility—front-loading privacy cost into synthetic data generation and feature extraction proves highly effective (Shi et al., 13 Nov 2025).

d. Federated and Distributed DP Distillation

Federated DP dataset distillation as in SFDD (Arazzi et al., 19 Feb 2025) orchestrates gradient-matching based distillation across distributed clients—each client computes updates privately and only exchanges synthetic data or masked updates. Local Differential Privacy (LDP) is conferred through label obfuscation (LDPO-RLD) or by adding Gaussian noise to client updates, and robust aggregation mitigates inference or backdoor attacks.

3. Technical Components and Mechanism Variants

The performance and privacy guarantees of synthetic distillation are strongly influenced by the specific DP mechanism employed at signal transfer bottlenecks.

Mechanism	DP Guarantee	Utility-Impacting Features
DP-SGD	$(\varepsilon,\delta)$-DP	Gradient clipping, noise on parameter updates
Gaussian Mechanism	$(\varepsilon,\delta)$-DP (RDP/GDP)	Controls on feature/label signals, minimal composition when possible
Laplace Mechanism	$\varepsilon$-DP (pure) on counts/votes	Used in DP ensemble labeling (Ge et al., 2024)
Selective Randomized Response	$\varepsilon$-Label-DP (pure)	Label privacy, single-use, avoids composition (Liu et al., 2024, Liu et al., 27 Jan 2026)
Subspace Projection (SER)	Composed with upstream DP	Amplifies SNR for feature-based matching (Zheng et al., 3 Aug 2025)

Key tricks for utility preservation include projecting matching signals onto public-data-informed subspaces (Zheng et al., 3 Aug 2025), post-processing synthesis to avoid “noise poisoning” the critical optimization, and multi-phase pipelines that re-distill/initialize models using synthetic or DP-generated data (Ngong et al., 2024, Shi et al., 13 Nov 2025).
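The intuition behind subspace projection can be shown with a generic sketch: if an orthonormal basis of a $k$-dimensional subspace is available (for example, estimated from public data), the signal is projected onto it and noise is added only to the $k$ coordinates, so the expected noise energy scales with $k$ rather than with the full dimension. This is not the SER algorithm from the cited work; the basis, names, and noise placement are assumptions for illustration.

```python
import random


def project(basis, x):
    """Coordinates of x in an orthonormal basis given as a list of row vectors."""
    return [sum(b_i * x_i for b_i, x_i in zip(b, x)) for b in basis]


def reconstruct(basis, coords):
    """Map subspace coordinates back to the original space."""
    dim = len(basis[0])
    return [sum(c * b[i] for c, b in zip(coords, basis)) for i in range(dim)]


def noisy_subspace_signal(basis, x, sigma, rng):
    """Project x, add Gaussian noise to the k coordinates only, then map back."""
    coords = project(basis, x)
    noisy = [c + rng.gauss(0.0, sigma) for c in coords]
    return reconstruct(basis, noisy)


if __name__ == "__main__":
    rng = random.Random(0)
    basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # hypothetical public basis
    print(noisy_subspace_signal(basis, [0.8, -0.3, 0.05, 0.02], sigma=0.1, rng=rng))
```

Orthogonal projection never increases the L2 norm of a vector, so a clipping bound applied before projection still bounds the sensitivity afterwards.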

4. Privacy Accounting and Budget Allocation

A central challenge is balancing privacy budget expenditure across multiple DP mechanisms:

When multiple mechanisms operate sequentially (e.g., DP-generated data, DP feature matching, DP expert-guidance), privacy loss composes as the $\ell_2$ sum of GDP parameters, $\mu_{\mathrm{total}} = \sqrt{\sum_i \mu_i^2}$, and is converted to $(\varepsilon,\delta)$-DP via analytical bounds (Shi et al., 13 Nov 2025).
Post-processing property ensures that once data has been privatized (e.g., via DP-SGD on a teacher or a generator), all downstream model training and synthetic data sampling incur zero additional privacy cost.
In label-privacy settings, pure DP is achieved by applying LabelDP mechanisms (randomized response) once per synthetic point and re-using labels—no composition penalty (Liu et al., 2024, Liu et al., 27 Jan 2026).
Careful calibration of noise multipliers and selection of signal bottlenecks (e.g., only privatizing low-dimensional outputs/features rather than full gradients) is crucial for closing the utility gap to non-private baselines (Zheng et al., 3 Aug 2025, Shi et al., 13 Nov 2025).
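The GDP accounting described in the list above can be sketched numerically. A sequence of $\mu_i$-GDP mechanisms composes to $\mu = \sqrt{\sum_i \mu_i^2}$, and a $\mu$-GDP mechanism satisfies $(\varepsilon,\delta)$-DP with $\delta(\varepsilon) = \Phi(-\varepsilon/\mu + \mu/2) - e^{\varepsilon}\,\Phi(-\varepsilon/\mu - \mu/2)$. The function names, the bisection search, and the example budget split are assumptions for illustration and are not taken from the cited works.

```python
import math


def normal_cdf(x):
    """Standard normal CDF, written with erfc for accuracy in the lower tail."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def compose_gdp(mus):
    """GDP composition: the total parameter is the L2 norm of the parts."""
    return math.sqrt(sum(m * m for m in mus))


def gdp_delta(mu, epsilon):
    """delta such that a mu-GDP mechanism is (epsilon, delta)-DP."""
    return (normal_cdf(-epsilon / mu + mu / 2.0)
            - math.exp(epsilon) * normal_cdf(-epsilon / mu - mu / 2.0))


def gdp_epsilon(mu, delta, upper=50.0, iters=100):
    """Smallest epsilon (by bisection) at which gdp_delta(mu, epsilon) <= delta."""
    lo, hi = 0.0, upper
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gdp_delta(mu, mid) > delta:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    # Hypothetical split across generation, feature matching, and expert guidance.
    mu_total = compose_gdp([0.8, 0.5, 0.3])
    print(mu_total, gdp_epsilon(mu_total, delta=1e-5))
```

Because the composition is an $\ell_2$ sum, the stage with the largest $\mu_i$ dominates the total, which is why the choice of where to spend the budget matters.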

5. Empirical Evaluation and Comparative Performance

Differentially private synthetic distillation methods consistently outperform prior privacy-preserving baselines, including both DP-GANs and direct DP-SGD training, across vision and language domains. For example, in text, DistilDP (Flemings et al., 2024) reduces test perplexity by roughly 9 points relative to direct DP-SGD tuning at $\varepsilon = 2$ (32.43 vs. 41.80 on Big Patent). In images, DP-GenG (Shi et al., 13 Nov 2025) reaches 65.5% accuracy on CIFAR-10 at $\varepsilon = 10$, compared to 53.9% for the next-best method (NDPDC).

A summary of characteristic results is below:

Method	Dataset/Task	$\varepsilon$	Utility (Accuracy/PPL/FID)	Key Baseline	Gain (difference from baseline)
DistilDP	Big Patent	2	PPL 32.43	DP-SGD PPL 41.80	$-9.37$ PPL
DPDFD	MNIST	1	95.12%	DP-GAN 40.4%	$+54.7$ pts
DP-GenG	CIFAR-10	10	65.5%	NDPDC 53.9%	$+11.6$ pts
DP-DFD	MNIST	1	97.62%	GS-WGAN 14.32%	$+83.3$ pts
DPSD	FMNIST	1	83.97%	DataLens ~71%	$\approx +13$ pts
DP-SAD	CelebA 64x64	10	FID 11.26	DP-LDM FID 14.3	$\approx -3.0$ FID

Notably, federated protocols as in SFDD (Arazzi et al., 19 Feb 2025) achieve centralized-level accuracy (within 1–2%) using LDPO-RLD label smoothing as the primary privacy mechanism, while robustness to adversarial inference and backdoor attacks is explicitly evaluated.

6. Extensions, Limitations, and Research Directions

Extensions across the literature include:

Ensemble-based teacher architectures with DP aggregation for further improved privacy-utility tradeoffs (Ge et al., 2024).
Self-distillation refinement to mitigate DP-induced degradation in LLMs (Ngong et al., 2024).
Cross-architecture generalization, showing that distilled datasets and models retain much of their performance when deployed to new architectures (Shi et al., 13 Nov 2025).

Limitations, flagged in the cited works, include:

Synthetic data realism can be compromised by DP noise, although front-loading noise into the generator and leveraging subspace projections substantially alleviates this (Zheng et al., 3 Aug 2025, Shi et al., 13 Nov 2025).
Computational overhead is significant for three-player adversarial frameworks and for complex subspace or expert-guided supervision.
For federated techniques, the privacy/robustness balance relies on client plurality and honest client assumptions (Arazzi et al., 19 Feb 2025).

Ongoing research focuses on more efficient signal utilization (e.g., decoupled optimization/sampling), improved budget allocation, and enhanced robustness against modern inference attacks.

7. Application Domains and Impact

Differentially private synthetic distillation is now foundational for deploying high-utility, privacy-preserving models:

In NLP, DistilDP, DPRefine, and related methods enable compressed LLMs for sensitive-domain compliance (Flemings et al., 2024, Ngong et al., 2024).
In computer vision, dataset distillation with DP is applicable for synthetic medical imaging, face attribute recognition, and federated vision learning (Shi et al., 13 Nov 2025, Liu et al., 2024, Arazzi et al., 19 Feb 2025).
The approaches are increasingly being applied in decentralized and cross-device environments to address privacy-by-design requirements in real-world deployments.

With state-of-the-art utility and formal privacy guarantees, differentially private synthetic distillation has become a central tool in the privacy-preserving machine learning ecosystem.