- The paper introduces SIMS, a negative-guidance framework that uses synthetic data to steer training away from Model Autophagy Disorder (MAD).
- The method trains a base diffusion model on real data, fine-tunes an auxiliary copy on the base model's own synthetic samples under a controlled budget, and extrapolates between the two score functions at sampling time, achieving improved performance on benchmarks like CIFAR-10 and ImageNet-64.
- Experimental results show that SIMS not only achieves state-of-the-art scores but can also shift its output distribution toward a desired in-domain target, which helps mitigate bias in generated data.
Self-Improving Diffusion Models with Synthetic Data
The paper "Self-Improving Diffusion Models with Synthetic Data" addresses the challenge of training diffusion models on synthetic data without the progressive performance degradation known as Model Autophagy Disorder (MAD). It introduces the Self-IMproving diffusion models with Synthetic data (SIMS) framework, which turns synthetic data into a guidance signal that improves model performance rather than causing model collapse.
Introduction
Diffusion models have become a popular approach for generative tasks, but they require large amounts of real training data. As real data becomes scarce, synthetic data generated by existing models is increasingly folded back into training, which can trigger Model Autophagy Disorder (MAD): performance deteriorates over successive generations of models trained on their own outputs. SIMS proposes a training approach that incorporates synthetic data as negative guidance, steering the model away from synthetic-data pitfalls while maintaining performance.
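The self-consuming degradation that MAD describes can be illustrated with a toy analogue far simpler than a diffusion model: repeatedly fitting a Gaussian to samples drawn from the previous generation's fit. The sketch below is purely illustrative and is not from the paper; the fitted variance collapses over generations because each maximum-likelihood fit slightly underestimates the spread, and these errors compound.

```python
import random

def mle_var(data):
    """Maximum-likelihood variance estimate (divides by n, so biased low)."""
    n = len(data)
    mu = sum(data) / n
    return sum((x - mu) ** 2 for x in data) / n

def refit_and_resample(data):
    """Fit a Gaussian to the data, then draw a fresh dataset from the fit."""
    n = len(data)
    mu = sum(data) / n
    sigma = mle_var(data) ** 0.5
    return [random.gauss(mu, sigma) for _ in range(n)]

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(50)]
initial_var = mle_var(data)

for _ in range(2000):  # 2000 self-consuming generations
    data = refit_and_resample(data)

final_var = mle_var(data)
# final_var is a tiny fraction of initial_var: the fitted distribution
# has collapsed, a toy analogue of MAD in generative model loops.
```

Each generation shrinks the expected variance by roughly a factor of (n-1)/n, so with no fresh real data the distribution eventually degenerates; SIMS is designed to break exactly this kind of loop.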
Figure 1: Self-IMproving diffusion models with Synthetic data (SIMS) simultaneously improves diffusion modeling and synthesis performance while acting as a prophylactic against Model Autophagy Disorder (MAD).
Methodology
SIMS Framework
SIMS introduces negative guidance with synthetic data. Rather than treating synthetic samples as interchangeable with real ones, SIMS uses them to steer the generation process away from the model's own non-ideal synthetic distribution.
The SIMS procedure involves the following steps:
- Train Base Diffusion Model: Train a base model on real data to obtain a score function.
- Generate Auxiliary Synthetic Data: Create synthetic samples using this base model.
- Train Auxiliary Diffusion Model: Fine-tune the base model with the synthetic data, but within a controlled training budget.
- Extrapolate the Score Function: Use the scores from both the base and auxiliary models to guide the diffusion process, with a parameter ω controlling the strength of the guidance.
This approach allows for self-improvement by refining the model's generative process using its own outputs, moving it closer to the desired real-world distribution while avoiding the accumulation of errors typical in synthetic training loops.
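One plausible reading of the extrapolation step above, sketched with hypothetical score-function signatures (the paper's actual implementation lives inside a full diffusion codebase), combines the two scores linearly, pushing along the base score and away from the auxiliary one:

```python
import numpy as np

def sims_score(score_base, score_aux, x, t, omega):
    """Extrapolated score for SIMS-style negative guidance.

    score_base, score_aux: callables (x, t) -> score estimate, standing in
    for the base model and the synthetic-fine-tuned auxiliary model
    (hypothetical signatures, not the paper's API).
    omega: guidance strength; omega = 0 recovers the base model exactly.
    """
    s_b = score_base(x, t)
    s_a = score_aux(x, t)
    # Move along the base score, away from the auxiliary (synthetic) score.
    return s_b + omega * (s_b - s_a)
```

At each sampler step the denoiser would use this extrapolated score in place of the base score; a larger omega steers generation further from the auxiliary model's synthetic distribution.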
Implementation Details
- Training Budget: Limits how long the auxiliary model is fine-tuned on synthetic data, so it captures the synthetic distribution's artifacts without drifting too far from the base model.
- Synthetic Dataset Size: Must be large enough for the auxiliary model to form a meaningful contrast with the base model, without overwhelming the real-data signal.
- Guidance Parameter (ω): Sets how strongly sampling is pushed away from the auxiliary model and, hence, from the synthetic distribution.
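Tying these knobs together, the overall procedure might be organized as in the sketch below. All names here are placeholders chosen to mirror the bullet points above, not the paper's code:

```python
from dataclasses import dataclass

@dataclass
class SIMSConfig:
    # Hypothetical knob names mirroring the implementation details above.
    synthetic_set_size: int = 50_000  # size of the auxiliary synthetic set
    finetune_budget: int = 10_000     # fine-tuning steps for the auxiliary model
    omega: float = 1.0                # negative-guidance strength at sampling

def run_sims(train, sample, finetune, real_data, cfg):
    """Placeholder pipeline: train base, synthesize, fine-tune on a budget.

    train, sample, finetune are injected callables standing in for a real
    diffusion training stack; step 4 (score extrapolation with cfg.omega)
    happens later, at sampling time, using both returned models.
    """
    base = train(real_data)                               # step 1: base model
    synthetic = sample(base, cfg.synthetic_set_size)      # step 2: auxiliary data
    aux = finetune(base, synthetic, cfg.finetune_budget)  # step 3: budgeted fine-tune
    return base, aux
```

Structuring the knobs as an explicit config makes the three trade-offs above easy to sweep independently in experiments.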
Experimental Results
Self-Improvement
SIMS delivers measurable gains over standard training. The paper reports new state-of-the-art generation scores on popular benchmarks such as CIFAR-10 and ImageNet-64, demonstrating that a model's own synthetic outputs, used as negative guidance, can improve its performance.

Figure 2: Distribution shifting with SIMS.
MAD Prevention
Whereas naively retraining on self-generated data leads to model collapse under MAD, SIMS prevents this degradation across iterative training rounds. Importantly, performance is maintained or even improved, setting a new benchmark for safe synthetic data augmentation.
Distribution Shifting
Beyond self-improvement, SIMS can adjust its output distribution to match any desired in-domain distribution, which is crucial for addressing biases and ensuring fairness in generated data.

Figure 3: Examples of distribution shifting capabilities in SIMS.
Discussion
SIMS offers a robust alternative to existing synthetic data training regimes by effectively managing the quality and bias issues associated with synthetic data. It achieves this by using a unique negative guidance methodology, ensuring the diffusion models remain aligned with real data distributions.
The implications of SIMS are significant for the future of training generative models, where real data scarcity and the influx of synthetic data are prominent challenges. As SIMS can iteratively enhance performance without succumbing to MAD, it represents a promising direction for sustainable model training practices.
Conclusion
The paper showcases SIMS as a pioneering approach for leveraging synthetic data in training diffusion models while safeguarding against MAD. By treating synthetic data as a negative guide rather than a direct substitute for real data, SIMS enables ongoing model improvement and stability across generative tasks. This methodology not only improves the robustness of models but also paves the way for fair and unbiased synthetic data generation, vital for future AI applications.