
Self-Improving Diffusion Models with Synthetic Data

Published 29 Aug 2024 in cs.LG and cs.AI | (2408.16333v1)

Abstract: The AI world is running out of real data for training increasingly large generative models, resulting in accelerating pressure to train on synthetic data. Unfortunately, training new generative models with synthetic data from current or past generation models creates an autophagous (self-consuming) loop that degrades the quality and/or diversity of the synthetic data in what has been termed model autophagy disorder (MAD) and model collapse. Current thinking around model autophagy recommends that synthetic data is to be avoided for model training lest the system deteriorate into MADness. In this paper, we take a different tack that treats synthetic data differently from real data. Self-IMproving diffusion models with Synthetic data (SIMS) is a new training concept for diffusion models that uses self-synthesized data to provide negative guidance during the generation process to steer a model's generative process away from the non-ideal synthetic data manifold and towards the real data distribution. We demonstrate that SIMS is capable of self-improvement; it establishes new records based on the Fréchet inception distance (FID) metric for CIFAR-10 and ImageNet-64 generation and achieves competitive results on FFHQ-64 and ImageNet-512. Moreover, SIMS is, to the best of our knowledge, the first prophylactic generative AI algorithm that can be iteratively trained on self-generated synthetic data without going MAD. As a bonus, SIMS can adjust a diffusion model's synthetic data distribution to match any desired in-domain target distribution to help mitigate biases and ensure fairness.


Summary

  • The paper introduces the SIMS framework, a novel negative guidance method that uses synthetic data to prevent model autophagy disorder during training.
  • The methodology employs a base diffusion model and controlled fine-tuning with synthetic samples, achieving improved performance on benchmarks like CIFAR-10 and ImageNet-64.
  • Experimental results show that SIMS not only achieves state-of-the-art scores but also effectively shifts data distributions to ensure stability and fairness in generative tasks.


The paper "Self-Improving Diffusion Models with Synthetic Data" addresses the challenge of training diffusion models on synthetic data without degrading performance, a failure mode known as Model Autophagy Disorder (MAD). It introduces the Self-IMproving diffusion models with Synthetic data (SIMS) framework, which leverages synthetic data to improve model performance without inducing model collapse.

Introduction

Diffusion models have become a popular approach for generative tasks, but they require large amounts of real data for training. As real data becomes scarce, synthetic data generated by existing models is increasingly used for training, which can lead to Model Autophagy Disorder (MAD), where model performance deteriorates over successive generations. SIMS proposes a training approach that incorporates synthetic data so as to steer the model away from synthetic-data pitfalls and maintain performance.

Figure 1: Self-IMproving diffusion models with Synthetic data (SIMS) simultaneously improves diffusion modeling and synthesis performance while acting as a prophylactic against Model Autophagy Disorder (MAD).

Methodology

SIMS Framework

SIMS introduces a method of negative guidance using synthetic data. Instead of treating synthetic data as equivalent to real data, SIMS uses this data to steer the generation process away from non-ideal synthetic distributions.

The SIMS procedure involves the following steps:

  1. Train Base Diffusion Model: Train a base model on real data to obtain a score function.
  2. Generate Auxiliary Synthetic Data: Create synthetic samples using this base model.
  3. Train Auxiliary Diffusion Model: Fine-tune the base model with the synthetic data, but within a controlled training budget.
  4. Extrapolate the Score Function: Use the scores from both the base and auxiliary models to guide the diffusion process, with a parameter ω controlling the strength of the guidance.

This approach allows for self-improvement by refining the model's generative process using its own outputs, moving it closer to the desired real-world distribution while avoiding the accumulation of errors typical in synthetic training loops.
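The score extrapolation in step 4 can be sketched as follows. The exact functional form and the name `extrapolated_score` are illustrative assumptions based on the paper's description of negative guidance: the auxiliary (synthetic-data) score is subtracted from the base (real-data) score, scaled by ω, pushing generation away from the synthetic manifold.

```python
import numpy as np

def extrapolated_score(s_base: np.ndarray, s_aux: np.ndarray, omega: float) -> np.ndarray:
    """Negative guidance: extrapolate past the base model's score,
    away from the auxiliary (synthetic-data) model's score.
    omega = 0 recovers the base model unchanged."""
    return s_base + omega * (s_base - s_aux)

# Toy check on two-dimensional score vectors.
s_b = np.array([1.0, -2.0])   # base model score at some (x, t)
s_a = np.array([0.5, -1.0])   # auxiliary model score at the same (x, t)
print(extrapolated_score(s_b, s_a, 0.0))  # -> [ 1.  -2. ], identical to s_b
print(extrapolated_score(s_b, s_a, 1.0))  # -> [ 1.5 -3. ], pushed away from s_a
```

At sampling time this combined score would replace the base model's score inside an otherwise standard diffusion sampler.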

Implementation Details

  • Training Budget: Controls the extent of fine-tuning on synthetic data to maintain effectiveness.
  • Synthetic Dataset Size: Must be balanced to ensure the auxiliary model represents a valid contrast without overwhelming the real data.
  • Guidance Parameter (ω): Critical in balancing the influence of the base and auxiliary models.
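To build intuition for how ω trades off the two models, consider a one-dimensional toy (an illustrative construction, not from the paper) where both models have Gaussian score functions with equal variance. The score of N(μ, σ²) at x is (μ − x)/σ², so extrapolating the two scores yields another Gaussian-shaped score whose implied mean is (1 + ω)·μ_base − ω·μ_aux: larger ω pushes the effective distribution further away from the auxiliary model's mean.

```python
import numpy as np

def gaussian_score(x, mu, sigma):
    # Score (gradient of the log-density) of N(mu, sigma^2) at x.
    return (mu - x) / sigma**2

mu_base, mu_aux, sigma = 0.0, 0.4, 1.0  # auxiliary model drifted toward synthetic data
for omega in (0.0, 0.5, 2.0):
    x0 = -3.0
    s0 = gaussian_score(x0, mu_base, sigma) + omega * (
        gaussian_score(x0, mu_base, sigma) - gaussian_score(x0, mu_aux, sigma))
    # The guided score is still of the form (mu_eff - x) / sigma^2,
    # so the implied mean can be read off at any single point x0.
    mu_eff = x0 + s0 * sigma**2
    print(omega, mu_eff)  # mu_eff = (1 + omega)*mu_base - omega*mu_aux
```

Here ω = 0 leaves the base mean untouched, while increasing ω moves the effective mean in the direction opposite the auxiliary model's drift, which is the qualitative behavior the guidance parameter controls.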

Experimental Results

Self-Improvement

SIMS demonstrates significant improvements over conventional training. The paper reports new state-of-the-art FID scores on CIFAR-10 and ImageNet-64, underscoring its effectiveness in improving model performance using synthetic data.

Figure 2: Distribution shifting with SIMS.

MAD Prevention

Contrary to standard training practices that lead to model collapse under MAD, SIMS effectively prevents this degradation over iterative training rounds. Importantly, SIMS maintains or even improves performance levels, setting new benchmarks in preventing MAD with synthetic data augmentation.

Distribution Shifting

Beyond self-improvement, SIMS can adjust its output distribution to match any desired in-domain target distribution, which is crucial for addressing biases and ensuring fairness in generated data.

Figure 3: Examples of distribution shifting capabilities in SIMS.

Discussion

SIMS offers a robust alternative to existing synthetic data training regimes by effectively managing the quality and bias issues associated with synthetic data. It achieves this by using a unique negative guidance methodology, ensuring the diffusion models remain aligned with real data distributions.

The implications of SIMS are significant for the future of training generative models, where real data scarcity and the influx of synthetic data are prominent challenges. As SIMS can iteratively enhance performance without succumbing to MAD, it represents a promising direction for sustainable model training practices.

Conclusion

The paper showcases SIMS as a pioneering approach for leveraging synthetic data in training diffusion models while safeguarding against MAD. By treating synthetic data as a guide rather than a direct substitute for real data, SIMS enables ongoing model improvement and stability across generative tasks. This methodology not only improves the robustness of models but also paves the way for fair and unbiased synthetic data generation, vital for future AI applications.
