Diffusion Curriculum: Synthetic-to-Real Data Curriculum via Image-Guided Diffusion

Published 17 Oct 2024 in cs.CV and cs.AI | (2410.13674v4)

Abstract: Low-quality or scarce data has posed significant challenges for training deep neural networks in practice. While classical data augmentation cannot contribute very different new data, diffusion models open up a new door to build self-evolving AI by generating high-quality and diverse synthetic data through text-guided prompts. However, text-only guidance cannot control synthetic images' proximity to the original images, resulting in out-of-distribution data detrimental to the model performance. To overcome the limitation, we study image guidance to achieve a spectrum of interpolations between synthetic and real images. With stronger image guidance, the generated images are similar to the training data but hard to learn. With weaker image guidance, the synthetic images are easier for the model but contribute to a larger distribution gap with the original data. The generated full spectrum of data enables us to build a novel "Diffusion Curriculum (DisCL)". DisCL adjusts the image guidance level of image synthesis for each training stage: it identifies and focuses on hard samples for the model and assesses the most effective guidance level of synthetic images to improve hard data learning. We apply DisCL to two challenging tasks: long-tail (LT) classification and learning from low-quality data. It focuses on lower-guidance images of high quality to learn prototypical features as a warm-up for learning higher-guidance images that might be weak on diversity or quality. Extensive experiments showcase a gain of 2.7% and 2.1% in OOD and ID macro-accuracy when applying DisCL to the iWildCam dataset. On ImageNet-LT, DisCL improves the base model's tail-class accuracy from 4.4% to 23.64% and leads to a 4.02% improvement in all-class accuracy.

Summary

The paper introduces a novel diffusion-guided curriculum that progressively integrates synthetic and real data to enhance model robustness.
It employs both non-adaptive and adaptive scheduling strategies to dynamically adjust data integration based on model performance.
Quantitative evaluations show significant accuracy improvements through iterative synthetic sample filtering using CLIPScore criteria.

Introduction

The paper "Diffusion Curriculum: Synthetic-to-Real Data Curriculum via Image-Guided Diffusion" (2410.13674) explores a strategy for improving model performance by systematically combining synthetic and real-world data through a curriculum built on image-guided diffusion models. The approach targets the performance gap commonly observed between models trained on synthetic data and models trained on real data. It does so by blending both data types, generated and controlled through diffusion, within a curriculum learning paradigm.

Methodology

The core methodology revolves around constructing a synthetic dataset that is guided by real-world data. The authors employ a diffusion model $\epsilon_{\theta}$ to generate synthetic counterparts of hard samples, that is, real samples that a pre-trained model $P_\theta$ finds difficult to classify. The synthetic dataset is generated at varying image guidance levels $\lambda_i \in [0, 1)$, with lower values yielding more synthetic-like data and higher values yielding data closer to the real images. The synthetic data is then filtered by CLIPScore to ensure quality and relevance.
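The summary does not say how the guidance level $\lambda$ is realized inside the diffusion model. One common mechanism, assumed here purely for illustration, is SDEdit-style partial noising: the real image is noised part of the way and then denoised, so a higher $\lambda$ leaves fewer denoising steps and keeps the output closer to the real image. The function name `denoising_steps` and the linear mapping are hypothetical and are not taken from the paper.

```python
def denoising_steps(guidance: float, total_steps: int) -> int:
    """Number of reverse-diffusion steps to run for a given guidance level.

    Assumption (not from the paper summary): SDEdit-style partial noising
    with a linear mapping. guidance = 0 starts from pure noise (text-only
    generation); guidance close to 1 keeps the output near the real image.
    """
    if not 0.0 <= guidance < 1.0:
        raise ValueError("guidance must lie in [0, 1)")
    # Always run at least one step so the output is never the untouched input.
    return max(1, round((1.0 - guidance) * total_steps))


for lam in (0.0, 0.3, 0.5, 0.7, 0.9):
    steps = denoising_steps(lam, total_steps=50)
    print(f"lambda={lam:.1f} -> {steps} denoising steps")
```

Under this assumption, the half-open interval $[0, 1)$ has a natural reading: $\lambda = 0$ is pure text-guided generation, while $\lambda = 1$ would simply return the real image and is therefore excluded.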

A curriculum learning strategy is applied in two forms: non-adaptive and adaptive. The non-adaptive strategy uses a predefined schedule for integrating synthetic data into the training set, while the adaptive strategy adjusts data integration dynamically based on model performance on a validation set. This multi-stage training exposes the model to progressively more challenging data, both synthetic and real.

Implementation Details

Synthetic Data Generation

Identify Hard Samples: Compute the predicted probability $p_i$ for each sample in the original dataset. Collect samples with $p_i < h_{\text{hard}}$ as the set of hard samples $\mathcal{H}$.
Generate Synthetic Samples:
- For each hard sample, extract the corresponding text prompt $t_i$.
- Use the diffusion model $\epsilon_{\theta}$ to generate multiple synthetic samples for each hard image at varying guidance levels $\lambda_j$.
Filter Synthetic Samples: Retain only samples with acceptable CLIPScores to ensure high-quality synthetic data.
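The three steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the diffusion model and CLIP are not called, the CLIPScore values are hard-coded placeholders, and the names `SyntheticSample`, `find_hard_samples`, and `filter_by_clip_score` as well as both thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SyntheticSample:
    source_id: int     # index of the hard real sample it was generated from
    guidance: float    # image guidance level lambda_j in [0, 1)
    clip_score: float  # placeholder for a precomputed CLIPScore


def find_hard_samples(probs, threshold):
    """Indices whose predicted probability p_i is below the hardness threshold."""
    return [i for i, p in enumerate(probs) if p < threshold]


def filter_by_clip_score(samples, min_score):
    """Keep only synthetic samples whose CLIPScore reaches min_score."""
    return [s for s in samples if s.clip_score >= min_score]


# Step 1: predicted probabilities from the pre-trained model (made-up values).
probs = [0.92, 0.31, 0.75, 0.12]
hard = find_hard_samples(probs, threshold=0.5)

# Step 2: one synthetic sample per hard image and guidance level.
# A real pipeline would call the diffusion model here; scores are placeholders.
guidance_levels = [0.3, 0.5, 0.7, 0.9]
placeholder_scores = [0.18, 0.27, 0.31, 0.35]
candidates = [
    SyntheticSample(source_id=i, guidance=g, clip_score=s)
    for i in hard
    for g, s in zip(guidance_levels, placeholder_scores)
]

# Step 3: drop low-scoring generations.
kept = filter_by_clip_score(candidates, min_score=0.25)
print(f"hard samples: {hard}, kept {len(kept)} of {len(candidates)} candidates")
```

Storing the guidance level on each retained sample is a design choice of this sketch; it lets a later training stage select data by $\lambda_j$ without regenerating images.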

Training with Curriculum

Curriculum Initialization: Start training using only hard samples and a mix of synthetic samples driven by a predefined or adaptive schedule.
Predefined Schedule: Use a pre-established progression of synthetic-to-real data (non-adaptive).
Adaptive Schedule: Adjust the difficulty dynamically based on performance metrics from a validation set, selecting samples that improve model accuracy the most.
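The two scheduling styles can be contrasted in a short sketch. It assumes that the curriculum is expressed as one guidance level per training stage, that the predefined schedule moves from low to high guidance (synthetic to real), and that the adaptive variant picks the level with the largest measured validation gain. The function names and the gain values are hypothetical and not taken from the paper.

```python
def non_adaptive_schedule(guidance_levels, num_stages):
    """Predefined schedule: walk guidance levels from low to high,
    spreading them as evenly as possible over the training stages."""
    levels = sorted(guidance_levels)
    return [
        levels[min(stage * len(levels) // num_stages, len(levels) - 1)]
        for stage in range(num_stages)
    ]


def adaptive_choice(guidance_levels, validation_gain):
    """Adaptive schedule, one stage: pick the guidance level whose synthetic
    data yields the largest validation gain. validation_gain is a callable
    mapping a guidance level to a measured improvement."""
    return max(guidance_levels, key=validation_gain)


levels = [0.3, 0.5, 0.7, 0.9]
print(non_adaptive_schedule(levels, num_stages=8))

# Made-up validation gains for the current model state.
gains = {0.3: 0.01, 0.5: 0.04, 0.7: 0.02, 0.9: -0.01}
print(adaptive_choice(levels, gains.get))
```

In a full training loop, `adaptive_choice` would be re-evaluated at every stage with fresh validation measurements, whereas the non-adaptive schedule is fixed before training starts.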

Results

Quantitative evaluations show gains in robustness and accuracy when synthetic data generated with image-guided diffusion is integrated through curriculum learning. On iWildCam, DisCL improves OOD and ID macro-accuracy by 2.7% and 2.1%, respectively. On ImageNet-LT, it raises the base model's tail-class accuracy from 4.4% to 23.64% and improves all-class accuracy by 4.02%. The adaptive curriculum strategy outperforms the non-adaptive approach, which underscores the value of adjusting data integration to the model's current performance.

Implications and Future Work

This research suggests that diffusion models can benefit data-driven AI by emulating challenging real-world scenarios through controlled synthetic data generation. The adaptive curriculum strategy offers a promising path to continued model improvement by presenting the model with training data whose difficulty increases gradually.

Future work may focus on scaling this approach across different domains and model architectures, exploring the impacts of different diffusion parameters, and applying similar strategies to other data modalities such as audio or text.

Conclusion

The paper presents a comprehensive framework that combines diffusion models with curriculum learning to bridge the synthetic-real data performance gap. Through innovative strategies in data generation and training progression, this research advances the integration of complex synthetic data in machine learning applications, setting the stage for more resilient and versatile AI models.