Synthetic Image Learning: Preserving Performance and Preventing Membership Inference Attacks
Abstract: Generative artificial intelligence has transformed the generation of synthetic data, providing innovative solutions to challenges like data scarcity and privacy, which are particularly critical in fields such as medicine. However, the effective use of this synthetic data to train high-performance models remains a significant challenge. This paper addresses this issue by introducing Knowledge Recycling (KR), a pipeline designed to optimise the generation and use of synthetic data for training downstream classifiers. At the heart of this pipeline is Generative Knowledge Distillation (GKD), the proposed technique that significantly improves the quality and usefulness of the information provided to classifiers through a synthetic dataset regeneration and soft labelling mechanism. The KR pipeline has been tested on a variety of datasets, with a focus on six highly heterogeneous medical image datasets, ranging from retinal images to organ scans. The results show a significant reduction in the performance gap between models trained on real and synthetic data, with models based on synthetic data outperforming those trained on real data in some cases. Furthermore, the resulting models show almost complete immunity to Membership Inference Attacks, manifesting privacy properties missing in models trained with conventional techniques.
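The soft labelling idea mentioned in the abstract can be illustrated with a small sketch. The abstract does not say how GKD produces its soft labels, so everything here is an assumption for illustration: the soft labels are taken to be temperature-softened outputs of a hypothetical teacher classifier, and the names `softmax`, `soft_label_cross_entropy`, `teacher_logits`, and the temperature value are not from the paper. The sketch only shows how a downstream classifier's loss can be computed against a full class distribution instead of a one-hot label.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax with an optional temperature (assumed, not from the paper)."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def soft_label_cross_entropy(student_logits, soft_labels):
    """Mean cross-entropy between soft target distributions and student predictions."""
    log_probs = np.log(softmax(student_logits) + 1e-12)
    return float(-(soft_labels * log_probs).sum(axis=1).mean())


# Hypothetical teacher logits for 3 synthetic images and 4 classes.
teacher_logits = np.array([
    [4.0, 1.0, 0.5, 0.2],
    [0.3, 3.5, 0.1, 1.0],
    [1.0, 1.2, 2.8, 0.4],
])

# Soft labels keep the teacher's relative confidence across all classes.
soft_labels = softmax(teacher_logits, temperature=2.0)

# Hard labels keep only the top class, for comparison.
hard_labels = np.eye(4)[teacher_logits.argmax(axis=1)]

# An untrained student that predicts a uniform distribution.
student_logits = np.zeros((3, 4))

loss_soft = soft_label_cross_entropy(student_logits, soft_labels)
loss_hard = soft_label_cross_entropy(student_logits, hard_labels)
```

In a training loop, `soft_label_cross_entropy` would take the place of the usual one-hot cross-entropy, so the classifier is fitted to a distribution over classes for each synthetic image.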