Universality of the $π^2/6$ Pathway in Avoiding Model Collapse
Abstract: Researchers in empirical machine learning recently spotlighted fears of so-called model collapse. They studied a discard workflow, in which an initial generative model is trained on real data, the real data are then discarded, and the model generates synthetic data on which a new model is trained. They concluded that models degenerate as model-fitting generations proceed. Other researchers instead considered an augment workflow, in which the original real data continue to be used in each generation of training, augmented by synthetic data from models fit in all earlier generations. Empirical results on canonical datasets and learning procedures confirmed that model collapse occurs under the discard workflow and is avoided under the augment workflow. Under the augment workflow, theoretical evidence also confirmed avoidance in particular instances; specifically, Gerstgrasser et al. (2024) showed that for classical linear regression, test risk at any later generation is bounded by a moderate multiple, viz. $\pi^2/6$, of the test risk of training on the original real data alone. Some commentators questioned the generality of theoretical conclusions based on the generative model assumed by Gerstgrasser et al. (2024): could similar conclusions be reached for other task/model pairings? In this work, we demonstrate the universality of the $\pi^2/6$ augment risk bound across a large family of canonical statistical models, offering key insights into exactly why collapse happens under the discard workflow and is avoided under the augment workflow. Along the way, we provide a framework that accommodates a large variety of workflows (beyond discard and augment), enabling an experimenter to judge the comparative merits of multiple workflows by simulating a simple Gaussian process.
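The mechanism behind the two workflows can be seen in a toy Monte Carlo. The sketch below is an illustrative assumption, not the paper's construction: it replaces linear regression with the simplest canonical model, estimating a Gaussian mean, where each generation fits the sample mean and then generates $n$ synthetic points from the fitted model. Under discard, estimation errors accumulate like a random walk, so excess risk grows roughly linearly in the number of generations; under augment, the $k$-th generation's fresh error enters with weight $1/k$, so the risk ratio is governed by $\sum_k 1/k^2 \le \pi^2/6$. The function name `simulate` and all parameters are hypothetical.

```python
import random

def simulate(workflow, generations=50, n=100, sigma=1.0, trials=200, seed=0):
    """Monte Carlo of iterative Gaussian-mean refitting (true mean 0).

    Each generation fits the sample mean theta_hat, then draws n synthetic
    points from N(theta_hat, sigma^2). 'discard' retrains only on the newest
    synthetic batch; 'augment' pools the real data plus all synthetic batches.
    Returns the mean excess risk theta_hat^2 over trials, relative to the
    single-generation baseline sigma^2 / n.
    """
    rng = random.Random(seed)
    risks = []
    for _ in range(trials):
        pool = [rng.gauss(0.0, sigma) for _ in range(n)]  # real data
        theta = sum(pool) / len(pool)
        for _ in range(generations - 1):
            synth = [rng.gauss(theta, sigma) for _ in range(n)]
            if workflow == "discard":
                pool = synth          # forget all earlier data
            else:
                pool.extend(synth)    # keep real + all prior synthetic
            theta = sum(pool) / len(pool)
        risks.append(theta ** 2)
    baseline = sigma ** 2 / n
    return (sum(risks) / trials) / baseline

# Discard: the risk ratio grows roughly linearly with the generation count.
# Augment: the risk ratio stays near sum_{k<=T} 1/k^2, below pi^2/6 ~ 1.645.
```

In this toy setting the augment recursion is exactly $\theta_T = \theta_{T-1} + \eta_T/T$ with $\eta_T \sim N(0, \sigma^2/n)$, which is the source of the $\sum 1/k^2$ bound the abstract refers to.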
- Self-consuming generative models go MAD. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/forum?id=ShjMHfmPs0.
- Self-improving diffusion models with synthetic data. arXiv preprint arXiv:2408.16333, 2024b.
- One step to efficient synthetic data. arXiv preprint arXiv:2006.02397, 2020.
- On the stability of iterative retraining of generative models on their own data. arXiv preprint arXiv:2310.00429, 2023.
- On the stability of iterative retraining of generative models on their own data. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=JORAfH2xFd.
- Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems, 33:9912–9924, 2020.
- Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 9650–9660, 2021.
- A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. PMLR, 2020.
- Model collapse demystified: The case of regression. arXiv preprint arXiv:2402.07712, 2024.
- Beyond model collapse: Scaling up with synthesized data requires reinforcement. arXiv preprint arXiv:2406.07515, 2024a.
- A tale of tails: Model collapse as a change of scaling laws. In ICLR 2024 Workshop on Navigating and Addressing Data Problems for Foundation Models, 2024b.
- Self-consuming generative models with curated data provably optimize human preferences. arXiv preprint arXiv:2407.09499, 2024.
- Is model collapse inevitable? Breaking the curse of recursion by accumulating real and synthetic data. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=5B2K4LRgmz.
- Will large-scale generative models corrupt future datasets? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 20555–20565, 2023.
- Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9729–9738, 2020.
- Scaling laws for learning with real and surrogate data. arXiv preprint arXiv:2402.04376, 2024.
- Collapse or thrive? perils and promises of synthetic data in a self-generating world. arXiv preprint arXiv:2410.16713, 2024.
- Lucien Le Cam. Locally asymptotically normal families of distributions. Certain approximations to families of distributions and their use in the theory of estimation and testing hypotheses. Univ. California Publ. Statist., 3:37, 1960.
- Testing statistical hypotheses, volume 3. Springer, 1986.
- Lightly-AI. LightlySSL. https://github.com/lightly-ai/lightly, 2023. Accessed: Oct 1, 2024.
- Heat death of generative models in closed-loop learning. arXiv preprint arXiv:2404.02325, 2024.
- Combining generative artificial intelligence (AI) and the Internet: Heading towards evolution or degradation? arXiv preprint arXiv:2303.01255, 2023a.
- Towards understanding the interplay of generative artificial intelligence and the internet. In International Workshop on Epistemic Uncertainty in Artificial Intelligence, pp. 59–73. Springer, 2023b.
- Self-distillation amplifies regularization in hilbert space. Advances in Neural Information Processing Systems, 33:3351–3361, 2020.
- How bad is training on synthetic data? A statistical analysis of language model collapse. arXiv preprint arXiv:2404.05090, 2024.
- The curse of recursion: Training on generated data makes models forget. arXiv preprint arXiv:2305.17493, 2023.
- AI models collapse when trained on recursively generated data. Nature, 631(8022):755–759, 2024.
- Aad W Van der Vaart. Asymptotic statistics, volume 3. Cambridge university press, 2000.