Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World

Published 22 Oct 2024 in cs.LG and cs.AI (arXiv:2410.16713v4)

Abstract: What happens when generative machine learning models are pretrained on web-scale datasets containing data generated by earlier models? Some prior work warns of "model collapse" as the web is overwhelmed by synthetic data; other work suggests the problem can be contained (i.e. collapse can be avoided) by managing how available data are used in pretraining. In this paper, we report experiments on three ways of using data (training-workflows), across three generative model task-settings (multivariate Gaussian estimation, kernel density estimation, and language-model fine-tuning) to further confirm the possibility of containment: (a) we confirm that the training-workflow of 'replacing' all real data by successive generations of purely synthetic data indeed suffers model collapse in all task-settings studied; (b) we consider the training-workflow of 'accumulating' synthetic data alongside real data and training on all data combined, and confirm that, although the proportion of real data eventually becomes zero, models remain stable and their test losses do not diverge under this training-workflow; (c) we consider a training-workflow where real and synthetic data accumulate together but successive generations of pretraining are constrained to use fixed-size data subsets each generation. In this workflow, we observe slow and gradual rather than explosive degradation of test loss performance across generations. Our insights are particularly important when forecasting whether future frontier generative models will collapse or thrive, and our results open avenues for empirically and mathematically studying the context-dependent value of synthetic data.

Citations (3)

Summary

  • The paper demonstrates that accumulating both real and synthetic data prevents model collapse, while solely replacing data degrades performance.
  • It validates findings through multivariate Gaussian modeling, kernel density estimation, and supervised fine-tuning of language models.
  • The results provide a practical framework for dataset construction and highlight future research paths to optimize synthetic data integration.

Overview of "Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World"

The paper "Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World" investigates the consequences of training generative machine learning models on large datasets that include synthetic data produced by earlier models. It addresses the critical question of whether future models will suffer from degradation, known as model collapse, or if they will continue to improve.

Key Scenarios Analyzed

The authors focus on three training workflows: 'replace,' 'accumulate,' and a compromise termed 'accumulate-subsample.' In the 'replace' workflow, each generation is trained exclusively on synthetic data produced by its predecessor, and models degrade over successive generations. Conversely, in the 'accumulate' workflow, each new model is trained on all available real and synthetic data, which avoids collapse and maintains performance across iterations. The 'accumulate-subsample' workflow accumulates data in the same way but imposes a fixed compute budget by constraining each generation to a fixed-size training subset; here, test loss on real data is higher than under 'accumulate' and degrades slowly and gradually, rather than diverging as under 'replace.'
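
The following Python sketch (not the authors' code) illustrates how the training pool for the next generation could be assembled under the three workflows; `sample` is a hypothetical stand-in for drawing points from the current-generation model, and the parameter names are illustrative.

```python
# Minimal sketch of the three training-workflows described above.
# `sample` is any callable that draws n points from the current model.
import random

def next_training_set(workflow, real_data, old_synthetic, sample, n_per_gen, budget=None):
    new_synthetic = sample(n_per_gen)                      # data generated by the current model
    if workflow == "replace":
        return new_synthetic                               # discard everything else
    combined = list(real_data) + list(old_synthetic) + list(new_synthetic)
    if workflow == "accumulate":
        return combined                                    # keep every point ever seen
    if workflow == "accumulate-subsample":
        return random.sample(combined, budget)             # fixed-size subset each generation
    raise ValueError(f"unknown workflow: {workflow!r}")
```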

Methodologies and Evidence

The paper examines these scenarios in the context of several generative modeling tasks, including multivariate Gaussian modeling, kernel density estimation (KDE), and supervised fine-tuning of LLMs. In all settings, empirical and mathematical analyses consistently demonstrate that accumulating data prevents collapse, whereas replacing data leads to performance degradation. This points to a broader phenomenon in which retaining past data stabilizes model outputs, and it suggests a practical framework for constructing future training datasets.
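
The multivariate Gaussian setting is simple enough to reproduce as a toy. The script below is a hedged reconstruction, assuming the simplest form of the experiment (maximum-likelihood fitting each generation); sample sizes and generation counts are illustrative, not the paper's.

```python
# Toy comparison of "replace" vs "accumulate" for iterative Gaussian fitting.
import numpy as np

rng = np.random.default_rng(0)
dim, n, n_gens = 5, 50, 50
real = rng.standard_normal((n, dim))                 # "real" data drawn from N(0, I)

for workflow in ("replace", "accumulate"):
    pool = [real]
    train = real
    for _ in range(n_gens):
        mu = train.mean(axis=0)                      # fit the current-generation Gaussian
        cov = np.cov(train, rowvar=False)
        synthetic = rng.multivariate_normal(mu, cov, size=n)
        if workflow == "replace":
            train = synthetic                        # only the newest synthetic data
        else:
            pool.append(synthetic)                   # real data plus all synthetic generations
            train = np.vstack(pool)
    # Under "replace" the fitted covariance tends to shrink toward zero (collapse);
    # under "accumulate" its trace stays near `dim`.
    print(workflow, "trace of final fitted covariance:", round(np.trace(cov), 3))
```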

Numerical Findings

The paper supports these claims with numerical results. For instance, in the kernel density estimation setting, test loss grows across generations when prior data are replaced but remains stable when data accumulate. The authors also show that synthetic data can sometimes improve test loss under the 'accumulate' workflow, highlighting its nuanced, context-dependent role in model training.
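
A hedged reconstruction of that KDE comparison is sketched below: it tracks test loss as the average negative log-likelihood on held-out real data under 'replace' versus 'accumulate.' Bandwidth choice (SciPy's default), sample sizes, and generation counts are assumptions, not the paper's exact protocol.

```python
# Toy KDE self-generation loop, comparing "replace" and "accumulate".
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
real_train = rng.standard_normal(200)
real_test = rng.standard_normal(1000)

def test_nll(kde, x):
    return -np.mean(np.log(kde(x) + 1e-12))          # average negative log-likelihood

for workflow in ("replace", "accumulate"):
    pool = [real_train]
    train = real_train
    for g in range(10):
        kde = gaussian_kde(train)                     # fit the current-generation model
        synthetic = kde.resample(200, seed=g)[0]      # sample a new synthetic generation
        if workflow == "replace":
            train = synthetic
        else:
            pool.append(synthetic)
            train = np.concatenate(pool)
    print(workflow, "test NLL after 10 generations:", round(test_nll(kde, real_test), 3))
```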

Cardinality vs. Proportion of Real Data

An exploration of the cardinality and proportion of real data further reveals the complex interaction between real and synthetic data in preventing model collapse. Preliminary results suggest that both the absolute number and the proportion of real data significantly influence outcomes, with synthetic data sometimes improving test loss when real data are scarce.
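
The kind of grid experiment this paragraph alludes to can be sketched as follows: vary the number of real points and the number of synthetic points independently and record test loss for each combination, separating the effect of the cardinality of real data from the effect of its proportion. The synthetic source below (a KDE fit to a separate real sample) is purely an assumption made so the script is self-contained; it is not the paper's generator.

```python
# Grid over (number of real points, number of synthetic points) with test NLL per cell.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
real_test = rng.standard_normal(2000)
synthetic_source = gaussian_kde(rng.standard_normal(500))   # stand-in generator

def test_nll(kde, x):
    return -np.mean(np.log(kde(x) + 1e-12))

for n_real in (10, 50, 250):
    for n_syn in (0, 50, 500):
        real = rng.standard_normal(n_real)
        syn = synthetic_source.resample(n_syn, seed=0)[0] if n_syn else np.empty(0)
        kde = gaussian_kde(np.concatenate([real, syn]))
        print(f"n_real={n_real:4d}  n_syn={n_syn:4d}  test NLL={test_nll(kde, real_test):.3f}")
```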

Theoretical and Practical Implications

The findings have substantial implications. Theoretically, they provide clarity on the dynamics of model-data feedback loops in generative models, challenging prior assumptions about model collapse inevitability. Practically, the insights direct future strategies for dataset construction, particularly emphasizing the retention and accumulation of data to enhance model robustness and accuracy.

Future Directions

The paper proposes several future research directions, such as optimizing the use of synthetic data alongside filtering techniques and developing robust removal methods for detrimental data. These pathways could significantly improve the efficiency and quality of model training and application.

Overall, this paper contributes valuable insights into the dynamics of synthetic data in AI model training, offering a framework to predict and guide the development of future generative models.
