Error Propagation and Model Collapse in Diffusion Models: A Theoretical Study

Published 18 Feb 2026 in stat.ML and cs.LG | (2602.16601v1)

Abstract: Machine learning models are increasingly trained or fine-tuned on synthetic data. Recursively training on such data has been observed to significantly degrade performance in a wide range of tasks, often characterized by a progressive drift away from the target distribution. In this work, we theoretically analyze this phenomenon in the setting of score-based diffusion models. For a realistic pipeline where each training round uses a combination of synthetic data and fresh samples from the target distribution, we obtain upper and lower bounds on the accumulated divergence between the generated and target distributions. This allows us to characterize different regimes of drift, depending on the score estimation error and the proportion of fresh data used in each generation. We also provide empirical results on synthetic data and images to illustrate the theory.


Summary

The paper demonstrates that error accumulation in recursive training can lead to model collapse if fresh data is insufficiently incorporated.
It introduces theoretical upper and lower divergence bounds using Girsanov’s theorem and observability coefficients to quantify score estimation errors.
Empirical validation on Gaussian mixtures and CIFAR-10 highlights the critical role of the fresh data fraction (α) in maintaining model stability.

Introduction and Problem Setting

Machine learning systems increasingly rely on synthetic data, especially in generative model pipelines. A prominent failure mode, termed model collapse, arises when a generative model is recursively trained on its own outputs: distributional mass drifts toward high-density cores while diversity and fidelity erode. Prior theoretical work on recursive training has focused on regression or maximum-likelihood estimators; a comprehensive quantitative analysis for score-based diffusion models has been lacking.

This work presents a rigorous analysis of error propagation and model collapse in recursively trained score-based diffusion models, where each round of training combines synthetic data with a fraction $\alpha$ of fresh data sampled from the true data distribution. The central quantities tracked are:

Accumulated divergence: $D_i = \chi^2(\hat{p}^i \| \mathrm{data})$, measuring model drift from the target distribution at generation $i$,
Intra-generation divergence: $I_i = \chi^2(\hat{p}^{i+1} \| q_i)$, quantifying divergence induced by one training round, where $q_i = \alpha\, \mathrm{data} + (1-\alpha)\, \hat{p}^i$.
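To make the two quantities concrete, here is a minimal sketch on a finite state space, where every distribution is a probability vector and $\chi^2(p \| q) = \sum_x (p(x) - q(x))^2 / q(x)$. The function names and the toy vectors are illustrative assumptions, not taken from the paper, which works with continuous densities.

```python
import numpy as np

def chi2_divergence(p, q):
    """Chi-squared divergence chi^2(p || q) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if np.any(q <= 0):
        raise ValueError("q must be strictly positive on the whole support")
    return float(np.sum((p - q) ** 2 / q))

def training_mixture(p_data, p_model, alpha):
    """Training distribution q_i = alpha * data + (1 - alpha) * model."""
    p_data = np.asarray(p_data, dtype=float)
    p_model = np.asarray(p_model, dtype=float)
    return alpha * p_data + (1.0 - alpha) * p_model

# Toy example with hypothetical numbers.
p_data = np.array([0.25, 0.25, 0.25, 0.25])      # target distribution
p_hat_i = np.array([0.40, 0.30, 0.20, 0.10])     # model at generation i
p_hat_next = np.array([0.35, 0.30, 0.20, 0.15])  # model at generation i + 1

alpha = 0.5
q_i = training_mixture(p_data, p_hat_i, alpha)
D_i = chi2_divergence(p_hat_i, p_data)   # accumulated divergence
I_i = chi2_divergence(p_hat_next, q_i)   # intra-generation divergence
```

Note that $D_i$ is measured against the target, while $I_i$ is measured against the mixture the model was actually trained on.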

The propagation of score estimation errors and their impact on model collapse is characterized using pathwise statistics induced by the diffusion processes. This analysis clarifies how model collapse is mitigated by fresh data and exacerbated by error accumulation.

Theoretical Framework

Recursive Training Dynamics

Each generation begins with a mixture of fresh and synthetic samples. Training a score-based diffusion model on $q_i$ gives a new model $\hat{p}^{i+1}$, which in turn supplies the synthetic data for the next round. This recursion is expressed as:

$\hat{p}^i \rightarrow q_i = \alpha\,\mathrm{data} + (1-\alpha)\hat{p}^i \rightarrow \hat{p}^{i+1}$
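The data flow of this recursion can be sketched in a few lines. As a loud simplification, the "model" below is a one-dimensional Gaussian fitted by sample mean and standard deviation, standing in for a trained diffusion model. The function `recursive_training`, its parameters, and the target $N(0, 1)$ are hypothetical choices made only to illustrate the loop.

```python
import numpy as np

def recursive_training(alpha, n_generations=20, n_samples=1000, seed=0):
    """Toy version of p_hat^i -> q_i -> p_hat^{i+1} with a Gaussian 'model'.

    Assumption: fitting (mean, std) replaces diffusion training; target is N(0, 1).
    Returns the list of fitted (mean, std) pairs, one per generation.
    """
    rng = np.random.default_rng(seed)
    n_fresh = int(round(alpha * n_samples))
    n_synth = n_samples - n_fresh
    mean, std = 0.0, 1.0  # the generation-0 model equals the target
    history = [(mean, std)]
    for _ in range(n_generations):
        fresh = rng.normal(0.0, 1.0, size=n_fresh)   # samples from the target
        synth = rng.normal(mean, std, size=n_synth)  # samples from p_hat^i
        train = np.concatenate([fresh, synth])       # empirical draw from q_i
        # "Training" step: refit the Gaussian to obtain p_hat^{i+1}.
        mean, std = float(train.mean()), float(train.std())
        history.append((mean, std))
    return history

history_low = recursive_training(alpha=0.1)   # mostly synthetic data
history_high = recursive_training(alpha=0.9)  # mostly fresh data
```

Comparing how far the fitted parameters wander from $(0, 1)$ in the two runs gives a rough feel for the role of $\alpha$, although a Gaussian fit has no score estimation error in the sense analyzed here.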

The underlying training target becomes the score of $q_i$, not that of the true data distribution, introducing a structural misalignment that is exacerbated by imperfect score estimation.

Divergence Bounds: Upper and Lower

The intra-generational divergence $I_i$ is controlled from above and below by the pathwise energy of the score error. Two critical results arise:

Upper Bound (via Girsanov's theorem):

$\mathrm{KL}(\hat{p}^{i+1} \| q_i) \leq \frac{1}{2}\hat{\varepsilon}_i^2$

where $\hat{\varepsilon}_i^2$ is the pathwise $L^2$ energy of the score error along the learned process.
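For orientation, bounds of this type usually follow a standard two-step argument. The display below is a sketch under assumed notation that does not appear in this summary: a reverse-time SDE on $[0, T]$ with diffusion coefficient $g(t)$, a learned score $s_\theta$, the exact score $\nabla \log q_{i,t}$ of the noised training distribution, and path measures $\hat{\mathbb{P}}$ (learned process) and $\mathbb{Q}$ (exact reverse process), both started from the same initial law. The paper's normalization of $\hat{\varepsilon}_i^2$ may differ.

```latex
\begin{align*}
\mathrm{KL}(\hat{p}^{i+1} \,\|\, q_i)
  &\le \mathrm{KL}(\hat{\mathbb{P}} \,\|\, \mathbb{Q})
  && \text{(data processing: endpoint marginals of the path measures)} \\
  &= \frac{1}{2}\,\mathbb{E}_{\hat{\mathbb{P}}}\!\left[\int_0^T g(t)^2\,
     \bigl\| s_\theta(X_t, t) - \nabla \log q_{i,t}(X_t) \bigr\|^2 \, dt\right]
  && \text{(Girsanov)} \\
  &=: \frac{1}{2}\,\hat{\varepsilon}_i^2 .
\end{align*}
```

The expectation is taken under the learned process, which matches the description of $\hat{\varepsilon}_i^2$ as an energy along the learned path.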

Lower Bound (with observability):

$\chi^2(\hat{p}^{i+1} \| q_i) \geq \frac{1}{8}\eta_i \varepsilon_{\star,i}^2$

Critically, the observability coefficient $\eta_i \in [0,1]$ measures the fraction of pathwise error that leaves a statistical imprint at the endpoint. $\eta_i$ is typically nonzero in practical parametric models with state-dependent error.

The two-sided control is formalized as: $c_1\,\eta_i\,\varepsilon_{\star,i}^2 \leq \chi^2(\hat{p}^{i+1} \| q_i) \leq c_2\,\varepsilon_{\star,i}^2$ in the perturbative regime where score error is small.

Intergenerational Error Accumulation

The effect of the fresh data fraction $\alpha$ is to contract model divergence at each generation by $(1-\alpha)^2$, while the newly introduced score error increases divergence:

$D_{i+1} = (1-\alpha)^2 D_i + \text{(innovation due to score error)}$

Closed-form analysis reveals:

If $\sum_i \varepsilon_{\star,i}^2 = \infty$, then accumulated divergence never vanishes and model collapse is inevitable.
If $\sum_i \varepsilon_{\star,i}^2 < \infty$, the accumulated divergence remains uniformly bounded.
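The contraction factor $(1-\alpha)^2$ in the recursion can be checked directly from the definition of the $\chi^2$ divergence. Writing $p_{\mathrm{data}}$ for the target density (denoted $\mathrm{data}$ above) and using $q_i - p_{\mathrm{data}} = (1-\alpha)(\hat{p}^i - p_{\mathrm{data}})$:

```latex
\begin{align*}
\chi^2(q_i \,\|\, p_{\mathrm{data}})
  &= \int \frac{\bigl(q_i(x) - p_{\mathrm{data}}(x)\bigr)^2}{p_{\mathrm{data}}(x)}\,dx \\
  &= (1-\alpha)^2 \int \frac{\bigl(\hat{p}^i(x) - p_{\mathrm{data}}(x)\bigr)^2}{p_{\mathrm{data}}(x)}\,dx
   = (1-\alpha)^2\, D_i .
\end{align*}
```

This covers only the mixing step. Relating $D_{i+1}$ to $\chi^2(q_i \| p_{\mathrm{data}})$ and $I_i$ is where the innovation term enters, and that step is not reproduced here.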

The long-term divergence, after $N$ generations, admits a discounted sum structure:

$D_{N+1} \asymp \sum_{k=0}^N (1-\alpha)^{2(N-k)} \varepsilon_{\star,k}^2$

Errors from past generations are exponentially forgotten, with rate determined by $\alpha$.
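A short numerical sketch of the discounted-sum structure follows, under the simplifying assumption that the recursion holds with equality and unit constants, $D_{i+1} = (1-\alpha)^2 D_i + \varepsilon_{\star,i}^2$ with $D_0 = 0$. The relation above is stated only up to constants ($\asymp$), so the numbers are illustrative, and the function names are hypothetical.

```python
import numpy as np

def divergence_by_recursion(alpha, eps_sq):
    """Iterate D_{i+1} = (1 - alpha)^2 * D_i + eps_sq[i], starting from D_0 = 0."""
    rho = (1.0 - alpha) ** 2
    d = 0.0
    for e in eps_sq:
        d = rho * d + e
    return d

def divergence_by_discounted_sum(alpha, eps_sq):
    """Closed form: sum_k (1 - alpha)^(2 (N - k)) * eps_sq[k], N = len(eps_sq) - 1."""
    eps_sq = np.asarray(eps_sq, dtype=float)
    n = len(eps_sq) - 1
    weights = (1.0 - alpha) ** (2 * (n - np.arange(n + 1)))
    return float(np.sum(weights * eps_sq))

# Constant per-generation error: the sum approaches eps^2 / (1 - (1 - alpha)^2).
alpha, eps2 = 0.3, 0.01
eps_sq = np.full(200, eps2)
d_rec = divergence_by_recursion(alpha, eps_sq)
d_sum = divergence_by_discounted_sum(alpha, eps_sq)
d_limit = eps2 / (1.0 - (1.0 - alpha) ** 2)
```

Under this assumed model, unrolling the recursion and evaluating the weighted sum give the same value, and the most recent errors carry the largest weights.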

Numerical Experiments and Empirical Validation

Synthetic Data: Gaussian Mixture

Experiments with 10-dimensional Gaussian mixtures validate the theory. Low $\alpha$ (little fresh data) leads to rapid divergence:

Figure 1: Samples from recursively trained models shown via PCA on a 10D Gaussian mixture; columns increase in $\alpha$ from left to right, rows progress through generations. Low $\alpha$ exhibits fast dispersal/collapse, while high $\alpha$ maintains stability.

Correspondingly, the intra-generational error bounds and the intergenerational accumulation law are empirically tight:

Figure 2: Empirical validation of intra-generational error upper/lower bounds, supporting the tightness of the theoretical predictions.

Figure 3: Two-sided control of intra-generation divergence, showing close agreement between theoretical and observed $\chi^2$ and KL divergences as functions of score error energy.

Figure 4: Memory heatmap visualizes the geometrically-discounted influence of errors from previous generations; sharp diagonal for high $\alpha$ (short memory), wide band for low $\alpha$ (long memory and more persistent collapse).
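The weights behind such a heatmap can be sketched as a lower-triangular matrix whose entry $(n, k)$ is the discount $(1-\alpha)^{2(n-k)}$ applied at generation $n$ to the error from generation $k \le n$. This is an assumed reconstruction from the discounted-sum formula, not the paper's plotting code; `memory_weights`, `memory_length`, and the 1% threshold are hypothetical.

```python
import numpy as np

def memory_weights(alpha, n_generations):
    """Matrix W[n, k] = (1 - alpha)^(2 (n - k)) for k <= n, and 0 for k > n."""
    idx = np.arange(n_generations)
    lag = idx[:, None] - idx[None, :]  # n - k
    rho = (1.0 - alpha) ** 2
    return np.where(lag >= 0, rho ** np.clip(lag, 0, None), 0.0)

def memory_length(alpha, threshold=0.01):
    """Smallest lag at which the weight (1 - alpha)^(2 * lag) is at most `threshold`."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]; alpha = 0 never forgets")
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be in (0, 1)")
    rho = (1.0 - alpha) ** 2
    lag = 0
    while rho ** lag > threshold:
        lag += 1
    return lag

W_high = memory_weights(0.9, 5)  # weights fall off sharply away from the diagonal
W_low = memory_weights(0.1, 5)   # weights decay slowly, giving a wide band
```

With these definitions, `memory_length(0.5)` is 4 and `memory_length(0.1)` is 22, which illustrates how quickly the memory lengthens as the fresh data fraction shrinks.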

Key finding: For high $\alpha$ (e.g., $\alpha = 0.9$), distributional drift is nearly eliminated, and divergence is stable across many recursive generations.

Observability of Score Error

Controlled experiments on CIFAR-10 show that the observability coefficient $\eta_i$ is consistently nonzero for state-dependent perturbations:

Figure 5: Observability coefficients for several classes of perturbations in CIFAR-10, confirming that state-dependent errors are more ‘visible’ and thus lead to statistically significant divergence.

Figure 6: Observability coefficient $\hat{\eta}_i$ remains nonzero across generations (10D Gaussian Mixture), confirming persistent error visibility and hence the relevance of lower divergence bounds.

Visual Effects of Model Collapse

The visual impact of collapse under recursive training is apparent in sample quality and diversity:

Figure 7: Random samples over generations under three $\alpha$ rates in a recursive pipeline; low $\alpha$ leads to rapid mode collapse, high $\alpha$ maintains diversity.

Implications and Outlook

The theoretical construction and empirical results yield three main contributions:

Provable divergence lower bounds for diffusion models via score error observability, demonstrating that error is not hidden but statistically manifest.
Identification of a discounted memory principle: geometric forgetting of past errors with rate set by the fresh data fraction $\alpha$.
Evidence against the naive hypothesis that bounded per-round error always suffices for stability; accumulation can overwhelm contraction if errors are not summable.

Practical implications include principled selection of $\alpha$ to prevent collapse and direct estimation of safe training horizons given per-generation error statistics. The observability framework generalizes to more realistic models and high-dimensional settings, as confirmed with image datasets.
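As a rough illustration of how such a selection might look, the sketch below assumes the simplified recursion $D_{i+1} = (1-\alpha)^2 D_i + \varepsilon^2$ with a constant per-generation error $\varepsilon^2$, unit constants, and $D_0 = 0$. These are assumptions layered on top of the summary, and `min_fresh_fraction`, `safe_horizon`, and the tolerance `tol` are hypothetical names, not quantities defined in the paper.

```python
import math

def min_fresh_fraction(eps_sq, tol):
    """Smallest alpha whose steady state eps_sq / (1 - (1 - alpha)^2) is <= tol.

    Returns None when eps_sq > tol, since even alpha = 1 leaves divergence eps_sq.
    """
    if eps_sq <= 0 or tol <= 0:
        raise ValueError("eps_sq and tol must be positive")
    if eps_sq > tol:
        return None
    return 1.0 - math.sqrt(1.0 - eps_sq / tol)

def safe_horizon(alpha, eps_sq, tol, max_generations=10_000):
    """Number of generations for which D stays <= tol, starting from D_0 = 0.

    Returns max_generations if the tolerance is not exceeded within that many rounds.
    """
    rho = (1.0 - alpha) ** 2
    d = 0.0
    for n in range(max_generations):
        d = rho * d + eps_sq  # divergence after generation n + 1
        if d > tol:
            return n
    return max_generations

alpha_min = min_fresh_fraction(eps_sq=0.01, tol=0.02)
horizon_no_fresh = safe_horizon(alpha=0.0, eps_sq=0.01, tol=0.045)
```

Under these assumptions, a pipeline with no fresh data has a finite horizon because the error simply adds up, whereas any $\alpha$ at or above `alpha_min` keeps the modeled divergence within tolerance indefinitely.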

Theoretical implications include insight into structural sources of collapse, the role of conditional independence and state dependence in error propagation, and the importance of pathwise statistics.

Conclusion

This study establishes rigorous, quantitative links between pathwise score estimation error, error visibility (observability), and model collapse in recursive diffusion model training. The results precisely characterize the interplay between fresh data injection and unavoidable error accumulation, with empirical validation across both synthetic and real data domains. Open questions include analyzing large-error regimes, discrete-time implementations, and characterizing ultimate model fixed points under the recursive process. This framework provides a robust foundation for future developments in the reliable self-improvement of generative models and recursive pipelines.

Figure 8: Observability coefficients on CIFAR-10 show stability across generations and further corroborate nonzero projection of error energy onto the output distribution.