Diffusion Models Convergence
- Convergence of diffusion models is characterized by quantitative error bounds linking the number of reverse steps, the data dimensionality, and the score estimation accuracy.
- Recent advances employ rigorous SDE/ODE and CTMC frameworks under minimal assumptions to analyze both continuous and discrete state-space processes.
- The theoretical results yield practical iteration complexities that adapt to manifold structure and information-theoretic measures of the data, sharpening guarantees on generative accuracy.
Diffusion models are a class of generative models in which samples are generated by reversing a noising process via integration of a suitable SDE, ODE, or CTMC, guided by score functions that estimate the gradient of the log-density of time-marginals. The convergence of diffusion models refers to quantitative bounds for the error—typically measured in total variation (TV), Kullback-Leibler (KL) divergence, or Wasserstein distance—between the distribution output by the model and the true data distribution, as a function of the number of sampling (reverse) steps, the data dimension, and the accuracy of the estimated scores. Recent research has established increasingly sharp, assumption-minimal theories for both continuous and discrete state-spaces, with particular attention to dimensional dependence, score estimation error, discretization artifacts, and adaptability to structure such as low-dimensional manifolds.
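As a concrete illustration of the error metrics used throughout (not taken from any cited paper), a minimal sketch computing total variation and KL divergence between two discrete distributions, together with Pinsker's inequality relating them:

```python
import math

def total_variation(p, q):
    """TV distance between two discrete distributions (equal-length lists)."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    """KL(p || q); assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
tv = total_variation(p, q)   # 0.5 * (0.1 + 0.1 + 0.0) = 0.1
kl = kl_divergence(p, q)
# Pinsker's inequality: TV <= sqrt(KL / 2), so KL bounds imply TV bounds.
```

This is why KL bounds (as in several results below) immediately yield TV bounds of the square-root order.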
1. Problem Formulation: Forward/Reverse Dynamics and Score Parameterization
Diffusion generative modeling begins with a forward process that perturbs data into noise, and a reverse process designed to reconstruct data from noise by inverting the forward dynamics. In $\mathbb{R}^d$, a standard forward SDE is the variance-preserving (VP) Ornstein–Uhlenbeck process
$$dX_t = -X_t\,dt + \sqrt{2}\,dB_t,$$
with marginals distributed as $X_t \overset{d}{=} e^{-t}X_0 + \sqrt{1-e^{-2t}}\,Z$ and $Z \sim \mathcal{N}(0, I_d)$. The score function, $\nabla \log p_t(x)$, where $p_t$ denotes the density of $X_t$, is learned from data. The reverse-time process—either an SDE or a deterministic ODE (the probability flow ODE)—depends on this score.
In discrete state spaces, the forward process is a CTMC, e.g., token-wise uniform flips or absorbing processes for masked models. The reverse process is another CTMC whose rates are modulated by discrete “score” ratios of the form $p_t(y)/p_t(x)$ between neighboring states $x, y$.
Score-based models train a neural network to approximate $\nabla \log p_t$ (or, in discrete spaces, the corresponding probability ratios), typically by minimizing an empirical denoising or entropy-type loss over noised samples. The reverse-time sampler, whether stochastic (as in DDPM) or deterministic (as in DDIM/probability-flow ODE), is implemented as a discretized chain (continuous: Euler or exponential integrator; discrete: time-homogeneous CTMC step).
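The forward marginals, the score, and the discretized reverse chain can be sketched for a one-dimensional Gaussian target, where $\nabla \log p_t$ is available in closed form. This is an illustrative toy (parameters, horizon, and step count are arbitrary choices), not any paper's exact scheme:

```python
import math, random

random.seed(0)

# Target: N(mu, sigma^2); forward VP/OU process dX = -X dt + sqrt(2) dB.
mu, sigma = 2.0, 0.5
T, N = 5.0, 300          # time horizon and number of reverse steps
h = T / N

def score(x, t):
    """Exact score of the forward marginal p_t = N(m_t, v_t)."""
    m_t = math.exp(-t) * mu
    v_t = math.exp(-2 * t) * sigma**2 + 1.0 - math.exp(-2 * t)
    return -(x - m_t) / v_t

def sample():
    """Euler-Maruyama discretization of the reverse-time SDE
    dY = (Y + 2 * score) ds + sqrt(2) dW, initialized from N(0, 1),
    which is close to the true p_T for large T."""
    y = random.gauss(0.0, 1.0)
    for k in range(N):
        t = T - k * h        # current forward time (never hits t = 0)
        y += h * (y + 2 * score(y, t)) + math.sqrt(2 * h) * random.gauss(0.0, 1.0)
    return y

xs = [sample() for _ in range(4000)]
mean = sum(xs) / len(xs)
std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
```

The empirical mean and standard deviation of the output samples should approach the target's $\mu = 2$ and $\sigma = 0.5$ up to discretization and Monte Carlo error.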
2. Non-Asymptotic Convergence: Minimal Assumption Regimes
Sharp non-asymptotic convergence results have emerged under minimal data assumptions, dispensing with strong requirements such as Lipschitz scores, log-concavity, or smoothness. For target distributions on $\mathbb{R}^d$ with compact support or finite moments, and access to accurate Stein score estimates, the probability-flow ODE (DDIM-type) sampler achieves TV error
$$\mathrm{TV} \;\lesssim\; \frac{d}{N} + \sqrt{d}\,\varepsilon_{\mathrm{score}} + d\,\varepsilon_{\mathrm{Jacobi}}$$
(up to logarithmic factors), where $N$ is the number of steps, $d$ the ambient dimension, and $\varepsilon_{\mathrm{score}}, \varepsilon_{\mathrm{Jacobi}}$ are the mean score and Jacobian estimation errors. With perfect scores, TV convergence within $\varepsilon$ requires $N \gtrsim d/\varepsilon$ (Li et al., 2024).
For continuous-time samplers with only finite first/second moments and $L^2$-score accuracy, $N = \widetilde{O}(d/\varepsilon^2)$ steps suffice to achieve KL error $\varepsilon^2$ (on top of the score-error contribution) after adding Gaussian noise of variance $\delta^2$, with only logarithmic dependence on $1/\delta$ (Benton et al., 2023). In the stochastic setting (DDPM), TV error scales as $\widetilde{O}(d/N)$ under minimal conditions, improving over prior rates (Li et al., 2024).
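The decay of KL error in the number of steps can be checked exactly in the Gaussian toy setting: because the exact score is linear, the Euler reverse chain is linear-Gaussian, so its law can be propagated in closed form and compared to the target without Monte Carlo. This is a numerical illustration of the step-count dependence, not a reproduction of any cited bound:

```python
import math

mu, sigma2 = 2.0, 0.25     # target N(mu, sigma2)
T = 5.0                    # time horizon of the forward OU process

def marginal(t):
    """Mean and variance of the forward OU marginal p_t."""
    e = math.exp(-t)
    return e * mu, e * e * sigma2 + 1.0 - e * e

def kl_after(N):
    """Exact KL(law of the Euler reverse chain at time 0 || target).
    Each step is affine-Gaussian: y' = a*y + b + sqrt(2h)*xi, so the
    chain's mean m and variance v propagate in closed form."""
    h = T / N
    m, v = marginal(T)          # initialize at the true p_T
    for k in range(N):
        t = T - k * h
        mt, vt = marginal(t)
        a = 1.0 + h * (1.0 - 2.0 / vt)
        b = 2.0 * h * mt / vt
        m = a * m + b
        v = a * a * v + 2.0 * h
    # KL between N(m, v) and N(mu, sigma2)
    return 0.5 * (v / sigma2 + (m - mu) ** 2 / sigma2 - 1.0 - math.log(v / sigma2))

kl_coarse, kl_fine = kl_after(50), kl_after(800)
```

Refining the grid from 50 to 800 steps should visibly shrink the KL error, consistent with the polynomial-in-$N$ decay described above.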
Discrete-state models on spaces such as $[S]^d$ attain similar scaling in KL or TV, provided the learned (discrete) scores meet suitable pathwise entropy or Fisher-information controls. These results extend to masking, uniform, and absorbing diffusions, and remove previous constraints on score boundedness or data support (Zhang et al., 2024, Conforti et al., 29 Nov 2025, Liang et al., 2 Jun 2025).
3. Influence of Score Approximation and Discretization Error
The end-to-end accuracy is determined not only by the underlying dynamics and step size, but crucially by the score estimation error. For samplers with discrete-time score updates $s_k(\cdot)$, an $\ell_2$-mean score error of $\varepsilon_{\mathrm{score}}$ per step contributes a term of order $\sqrt{d}\,\varepsilon_{\mathrm{score}}$ (up to logarithmic factors) to the total TV error for ODE samplers (Li et al., 2024), and similarly for SDE/DDPM samplers. Consistent score learning via empirical risk minimization at rate $\varepsilon$ thus translates directly into sampling accuracy of the same order.
Discretization error accumulates at the worst-case rate $O(d/N)$, but instance-dependent results reveal a spectrum: for scores with small Lipschitz constant $L$, the number of steps required to achieve TV error $\varepsilon$ interpolates between the dimension-dependent worst case and an essentially dimension-free rate governed by $L$ (Jiao et al., 2024). For mixtures of Gaussians, whose scores admit small effective Lipschitz constants, this yields dimension-free convergence with at most polylogarithmic dependence on the number of mixture components $k$ (Li et al., 7 Apr 2025).
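The role of the score's Lipschitz constant can be made concrete for a noised two-component Gaussian mixture. The sketch below (illustrative parameters, crude finite-difference estimate) computes the mixture score under the VP/OU forward process and estimates its Lipschitz constant at two noise levels:

```python
import math

# Two-component 1D GMM: weights w, means mus, common variance s2.
w = [0.5, 0.5]
mus = [-3.0, 3.0]
s2 = 1.0

def score(x, t):
    """Score of p_t under the VP/OU forward process: component means
    shrink to e^{-t}*mu_i and variances become e^{-2t}*s2 + 1 - e^{-2t}.
    Computed via posterior responsibilities (log-sum-exp for stability)."""
    e = math.exp(-t)
    vt = e * e * s2 + 1.0 - e * e
    logws = [math.log(wi) - (x - e * mi) ** 2 / (2 * vt) for wi, mi in zip(w, mus)]
    mx = max(logws)
    resp = [math.exp(l - mx) for l in logws]
    z = sum(resp)
    return sum(r / z * (-(x - e * mi) / vt) for r, mi in zip(resp, mus))

def lipschitz_estimate(t, lo=-8.0, hi=8.0, n=2000):
    """Crude estimate of max |score'| over a grid via finite differences."""
    h = (hi - lo) / n
    return max(abs(score(lo + (i + 1) * h, t) - score(lo + i * h, t)) / h
               for i in range(n))

L0 = lipschitz_estimate(0.0)   # near the data: well-separated modes, large L
L2 = lipschitz_estimate(2.0)   # near noise: almost standard Gaussian, L near 1
```

The estimate is large near the data (where the modes are well separated) and close to 1 near the noise end, matching the intuition that most of the difficulty, and hence most of the required steps, concentrates at small times.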
Systematic error decomposition—density-ratio recursion, Taylor expansion, score-movement coupling, Girsanov-based pathwise KL, or discrete analogues thereof—is the technical core of most proofs. Concentration arguments explicitly handle rare events, while telescopic summation of one-step errors quantifies overall convergence.
4. Impact of Problem Structure: Manifold and Information-Theoretic Adaptivity
The presence of intrinsic low-dimensional structure or “easy” measure properties can dramatically accelerate convergence. When the target law is supported on a $k$-dimensional smooth submanifold of $\mathbb{R}^d$, the iteration complexity for fixed KL accuracy scales linearly in $k$ up to logarithmic terms, independent of the ambient dimension $d$ (Potaptchik et al., 2024). Minimax score accuracy, sampling complexity, and discretization error all become functions of the intrinsic dimension and (when working in KL) may be fully ambient-dimension–free (Azangulov et al., 2024).
Information-theoretic analyses connect convergence to Shannon entropy or mutual information, yielding dimension-free bounds in which the KL error is controlled by entropy-type quantities of the data rather than the ambient dimension, using geometric signal-to-noise-ratio grids in time (Aghapour et al., 29 Jan 2026). For diffusion LLMs, the error scales linearly with the sum of per-token mutual information terms (Li et al., 27 May 2025).
In discrete models, similar “low-dimensional adaptivity” appears via Fisher-information or entropy, as well as by schedule design for masking and random walk processes, so that complexity may scale with the number of “relevant” coordinates up to logs.
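Geometric time grids of the kind referenced above can be generated as follows; the endpoints, early-stopping time, and decay rule are illustrative assumptions, not a specific paper's schedule:

```python
import math

def geometric_time_grid(T=10.0, delta=1e-3, N=20):
    """Times delta = t_0 < ... < t_N = T spaced geometrically, so that
    step sizes shrink proportionally to the current time near t = 0,
    concentrating steps where the score (and SNR) changes fastest."""
    r = (T / delta) ** (1.0 / N)
    return [delta * r**k for k in range(N + 1)]

grid = geometric_time_grid()
ratios = [b / a for a, b in zip(grid, grid[1:])]  # constant ratio = geometric
```

Such schedules trade uniform step sizes for uniform multiplicative progress in the signal-to-noise ratio, which is what allows the logarithmic (rather than polynomial) dependence on the early-stopping parameter.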
5. Extensions: Discrete, Manifold, and Riemannian Settings
The convergence theory is robust across a range of data spaces:
- Discrete state spaces ($[S]^d$, or sequences with absorbing mask tokens): Uniformization-based sampling yields KL and TV errors that decay in the number of steps and the score-entropy estimation error, and the pathwise analysis derandomizes the process to achieve convergence without requiring uniform score bounds (Chen et al., 2024, Liang et al., 2 Jun 2025, Conforti et al., 29 Nov 2025). Absorbing chains (masking) can improve iteration complexity in KL, outperforming uniform dynamics.
- Manifold and non-Euclidean geometry: Under mild regularity and bounded-curvature assumptions, polynomial-time convergence in TV holds for Riemannian diffusion models using only $L^2$-accurate scores, with total-variation error $\varepsilon$ reached in a number of steps polynomial in the dimension, $1/\varepsilon$, and $1/\lambda$, where $\lambda$ is the spectral gap (Xu et al., 5 Jan 2026).
- Original DDPM under general schedules: For the original DDPM, an explicit total-variation bound is established that decomposes into terms for initial Gaussian mismatch, averaged learning error on the score network, and discretization artifacts, all vanishing for reasonable schedules and score error (Nakano, 2024).
- Deterministic samplers: Unified frameworks for deterministic ODE-based samplers (probability-flow ODE, DDIM) yield similar bounds, with polynomial scaling for the variance-preserving OU process under exponential-integrator schemes, given $L^2$-accurate score and divergence estimation (Li et al., 2024).
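The uniformization construction behind the discrete-state results can be sketched generically: choose $\lambda \ge \max_i |Q_{ii}|$, set $P = I + Q/\lambda$, draw a Poisson($\lambda T$) number of jumps, and apply $P$ at each jump. The rate matrix below is a made-up 3-state toy, not any cited model's reverse chain:

```python
import math, random

random.seed(1)

# Toy 3-state rate matrix Q (rows sum to zero).
Q = [[-1.0, 0.7, 0.3],
     [0.4, -0.9, 0.5],
     [0.2, 0.8, -1.0]]
T = 2.0
lam = max(-Q[i][i] for i in range(3))          # uniformization rate
P = [[(1.0 if i == j else 0.0) + Q[i][j] / lam for j in range(3)] for i in range(3)]

def poisson(rate):
    """Knuth's algorithm for Poisson sampling."""
    l, k, p = math.exp(-rate), 0, 1.0
    while True:
        p *= random.random()
        if p <= l:
            return k
        k += 1

def step(state):
    u, acc = random.random(), 0.0
    for j, pj in enumerate(P[state]):
        acc += pj
        if u < acc:
            return j
    return len(P[state]) - 1

def simulate(start=0):
    """Uniformization: a Poisson(lam*T) number of jumps of the DTMC P."""
    s = start
    for _ in range(poisson(lam * T)):
        s = step(s)
    return s

n = 20000
counts = [0, 0, 0]
for _ in range(n):
    counts[simulate()] += 1
freq = [c / n for c in counts]

# Exact first row of e^{QT} via the same series: sum_k e^{-lT}(lT)^k/k! * P^k.
def rowP(row):
    return [sum(row[i] * P[i][j] for i in range(3)) for j in range(3)]

row, exact, term = [1.0, 0.0, 0.0], [0.0, 0.0, 0.0], math.exp(-lam * T)
for k in range(60):
    exact = [e + term * r for e, r in zip(exact, row)]
    row = rowP(row)
    term *= lam * T / (k + 1)
```

The empirical state frequencies should match the transition probabilities of $e^{QT}$, computed here from the truncated uniformization series itself.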
6. Methodological Innovations and Theoretical Comparison
Recent advances introduce technically novel tools to enable these sharp convergence rates under weak assumptions:
- Discrete-time, non-asymptotic analysis eliminates reliance on PDE or SDE theory; proofs are based on recursive density-ratio formulae, elementary change-of-variable, and direct concentration inequalities (Li et al., 2024, Li et al., 2023).
- Stochastic localization techniques refine control over discretization errors, reducing necessary smoothness or dimension scaling (Benton et al., 2023).
- Information-theoretic reformulations (e.g., MMSE and conditional covariance control) yield dimension-free results and enable adaptive, loss-driven stepping schedules (Aghapour et al., 29 Jan 2026).
- Instance-dependent rates interpolate between worst-case and “easy-data” regimes, and concretely match known structure in cases such as Gaussian mixtures (Jiao et al., 2024, Li et al., 7 Apr 2025).
- Absorbing boundary and discrete score monotonicity underpin improved analysis of masking, random walk, and categorical diffusion models (Liang et al., 2 Jun 2025, Conforti et al., 29 Nov 2025).
Comparison to prior work demonstrates that older analyses often required exponential-in-$d$ or superlinear dimension dependence, strong regularity, or log-Sobolev/Poincaré inequalities. Current state-of-the-art results reach (or nearly reach) optimal iteration complexity for continuous and discrete samplers, with adaptability to structure and near-minimal data assumptions.
Table: Representative Convergence Rates for Diffusion Samplers
| Sampler Type | Iteration Complexity (to TV/KL error $\varepsilon$) | Assumptions |
|---|---|---|
| Probability-flow ODE (DDIM) | $\widetilde{O}(d/\varepsilon)$ | Compact support, $\varepsilon$-score error (Li et al., 2024) |
| Stochastic sampler (DDPM) | $\widetilde{O}(d^2/\varepsilon)$ | 1st/2nd moment, $\varepsilon$-score error (Li et al., 2023) |
| Discrete CTMC, uniform or mask rates | improved for absorbing (masking) rates in KL; TV similar | Score entropy bound, mild integrability (Liang et al., 2 Jun 2025) |
| Manifold-supported target (KL) | linear in intrinsic dimension, up to logs | $C$-regularity, intrinsic dimension (Potaptchik et al., 2024) |
| GMM data (KL/TV) | dimension-free up to polylog factors | GMM target or approximation (Li et al., 7 Apr 2025) |
7. Synthesis and Outlook
The convergence properties of diffusion models are now quantitatively understood under a wide variety of regimes: continuous and discrete, Euclidean and Riemannian, deterministic and stochastic, with explicit iteration complexity as a function of dimension, data structure, and score error. The most recent theories show that, for generic high-dimensional data, the leading scaling can be near-linear in (intrinsic) dimension and (nearly) optimal in error tolerance. Specialized bounds for Gaussian mixtures and manifold support explain the practical efficiency of diffusion samplers on real data.
Ongoing directions include developing end-to-end statistical rates that jointly account for sample, training, and discretization error; extending theory to more complex discrete settings (e.g., language, graphs); and refining accelerated or adaptive integration methods guided by data geometry and information-theoretic quantities. The theoretical landscape now offers a calibrated map for both practical algorithm design and further foundational investigation in generative modeling.