
Annealed Langevin Dynamics

Updated 8 February 2026
  • Annealed Langevin Dynamics is a class of Markov process methodologies that combines time-inhomogeneous Langevin dynamics with an annealing schedule to efficiently sample from complex, multimodal distributions.
  • It constructs an interpolation path from an easy-to-sample reference to the target, enabling polynomial complexity for non-log-concave sampling challenges.
  • ALD underpins advances in score-based generative modeling, Bayesian inference, and optimization by overcoming the exponential slow mixing of standard Langevin Monte Carlo.

Annealed Langevin Dynamics (ALD) is a class of Markov process methodologies for sampling, inference, and optimization that combines time-inhomogeneous Langevin diffusion with an annealing or continuation schedule. ALD was initially motivated by the need to efficiently sample from complex, non-log-concave or multimodal target distributions for which standard Langevin Monte Carlo (LMC) may mix exponentially slowly. The ALD framework introduces a family of intermediate distributions, or an “annealing path,” interpolating between an easy-to-sample reference and the target, and defines a Langevin dynamics (continuous or discrete) that tracks this path by progressively modifying the drift coefficient. Quantitative, non-asymptotic theoretical analyses of ALD reveal polynomial oracle complexities for broad classes of non-convex sampling problems, breaking the exponential bottleneck of LMC on multimodal landscapes, and highlight the critical role of the interpolation “action” along the path of measures (Guo et al., 2024). ALD also underpins many recent advances in score-based generative modeling, inverse problems, multimodal detection, and high-dimensional Bayesian inference.

1. Mathematical Formulation and Algorithmic Structure

The canonical version of continuous-time ALD is defined by a family of target densities \{\pi_\theta\}_{\theta\in[0,1]} with

\pi_\theta(x) \propto \exp\left(-\,\eta(\theta)V(x)\;-\;\tfrac{\lambda(\theta)}{2}\|x\|^2\right),

where V is the target potential (assumed \beta-smooth) and \eta, \lambda are monotone C^1 annealing schedules interpolating between an initial “easy” density \pi_0 and the target \pi_1 = \pi. The state X_t evolves according to

dX_t = -\bigl[\eta(t/T)\nabla V(X_t) + \lambda(t/T)\,X_t\bigr]\,dt + \sqrt{2}\,dB_t, \qquad X_0 \sim \pi_0,

with time rescaled via t \mapsto \theta = t/T over [0,T] (Guo et al., 2024, Cordero-Encinar et al., 13 Feb 2025, Cattiaux et al., 13 Nov 2025).

In discrete time, this yields the Annealed Langevin Monte Carlo (ALMC) update:

x_k = \Lambda_0(\theta_k,\theta_{k-1})\,x_{k-1} \;-\; H(\theta_k,\theta_{k-1})\,\nabla V(x_{k-1}) \;+\; \Lambda_1(\theta_k,\theta_{k-1})\,\xi_k,

where the coefficients \Lambda_0, H, and \Lambda_1 are defined by integrals of the schedules and \xi_k \sim \mathcal{N}(0, I_d) (Guo et al., 2024).
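As a concrete sketch, the update above can be approximated by replacing the exact schedule-integral coefficients \Lambda_0, H, \Lambda_1 with a single first-order Euler–Maruyama step of the annealed SDE. The schedules \eta(\theta)=\theta, \lambda(\theta)=1-\theta and all function names below are illustrative choices, not prescribed by the cited papers:

```python
import numpy as np

def almc(grad_V, d, n_chains=500, n_steps=400, T=10.0,
         eta=lambda th: th, lam=lambda th: 1.0 - th, seed=0):
    """Annealed Langevin Monte Carlo with a uniform schedule.

    First-order sketch: the exact coefficients Lambda_0, H, Lambda_1
    (schedule integrals) are replaced by one Euler-Maruyama step of
    dX = -[eta * grad V + lam * X] dt + sqrt(2) dB per level.
    """
    rng = np.random.default_rng(seed)
    h = T / n_steps                           # uniform step size
    x = rng.standard_normal((n_chains, d))    # X_0 ~ pi_0 = N(0, I_d)
    for k in range(1, n_steps + 1):
        theta = k / n_steps                   # annealing parameter in [0, 1]
        drift = eta(theta) * grad_V(x) + lam(theta) * x
        x = x - h * drift + np.sqrt(2.0 * h) * rng.standard_normal(x.shape)
    return x

# Sanity check on a log-concave target V(x) = ||x||^2 / 2, so pi_1 = N(0, I).
samples = almc(lambda x: x, d=1)
```

On this toy target the empirical mean and variance of the final samples should be close to 0 and 1, up to the O(h) discretization bias of the Euler scheme.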

These constructions generalize to diffusion- and score-based models, including nonlinear interpolations (e.g., convolution/denoising paths as in diffusion generative models (Cordero-Encinar et al., 13 Feb 2025, Cattiaux et al., 13 Nov 2025)) and to kinetic or higher-order Langevin systems (He et al., 2022, Zilberstein et al., 2023).

2. Theoretical Guarantees: Complexity and Error Bounds

ALD enables polynomial-time sampling from non-log-concave and multimodal targets when the path action

\mathcal{A} = \int_0^1 |\dot\pi_\theta|^2 \, d\theta

is finite, where |\dot\pi_\theta| is the W_2 metric derivative of the interpolation curve (\pi_\theta). For potentials V that are \beta-smooth and targets with finite second moment, the main complexity bound for ALMC is:

M = \widetilde{O}\left( \frac{d\,\beta^2\,\mathcal{A}^2}{\varepsilon^6} \right)

gradient evaluations to achieve \mathrm{KL}(\nu \,\|\, \pi) \le \varepsilon^2, where M is the number of steps, d the dimension, and \varepsilon the target KL error (Guo et al., 2024). The proof uses a Girsanov argument comparing the ALD path to an idealized flow matching the interpolants exactly; the action \mathcal{A} quantifies the “difficulty” of moving probability mass between \pi_0 and \pi.
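To make the scaling concrete, a small helper (name and constant `c` hypothetical) can read off the bound; halving the target error multiplies the gradient budget by 2^6 = 64, while the cost grows only linearly in dimension:

```python
def almc_oracle_complexity(d, beta, action, eps, c=1.0):
    """Order-of-magnitude reading of M = O~(d * beta^2 * A^2 / eps^6).

    c absorbs the universal constant and logarithmic factors hidden in
    the O~ notation, so this is a scaling guide, not an exact count.
    """
    return c * d * beta ** 2 * action ** 2 / eps ** 6

# Halving eps from 0.1 to 0.05 inflates the budget by (0.1 / 0.05)^6 = 64.
ratio = (almc_oracle_complexity(10, 1.0, 1.0, 0.05)
         / almc_oracle_complexity(10, 1.0, 1.0, 0.1))
```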

In the diffusion-based setting (DALMC), analogous non-asymptotic KL bounds hold with path action controlled by moments of the data and schedule regularity. For the Gaussian path,

\mathcal{A} \lesssim C_\lambda (M_2 + d),

where M_2 is the second moment of the data and C_\lambda a constant depending on the schedule (Cordero-Encinar et al., 13 Feb 2025). To reach error \le \varepsilon^2, the step count is M = O\left( d\,(M_2 \vee d)^2 L_{\max}^2\, \varepsilon^{-6} \right), with L_{\max} the largest Lipschitz constant of the interpolant scores.

Comparison with classical LMC: On multimodal or non-log-concave targets, LMC without annealing can be exponentially slow due to poor isoperimetry or lack of a functional inequality. ALD, by contrast, achieves polynomial complexity under only smoothness and path action conditions, without requiring a log-Sobolev inequality (Guo et al., 2024).

3. Design of the Annealing Path and Schedule

The performance of ALD critically depends on the design of the path of intermediate distributions. Canonical choices include mixtures of the target with simple (e.g., Gaussian) reference measures, or convolutional diffusions:

\mu_t = \mathcal{N}(0, \sigma^2 I) \ast \bigl(\pi_{\mathrm{data}} \circ (x \mapsto x/\sqrt{\lambda_t})\bigr)

as in score-based generative models (Cordero-Encinar et al., 13 Feb 2025, Cattiaux et al., 13 Nov 2025).
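For a one-dimensional Gaussian-mixture \pi_{\mathrm{data}}, the convolution path stays a Gaussian mixture, so the intermediate scores \nabla \log \mu_t are available in closed form. A minimal sketch (equal-weight components; function name hypothetical):

```python
import numpy as np

def smoothed_mixture_score(x, means, stds, lam, sigma):
    """Score of mu_t = N(0, sigma^2) * law(sqrt(lam) * Y), where Y is an
    equal-weight 1-D Gaussian mixture with the given component means/stds.
    Component j of mu_t is N(sqrt(lam) * m_j, lam * s_j^2 + sigma^2)."""
    mu = np.sqrt(lam) * np.asarray(means, dtype=float)
    var = lam * np.asarray(stds, dtype=float) ** 2 + sigma ** 2
    # Posterior responsibilities r_j(x) over components (stabilized exp).
    logp = -0.5 * (x - mu) ** 2 / var - 0.5 * np.log(2.0 * np.pi * var)
    r = np.exp(logp - logp.max())
    r /= r.sum()
    # The mixture score is the responsibility-weighted sum of component scores.
    return float(np.sum(r * (mu - x) / var))

# With a single N(0, 1) component and no smoothing, the score is just -x.
s = smoothed_mixture_score(2.0, means=[0.0], stds=[1.0], lam=1.0, sigma=0.0)
```

In score-based generative modeling this closed form is replaced by a learned network; the analytic case is mainly useful for testing samplers against a known ground truth.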

For multimodal Gaussian mixture targets, explicit smoothing paths (via Gaussian convolution) allow dimension-free complexity when the smoothing and preconditioning spectra decay sufficiently rapidly:

\sum_{j} \frac{\lambda_j^2}{\gamma_j \sigma_{ij}} < \infty \quad \forall\, i

with \lambda_j and \gamma_j the smoothing and drift-preconditioning eigenvalues, and \sigma_{ij} the component variances (Baldassari et al., 1 Feb 2026). Preconditioning is essential for robustness of ALD to score errors or misspecified initialization in high dimension.

Annealing schedule choices (e.g., uniform, geometric, and cosine interpolants) influence the total path action. The use of adaptive or non-uniform time grids to minimize the action, and hence the complexity, is an area of ongoing research (Guo et al., 2024).
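The three interpolant families named above can be written down directly; the geometric ratio `r` below is an illustrative free parameter, and all names are hypothetical:

```python
import numpy as np

def uniform_schedule(k, K):
    """Evenly spaced theta values on [0, 1]."""
    return k / K

def geometric_schedule(k, K, r=10.0):
    """Normalized geometric progression: slow early, fast late for r > 1."""
    return (r ** (k / K) - 1.0) / (r - 1.0)

def cosine_schedule(k, K):
    """Smooth start and finish: zero derivative at both endpoints."""
    return 0.5 * (1.0 - np.cos(np.pi * k / K))

# All three map step k = 0 to theta = 0 and k = K to theta = 1, monotonically.
midpoints = [uniform_schedule(5, 10), geometric_schedule(5, 10), cosine_schedule(5, 10)]
```

The families differ only in how they spend the step budget along the path; a schedule concentrating steps where the interpolation moves mass fastest keeps the effective action low.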

4. Discretization and Implementation Considerations

ALD is typically implemented via Euler–Maruyama discretization. The step size and number of steps are determined by the target accuracy, the path smoothness, and the Lipschitz constants of the interpolant scores. For a uniform schedule, the discretization error is O(d\,\beta^2\,T^2/M), which is controllable provided M \gtrsim d\,\beta^2\,\mathcal{A}^2/\varepsilon^6 (Guo et al., 2024). In the diffusion-path setting, the step size per level h_\ell is set to O(\varepsilon^6/[d\,(M_2 \vee d)^2 L_{\max}^2]) to ensure error control (Cordero-Encinar et al., 13 Feb 2025).

Each ALD step requires one gradient (score) evaluation of V at the current intermediate \theta, plus possible quadratic or convolutional terms from the reference distribution. If the score function must be learned, the sample complexity is determined by the L^2 or L^4 norm of the score-estimation error, with robustness improved under preconditioning (Baldassari et al., 1 Feb 2026).

In higher-order or kinetic extensions, e.g., underdamped or generalized Langevin equations, ALD can be implemented with splitting integrators (e.g., BAOAB) to propagate momentum and memory variables (Zilberstein et al., 2023, He et al., 2022, Chak et al., 2020). This yields accelerated convergence and empirical improvements in multimodal/multibarrier energy landscapes.
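A single BAOAB step for the underdamped (kinetic) dynamics, with unit mass and unit temperature, looks like this; it is a generic splitting sketch, not the exact integrator of any one cited paper:

```python
import numpy as np

def baoab_step(x, v, grad_U, h, gamma=1.0, rng=None):
    """One BAOAB step for dX = V dt, dV = -grad U(X) dt - gamma V dt + sqrt(2 gamma) dB.

    B: half velocity kick, A: half position drift, O: exact Ornstein-Uhlenbeck
    refresh of the velocity, then A and B again (symmetric splitting).
    """
    rng = rng if rng is not None else np.random.default_rng()
    v = v - 0.5 * h * grad_U(x)                                           # B
    x = x + 0.5 * h * v                                                   # A
    c = np.exp(-gamma * h)
    v = c * v + np.sqrt(1.0 - c * c) * rng.standard_normal(np.shape(v))   # O
    x = x + 0.5 * h * v                                                   # A
    v = v - 0.5 * h * grad_U(x)                                           # B
    return x, v

# Quadratic potential U(x) = x^2 / 2: the position marginal should stay ~ N(0, 1).
rng = np.random.default_rng(0)
x, v = rng.standard_normal(2000), rng.standard_normal(2000)
for _ in range(500):
    x, v = baoab_step(x, v, lambda y: y, h=0.2, gamma=1.0, rng=rng)
```

In an annealed variant, `grad_U` would track the interpolant potential \eta(\theta)V + \tfrac{\lambda(\theta)}{2}\|x\|^2 as \theta advances, exactly as in the overdamped case.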

5. Applications in Multimodal Sampling, Inference, and Generative Modeling

Non-log-concave and Multimodal Sampling

ALD provides the first polynomial-time provable algorithm for non-log-concave, multimodal sampling without isoperimetric or Lyapunov drift conditions, as long as the path action is finite (Guo et al., 2024, Baldassari et al., 1 Feb 2026). Notably, for Gaussian mixtures with N modes at radius r, ALMC achieves

M = \widetilde{O}\left( d^3\,\beta^2\,r^4\,(r^4\beta^2 \vee d^2)\,\varepsilon^{-6} \right)

while LMC under classical LSI scaling would require time exponential in r^2.

Score-based Diffusion Models

DALMC instantiates ALD along diffusion or convolutional paths, bridging a base distribution to complex high-dimensional data; this enables generative sampling of images or audio with explicit KL error control (Cordero-Encinar et al., 13 Feb 2025, Cattiaux et al., 13 Nov 2025, Kameoka et al., 2020). With Student's t base distributions, DALMC also supports heavy-tailed data. Integration with pretrained score networks underpins practical applications in image synthesis and voice conversion.

Bayesian Inference and Inverse Problems

ALD and its Annealed Langevin Monte Carlo variant are used for posterior sampling in high-dimensional linear inverse problems, e.g., compressed sensing, MIMO detection, and imaging, achieving lower symbol- and reconstruction-error rates and robustness to nonconvexity (Zilberstein et al., 2023, Zilberstein et al., 2022).

Optimization via Annealed Sampling

In simulated annealing, ALD and its kinetic versions provide global optimization guarantees for nonconvex objectives under logarithmic cooling schedules, with rates controlled by energetic barrier heights. Mean-field ALD generalizes this to the Wasserstein-gradient flow setting for measures (He et al., 2022, Chizat, 2022, Pareschi, 2024).

6. Extensions, Practical Recommendations, and Open Directions

ALD generalizes naturally to higher-order (underdamped, memory-augmented, or generalized Langevin) dynamics, which empirically and sometimes theoretically accelerate mixing and escape from traps in non-convex, rough, high-dimensional landscapes (Zilberstein et al., 2023, He et al., 2022, Chak et al., 2020). For multimodal and ill-conditioned scenarios, preconditioning is essential to obtain dimension-robust performance (Baldassari et al., 1 Feb 2026).

Practical parameter guidelines include: (i) choosing annealing schedules that keep the path action low (via schedule smoothness and endpoint singularity control); (ii) tuning step size and the number of annealing levels to control discretization bias; (iii) employing preconditioning aligned to local geometry to equalize mixing rates; and (iv) adopting robust or score-learning strategies to mitigate score estimation or initialization errors.
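Item (iii) can be sketched with a fixed diagonal preconditioner; this is an illustrative minimal choice (adaptive, geometry-aware preconditioners are also used), and all names are hypothetical:

```python
import numpy as np

def precond_langevin_step(x, grad_V, p_diag, h, rng):
    """One Langevin step with a diagonal preconditioner P = diag(p_diag):

        x <- x - h * P * grad V(x) + sqrt(2 h) * P^{1/2} * xi

    This leaves exp(-V) invariant (up to discretization bias) while
    equalizing mixing rates across ill-conditioned coordinates.
    """
    xi = rng.standard_normal(x.shape)
    return x - h * p_diag * grad_V(x) + np.sqrt(2.0 * h * p_diag) * xi

# Ill-conditioned Gaussian target V(x) = x_1^2 / 2 + x_2^2 / (2 * 100):
# preconditioning by the target variances makes both directions mix equally fast.
scales2 = np.array([1.0, 100.0])   # per-coordinate target variances
rng = np.random.default_rng(1)
x = np.zeros((1000, 2))
for _ in range(1500):
    x = precond_langevin_step(x, lambda y: y / scales2, scales2, h=0.1, rng=rng)
```

After preconditioning, both coordinates contract at the same rate per step, so the chain's mixing time no longer depends on the condition number of the target.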

Open research questions include: tightening the dependence on \varepsilon (e.g., reducing the exponent 6); constructing optimal or adaptive paths that minimize the action; extending the theory to non-smooth or heavy-tailed potentials; and proving matching lower bounds that establish the near-optimality of the ALD framework (Guo et al., 2024, Baldassari et al., 1 Feb 2026).

7. Comparison with Standard Methods and Limitations

ALD offers rigorous complexity and convergence guarantees where standard LMC either fails to mix or requires restrictive functional inequalities (e.g., LSI). For multimodal, non-log-concave targets, ALD both empirically and theoretically surpasses LMC, with only finite moment and smoothness assumptions, and with bias or oracle complexity controlled by geometric properties of the chosen interpolation path.

However, the cost, while polynomial in the dimension, smoothness, and inverse error, carries relatively high exponents in \varepsilon; moreover, the path action \mathcal{A} can be large if modes are widely separated. The need for closed-form or efficiently computable intermediate path scores at each annealing level is a further limitation in certain models or data settings (Guo et al., 2024, Baldassari et al., 1 Feb 2026, Cordero-Encinar et al., 13 Feb 2025).

In summary, Annealed Langevin Dynamics offers a general, theoretically principled, and practically effective framework for sampling, inference, and global optimization in high-dimensional, nonconvex, and multimodal scenarios, with a rich structure admitting accelerated, preconditioned, and learned extensions (Guo et al., 2024, Baldassari et al., 1 Feb 2026, Cordero-Encinar et al., 13 Feb 2025, Cattiaux et al., 13 Nov 2025).
