
Annealed Langevin Dynamics

Updated 2 February 2026
  • Annealed Langevin dynamics is a stochastic process that progressively reduces noise to transition from tractable base distributions to complex, multimodal targets.
  • It leverages time-varying stochastic differential equations and annealing schedules such as logarithmic or geometric cooling for effective exploration and convergence.
  • Applications include statistical sampling, inverse problems, and score-based generative modeling, supported by polynomial-time convergence guarantees under mild conditions.

Annealed Langevin dynamics is a family of time-inhomogeneous Markov processes that interpolate between tractable reference distributions and complex target distributions using a controlled annealing (cooling) schedule, with applications spanning statistical sampling, global optimization, and score-based generative modeling. The core idea is to leverage the smoothing effects of high noise or high temperature early in the process (facilitating exploration) and then gradually reduce noise to target highly structured, potentially multimodal or constrained target distributions. Algorithmically, this entails simulating variants of Langevin stochastic differential equations (SDEs) with time-varying diffusion coefficients, possibly including higher-order dynamics or advanced discretization schemes. Recent work provides polynomial-time convergence guarantees under minimal assumptions, robustness to imperfect score estimation, and practical effectiveness for high-dimensional inverse problems, non-log-concave sampling, discrete combinatorial spaces, and generative modeling.

1. Mathematical Foundations and Core SDE Formulation

The general annealed Langevin process extends the standard overdamped Langevin SDE:

dX_t = -\nabla f(X_t)\,dt + \sqrt{2T}\,dW_t

by introducing a time-dependent temperature schedule T(t), or equivalently an inverse temperature β(t) = 1/T(t). The annealed SDE thus becomes:

dX_t = -\nabla f(X_t)\,dt + \sqrt{2T(t)}\,dW_t

or, in the potential-based (probabilistic) setting:

dX_t = \nabla\log p_t(X_t)\,dt + \sqrt{2}\,dW_t,

where p_t interpolates from a simple base law p_0 to the target p_1 (possibly via convolutional or score-based interpolation) (Pareschi, 2024, Cordero-Encinar et al., 13 Feb 2025, Cattiaux et al., 13 Nov 2025). The dynamics for underdamped or higher-order Langevin processes involve auxiliary momentum or memory variables, with Ornstein-Uhlenbeck noise and additional damping/friction terms (Zilberstein et al., 2023, Chak et al., 2020, Zilberstein et al., 2022).

The corresponding Fokker-Planck equation for the law ρ_t is

\partial_t \rho_t(x) = \nabla\cdot\left(\rho_t(x)\,\nabla f(x)\right) + T(t)\,\Delta\rho_t(x),

where the time-inhomogeneous diffusion coefficient drives the process toward the global minimum as T(t) → 0 (Pareschi, 2024).
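As a concrete illustration (not from the cited works), the annealed SDE above can be simulated with a plain Euler-Maruyama discretization; the one-dimensional double-well potential, cooling constant, and step size below are illustrative assumptions:

```python
import numpy as np

# Minimal Euler-Maruyama sketch of the annealed SDE
#   dX_t = -grad f(X_t) dt + sqrt(2 T(t)) dW_t
# on the double-well f(x) = (x^2 - 1)^2; the potential, schedule
# constants, and step size are illustrative choices.

def grad_f(x):
    # gradient of f(x) = (x^2 - 1)^2, with minima at x = -1 and x = +1
    return 4.0 * x * (x**2 - 1.0)

def annealed_langevin(x0, n_steps=5000, dt=1e-3, T0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = float(x0)
    for k in range(n_steps):
        T = T0 / np.log(np.e + k)  # logarithmic cooling, T starts at T0
        x += -grad_f(x) * dt + np.sqrt(2.0 * T * dt) * rng.standard_normal()
    return x

x_final = annealed_langevin(x0=0.0)
# as T decays, iterates concentrate near one of the minima at x = +/-1
```

Early iterations see high noise and hop freely between basins; as the temperature decays, the trajectory settles into one well, which is exactly the exploration-then-convergence behavior described above.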

2. Annealing Schedules and Interpolating Distributions

A fundamental component is the calibration of the annealing schedule. The most widely used schedule is logarithmic cooling:

T(t) = \frac{E}{\log t}\,, \qquad \beta(t) = \frac{\log t}{E}

or geometric schedules for interpolating the noise variance in discrete levels,

\sigma_\ell = \sigma_1\,\left(\frac{\sigma_L}{\sigma_1}\right)^{\frac{\ell-1}{L-1}}

with ℓ = 1, …, L (Zilberstein et al., 2022, Chak et al., 2020, Cordero-Encinar et al., 13 Feb 2025). These schedules balance exploration in the high-temperature (noisier) regime and rapid convergence in the low-temperature phase.
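Both schedules are straightforward to generate; in the sketch below the constants E, σ_1, σ_L, and L are illustrative placeholders, not values from the cited papers:

```python
import numpy as np

# Hedged sketch of the two annealing schedules above; all constants
# are illustrative placeholders.

def log_cooling(t, E=1.0):
    # T(t) = E / log t (defined for t > 1); beta(t) = 1 / T(t)
    return E / np.log(t)

def geometric_sigmas(sigma_1=1.0, sigma_L=0.01, L=10):
    # sigma_l = sigma_1 * (sigma_L / sigma_1)^((l - 1) / (L - 1)), l = 1..L
    ell = np.arange(1, L + 1)
    return sigma_1 * (sigma_L / sigma_1) ** ((ell - 1) / (L - 1))

sigmas = geometric_sigmas()
# endpoints match sigma_1 and sigma_L exactly, and the ratio between
# consecutive levels is constant, which is what makes it "geometric"
```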

Interpolating distributions p_t are designed either as

  • Gibbs measures p_t(x) ∝ exp(−β(t) f(x)) (direct cooling of the potential),
  • score-based convolutional paths p_t = π ∗ ν with a blending coefficient, e.g., X̃_t = √(λ_t) X + √(1 − λ_t) Z for X ∼ π and Z ∼ ν (Cattiaux et al., 13 Nov 2025, Cordero-Encinar et al., 13 Feb 2025),
  • or a sequence of "smoothed" posteriors p(x | y, σ_ℓ) (Xun et al., 30 Oct 2025, Zilberstein et al., 2022).

The action of the interpolating path, or the metric speed in Wasserstein space, quantifies the difficulty of the schedule and directly enters KL bounds and oracle complexity (Guo et al., 2024, Cordero-Encinar et al., 13 Feb 2025).

3. Algorithmic Schemes and Discretizations

Practical annealed Langevin algorithms are implemented via discretizations of the time-inhomogeneous SDEs, often with adaptation to the annealed schedule:

The generic update at noise level ℓ is:

x_{k+1} = x_k + \epsilon_\ell\,\nabla_x \log p_\ell(x_k) + \sqrt{2\epsilon_\ell}\,z_k,

with z_k ∼ N(0, I), an annealed step size ε_ℓ, and the score ∇_x log p_ℓ(x_k) computed analytically, via Tweedie's identity, or by neural approximation (Zilberstein et al., 2022, Xun et al., 30 Oct 2025).
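The level-by-level loop can be sketched end to end on a toy two-mode Gaussian mixture, where the smoothed score is available in closed form (perturbing N(μ, s²) with N(0, σ²) noise gives N(μ, s² + σ²)); the mixture parameters, noise levels, and the common step-size rule ε_ℓ ∝ σ_ℓ² are illustrative assumptions:

```python
import numpy as np

# End-to-end sketch of the generic annealed update above on a toy
# two-mode Gaussian mixture; mixture parameters, noise levels, and the
# step-size rule eps_l ~ sigma_l^2 are illustrative assumptions.

MEANS, STD = np.array([-3.0, 3.0]), 0.5  # equal-weight mixture components

def score(x, sigma):
    # closed-form grad_x log p_l for the sigma-smoothed mixture
    var = STD**2 + sigma**2
    e = -(x - MEANS) ** 2 / (2 * var)
    w = np.exp(e - e.max())          # component responsibilities
    w = w / w.sum()                  # (shift by max for stability)
    return np.sum(w * (MEANS - x)) / var

def annealed_langevin_sample(sigmas, steps_per_level=200, eps0=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    x = sigmas[0] * rng.standard_normal()        # start from broad noise
    for sigma in sigmas:
        eps = eps0 * sigma**2 / sigmas[-1] ** 2  # anneal the step size
        for _ in range(steps_per_level):
            z = rng.standard_normal()
            x = x + eps * score(x, sigma) + np.sqrt(2 * eps) * z
    return x

sigmas = np.geomspace(5.0, 0.1, 10)  # geometric noise levels, as in Sec. 2
x = annealed_langevin_sample(sigmas)
# the returned sample should land near one of the modes at -3 or +3
```

The high-noise levels let the chain move between the two well-separated modes, while the final low-noise levels sharpen the sample onto whichever mode it has settled in.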

For higher-order schemes, state evolution involves position, momentum, and possibly memory variables, with operator splitting (e.g., BAOAB) employed for numerical stability and accuracy (Zilberstein et al., 2023, Zilberstein et al., 2022).
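One BAOAB step can be written directly from its kick (B), drift (A), and Ornstein-Uhlenbeck (O) sub-steps; the unit mass, quadratic test potential, and cooling schedule below are illustrative assumptions, not choices from the cited papers:

```python
import numpy as np

# Hedged sketch of one BAOAB step for underdamped Langevin dynamics
#   dX = P dt,  dP = -grad f(X) dt - gamma P dt + sqrt(2 gamma T) dW
# (unit mass). The O sub-step uses the exact Ornstein-Uhlenbeck update.

def baoab_step(x, p, grad_f, dt, gamma, T, rng):
    p = p - 0.5 * dt * grad_f(x)                   # B: half momentum kick
    x = x + 0.5 * dt * p                           # A: half position drift
    c = np.exp(-gamma * dt)                        # O: exact OU decay
    p = c * p + np.sqrt((1.0 - c**2) * T) * rng.standard_normal()
    x = x + 0.5 * dt * p                           # A: half position drift
    p = p - 0.5 * dt * grad_f(x)                   # B: half momentum kick
    return x, p

# Usage with annealing: lower T level by level while iterating the step.
rng = np.random.default_rng(0)
x, p = 2.0, 0.0
for k in range(2000):
    T = 1.0 / (1.0 + 0.01 * k)                     # illustrative cooling
    x, p = baoab_step(x, p, lambda x: x, dt=0.05, gamma=1.0, T=T, rng=rng)
# for f(x) = x^2 / 2 the iterates concentrate near the minimum as T decays
```

The O sub-step integrates the friction and noise exactly, which is the main reason BAOAB remains stable at the larger step sizes used in annealed schedules.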

In deep-unfolded variants, the iterative ALD chain is unrolled into a fixed-depth DNN for end-to-end learning and low-latency inference (Shaked et al., 21 Oct 2025).

4. Theoretical Properties, Complexity, and Robustness

Annealed Langevin methods admit rigorous convergence analyses under a range of assumptions:

  • For convex or strongly log-concave targets, polynomial-time mixing is obtained in total variation or KL, with step complexity polynomial in data dimension, smoothness, and action parameter (Guo et al., 2024, Xun et al., 30 Oct 2025, Cordero-Encinar et al., 13 Feb 2025).
  • With only β-smooth potentials, accurate sampling can be obtained for possibly multimodal, non-log-concave, non-isoperimetric targets, albeit at higher polynomial cost (Guo et al., 2024, Cordero-Encinar et al., 13 Feb 2025).
  • For simulated annealing in nonconvex settings (with energy barrier E_*), cooling at the optimal T(t) ∼ E/log t schedule ensures polynomial decay to the global optimum, with exponents directly reflecting the landscape's barrier structure (He et al., 2022, Chak et al., 2020).
  • Under higher-order schemes, convergence rates improve by replacing condition-number dependence (e.g., from κ to κ^{1/2} or κ^{1/3}) in mixing-time bounds, consistent with underdamped Langevin and kinetic-theory predictions (Zilberstein et al., 2023).
  • Robustness to score-approximation error is enhanced: with diffusion-based initialization and annealed Langevin updates, convergence can be controlled under an L^4 (fourth-moment) score-error bound, instead of the exponentially stronger MGF bound required by plain Langevin posterior sampling (Xun et al., 30 Oct 2025). This makes ALD fundamentally more tolerant of misspecified or neural-approximated gradients.

5. Applications: Inverse Problems, Optimization, and Generative Modeling

Annealed Langevin dynamics is central to several advanced applications:

  • Linear Inverse Problems / Posterior Sampling: Used for sampling from posteriors conditioned on linear observations y = Ax + ξ under log-concave or locally log-concave priors. The methodology combines unconditional diffusion initialization with an annealed sequence of Langevin updates, with theoretical guarantees on convergence and computational complexity (Xun et al., 30 Oct 2025, Zilberstein et al., 2023).
  • Massive MIMO Detection: ALD and its underdamped variants are used for symbol detection, with an annealed sequence of prior smoothings to transition from globally smoothed to constellation-constrained symbols. Empirical results demonstrate state-of-the-art symbol error rates, outperforming classical and deep learning baselines (Zilberstein et al., 2022).
  • Score-based Generative Modeling: DALMC and related samplers interpolate between a base distribution (Gaussian or Student-t) and data; learned denoising-score networks provide the vector field, and annealed Langevin is used for reverse-time sampling. All pathwise and stationary performance bounds are quantified in KL divergence with polynomial dependence on the action (Wasserstein path length) (Cordero-Encinar et al., 13 Feb 2025, Cattiaux et al., 13 Nov 2025).
  • Non-parallel Voice Conversion: VoiceGrad applies annealed Langevin sampling over learned score fields (level-dependent smoothings) for sequence data, producing high-quality mel-spectrograms for arbitrary input-output speaker pairs (Kameoka et al., 2020).
  • Channel Optimization and AI Unfolding: RIS-aided channel optimization leverages deep-unfolded ALD with neural denoising, trained via zero-order gradients and active sampling (for generalization), enabling fast, robust tuning in high-dimensional, practical wireless environments (Shaked et al., 21 Oct 2025).

6. Structural Guarantees and Functional Inequalities

The well-posedness and efficiency of annealed Langevin processes are intimately connected to functional-inequality properties:

  • Existence, uniqueness, and ergodicity of the process flow require uniform Poincaré (and when possible, logarithmic Sobolev) inequalities along the path of intermediate, "noised" conditional distributions (Cattiaux et al., 13 Nov 2025).
  • If a uniform log-Sobolev constant can be established across the annealing path, exponential mixing and a small (e.g., O(κ)) bias are achieved in KL divergence; otherwise, only a polynomial or O(κ) guarantee is obtained (Cattiaux et al., 13 Nov 2025, Cordero-Encinar et al., 13 Feb 2025).
  • Recent results clarify that in highly non-log-concave (heavy-tailed or multimodal) settings, annealed Langevin is among the few sampling techniques with provable convergence in polynomial time and explicit error bounds, up to the action constant (Guo et al., 2024).

7. Discretization, Hyperparameters, and Implementation Considerations

Discretization choices affect stability and efficiency:

  • Step-size selection for each annealing level is critical; splitting integrators such as BAOAB are typically employed for underdamped or high-order methods (Zilberstein et al., 2023, Zilberstein et al., 2022, He et al., 2022).
  • For high-dimensional applications, per-level spectral preconditioning, adaptive step-matrices, and warm starts (e.g., from diffusion samples or from previous levels) are essential for practical convergence.
  • Complexity per trajectory is typically O(L · T · cost_score), where L is the number of annealing levels, T is the number of steps per level, and cost_score is the dominant cost of a score evaluation or denoising pass (Zilberstein et al., 2022, Zilberstein et al., 2023). Parallel sampling and model selection among independent trajectories improve reliability.
  • Neural score models are trained with denoising-score matching, typically using geometric or learned step-size and noise-scale schedules (Kameoka et al., 2020, Cordero-Encinar et al., 13 Feb 2025).
  • Discrete sampling in combinatorial or structured spaces leverages annealed smoothing for global exploration and a sharp "projection" at low noise, exploiting the ability of ALD to cross barriers between local minima and lock onto valid configurations efficiently (Zilberstein et al., 2022).
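The per-trajectory cost model and the parallel-trajectory selection heuristic above can be sketched as follows (function names and numbers are illustrative, not from the cited papers):

```python
# Hedged sketch of the O(L * T * cost_score) cost model and the
# model-selection heuristic among independent trajectories; all
# names and values are illustrative.

def trajectory_cost(L, T, cost_score):
    # one score/denoiser call per inner step, per annealing level
    return L * T * cost_score

def select_best(samples, log_p):
    # keep the trajectory endpoint with the highest (approximate) log-density
    return max(samples, key=log_p)

total = trajectory_cost(L=10, T=200, cost_score=1)   # 2000 score calls
best = select_best([0.0, 2.5, -3.05], log_p=lambda x: -(abs(x) - 3.0) ** 2)
# best is the endpoint closest to the target modes at +/-3
```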

In summary, annealed Langevin dynamics constitutes a theoretically grounded and algorithmically flexible framework for sampling, optimization, and generative modeling in complex, high-dimensional, and possibly nonconvex settings. Leveraging explicit annealing schedules, score-based or higher-order extensions, and recent advances in convergence theory, ALD bridges fundamental connections between statistical physics, kinetic theory, and modern machine learning applications (Xun et al., 30 Oct 2025, Cordero-Encinar et al., 13 Feb 2025, Guo et al., 2024, Cattiaux et al., 13 Nov 2025).
