Cosine Noise Schedule in Diffusion Models

Updated 5 January 2026
  • The cosine noise schedule is a mathematically defined protocol that controls noise injection in diffusion models via a cosine-squared decay, ensuring robust learning dynamics and sample fidelity.
  • It derives from information-geometric principles, serving as the Fisher–Rao geodesic optimal schedule to equitably manage the signal-to-noise ratio over diffusion steps.
  • Comparative analyses show the cosine schedule improves convergence speed and sample quality compared to linear and quadratic schedules, especially at high resolutions.

A cosine noise schedule is a mathematically defined protocol for controlling noise injection in the training and sampling phases of diffusion models. Its appeal lies in rigorous connections to information geometry, optimality criteria, a tractable analytic formulation, and robust empirical performance across a variety of tasks and architectures. The schedule specifies how the variance (or equivalently, the signal-to-noise ratio) evolves across discrete or continuous time steps, shaping both the learning dynamics and the fidelity of synthesized samples.

1. Mathematical Formulation

The core of the cosine noise schedule is the definition of the cumulative “signal preservation” parameter $\bar\alpha_t$ and its related variance parameter $\beta_t$. For $T$ total diffusion steps, define

$$f(t) = \cos^2\!\left( \frac{t/T + s}{1+s} \cdot \frac{\pi}{2} \right), \qquad t \in [0, T]$$

where $s > 0$ is a small offset, typically $s \approx 0.008$ to $0.2$, introduced to circumvent singularities and numerical instabilities at the initial time step. The normalized schedule is given by

$$\bar\alpha_t = \frac{f(t)}{f(0)}, \qquad \bar\alpha_0 = 1$$

and the per-step update is

$$\alpha_t = 1 - \beta_t, \qquad \beta_t = 1 - \frac{\bar\alpha_t}{\bar\alpha_{t-1}}$$

This formulation allows a smooth, symmetric decay of the signal-to-noise ratio over time, with the steepest changes centered around the mid-timesteps (Guo et al., 7 Feb 2025, Santos et al., 2023, Strasman et al., 2024).
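The recursion above translates directly into code. A minimal NumPy sketch (the clip at 0.999 is a common stabilization choice in practice, not part of the formula itself):

```python
import numpy as np

def cosine_schedule(T, s=0.008):
    """Cosine noise schedule: returns (alpha_bar, beta).

    alpha_bar has length T + 1 with alpha_bar[0] == 1;
    beta has length T, where beta[k] corresponds to beta_{k+1} above.
    """
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1.0 + s)) * np.pi / 2) ** 2
    alpha_bar = f / f[0]                      # normalize so alpha_bar[0] = 1
    beta = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    beta = np.clip(beta, 0.0, 0.999)          # common numerical safeguard
    return alpha_bar, beta
```

The offset $s$ keeps $f(0)$ away from the cosine's flat region, so the first few $\beta_t$ stay small and well conditioned.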

2. Information-Geometric Optimality

The cosine schedule is not merely heuristic or empirical; it arises as the Fisher–Rao-geodesic optimal schedule in the space of probability distributions induced by forward diffusion. In masked discrete diffusion models,

  • The marginal path $t \mapsto q_t$ lies on the probability simplex.
  • The Fisher–Rao metric $I(t) = \mathbb{E}_{x_t \sim q_t}\!\left[(\partial_t \log q_t(x_t))^2\right]$ quantifies infinitesimal statistical distinguishability. Solving for the minimum path length (traversed at constant “speed”) yields the closed-form solution

$$\alpha(t) = \cos^2\!\left(\tfrac{\pi}{2}\, t\right)$$

and its discretized variant $\alpha_i = \cos^2\!\left(\frac{i\pi}{2T}\right)$ for $i = 0, \dots, T$ (Zhang, 6 Aug 2025, Santos et al., 2023). This information-geometric derivation anchors the schedule in optimal transport and learning efficiency principles.
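The constant-speed property can be checked numerically. In this sketch (our own framing, not from the cited works) we use the fact that on a two-point simplex the Fisher–Rao arc-length coordinate is $\theta = \arccos\sqrt{\alpha}$, so a geodesic schedule should advance $\theta$ by the same amount every step:

```python
import numpy as np

T = 100
i = np.arange(T + 1)
alpha = np.cos(i * np.pi / (2 * T)) ** 2   # discretized cosine schedule

# Fisher-Rao arc-length coordinate on the two-point simplex:
# theta_i = arccos(sqrt(alpha_i)); a geodesic ("constant speed")
# schedule takes equal theta-steps of size pi / (2T).
theta = np.arccos(np.sqrt(alpha))
steps = np.diff(theta)
print(np.allclose(steps, steps[0]))        # → True
```

By contrast, a linear schedule for $\alpha_i$ would produce unequal $\theta$-steps, concentrating statistical motion at the endpoints.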

3. Connections to Ornstein–Uhlenbeck Process

A formal equivalence exists between variance-preserving DDPMs and time-homogeneous OU processes observed at non-uniform times. Viewing the diffusion forward process as OU dynamics,

$$dX_t = -X_t\,dt + \sqrt{2}\,dW_t$$

appropriately chosen observation times $t_k$ induce the cosine schedule via Fisher information equalization. In detail, mapping the observation density to

$$\pi(\theta) \propto \frac{1}{\sqrt{1-\theta^2}}, \qquad \theta = e^{-t}$$

and inverting gives

$$\bar\alpha_k = \cos^2\!\left( \frac{k\pi}{2T} \right)$$

This matches the empirical regime where sample quality and learning efficiency are optimal (Santos et al., 2023).
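As an illustrative sketch, under the variance-preserving convention for the OU process above (signal coefficient $e^{-t}$, so $\bar\alpha_k = e^{-2 t_k}$, an identification we make explicit here), choosing observation times $t_k = -\log\cos(k\pi/2T)$ recovers the cosine schedule exactly:

```python
import numpy as np

T = 10
k = np.arange(T)                           # k = T excluded: t_T diverges
# Non-uniform OU observation times induced by theta = exp(-t):
t_k = -np.log(np.cos(k * np.pi / (2 * T)))
# OU signal decay at those times matches the cosine schedule:
alpha_bar_ou = np.exp(-2.0 * t_k)
alpha_bar_cos = np.cos(k * np.pi / (2 * T)) ** 2
print(np.allclose(alpha_bar_ou, alpha_bar_cos))   # → True
```

Note that the final observation time diverges, reflecting that $\bar\alpha_T = 0$ corresponds to the fully stationary (pure-noise) OU distribution.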

4. Comparative Analysis with Alternative Schedules

Other schedules—including linear, quadratic, exponential, sigmoid, Laplace, and Cauchy—exhibit distinctive signal-to-noise decay profiles:

  • Linear spreads noise increase evenly, but places excessive “difficulty” at early steps.
  • Quadratic/exponential concentrate noise at boundaries.
  • Cosine delays challenging denoising to the midpoint, allowing the model to learn trivial tasks first and focusing computational effort on the “difficulty region.”
  • Optimized Laplace and Cauchy schedules, which concentrate mass near $\log\mathrm{SNR} = 0$, have recently shown improved performance over cosine in both convergence speed and final FID (Hang et al., 2024). The cosine schedule, with tuned offset $s$ and exponent $\tau$, nevertheless remains a widely effective, robust, and computationally tractable baseline across resolutions and architectures (Guo et al., 7 Feb 2025, Strasman et al., 2024).
| Schedule type | Noise concentration | Empirical quality (FID) |
|---|---|---|
| Linear | Uniform | Degraded at high resolution |
| Cosine ($s = 0.2$, $\tau = 2$) | Midpoint | Improved at $256^2$ and above |
| Laplace | Centered near $\log\mathrm{SNR} = 0$ | Superior (best at CFG = 3.0) |
| Cauchy | Mid-to-high SNR | Comparable or better |
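The contrast between the linear and cosine rows can be made concrete by measuring how many steps each schedule spends near $\log\mathrm{SNR} = 0$. In this sketch, the classic DDPM linear range $\beta \in [10^{-4},\, 0.02]$ and the threshold $|\log\mathrm{SNR}| < 2$ are illustrative choices of ours:

```python
import numpy as np

T = 1000

# Linear beta schedule (classic DDPM range, for illustration)
beta_lin = np.linspace(1e-4, 0.02, T)
ab_lin = np.cumprod(1.0 - beta_lin)

# Cosine schedule with offset s
s = 0.008
t = np.arange(1, T + 1)
f = np.cos(((t / T + s) / (1.0 + s)) * np.pi / 2) ** 2
ab_cos = f / np.cos((s / (1.0 + s)) * np.pi / 2) ** 2

def frac_mid_snr(ab, thresh=2.0):
    """Fraction of steps with |log SNR| below thresh."""
    ab = np.clip(ab, 1e-12, 1.0 - 1e-12)
    logsnr = np.log(ab / (1.0 - ab))
    return np.mean(np.abs(logsnr) < thresh)

print(frac_mid_snr(ab_lin), frac_mid_snr(ab_cos))
```

The cosine schedule allocates a noticeably larger fraction of its steps to the mid-SNR “difficulty region” than the linear baseline, consistent with the table above.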

5. Empirical Effects and Performance

Extensive evaluation reveals distinct advantages:

  • Convergence speed: Cosine and Laplace schedules reach target FID in fewer iterations; Laplace accelerates even further (Hang et al., 2024).
  • Sample quality: Cosine produces sharper, more uniform samples across time steps compared to linear, especially at high resolutions (Guo et al., 7 Feb 2025, Strasman et al., 2024).
  • Robustness: Benefits accrue independently of prediction target (noise, data, or “velocity”) within the model.
  • Tuning: Optimal cosine offsets (e.g., $s \sim 0.01$–$0.2$) and exponents yield empirical FID improvements, with adaptive tuning algorithms lowering FID/KL error by 10–30% versus fixed schedules (Strasman et al., 2024).
  • Numerical stability: Small initial $\beta_t$ and a smooth slope avoid gradient blow-up and overfitting at very small noise levels.

6. Practical Guidelines for Implementation and Tuning

Recommended procedures include:

  • Use a small offset $s$ (e.g., $s = 0.008$ for typical image sizes, up to $s = 0.2$ for very high resolutions).
  • For the largest images, or when instability appears at early steps, consider sigmoid or Laplace schedules as alternatives.
  • Monitor surrogate upper bounds $L(s, \theta)$ for tuning, and cross-reference held-out FID/KL metrics for convergence (Strasman et al., 2024, Guo et al., 7 Feb 2025).
  • When possible, employ adaptive gradient-based tuning for ss or Laplace scale parameters to further reduce sample error.
  • Always compare against a tuned linear baseline to validate practical improvements.
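Putting the guidelines together, a minimal forward-corruption sketch (function names such as `q_sample` are ours, not from any particular library):

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """Cumulative signal-preservation schedule alpha_bar[0..T]."""
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1.0 + s)) * np.pi / 2) ** 2
    return f / f[0]

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ N(sqrt(ab_t) * x0, (1 - ab_t) * I); return (x_t, eps)."""
    ab = alpha_bar[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps, eps

rng = np.random.default_rng(0)
ab = cosine_alpha_bar(T=1000, s=0.008)
x0 = rng.standard_normal((4, 8))          # stand-in for a data batch
xt, eps = q_sample(x0, t=500, alpha_bar=ab, rng=rng)
```

The `(x_t, eps)` pair is exactly what a noise-prediction training loop needs: the network takes `x_t` and `t` and regresses onto `eps`.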

7. Current Advances and Theoretical Extensions

While the cosine schedule has been empirically successful, recent theoretical and experimental work emphasizes importance sampling in $\log\mathrm{SNR}$ space. For instance, Laplace-centered schedules, which increase sampling frequency near $\log\mathrm{SNR} = 0$, yield improved convergence and robustness, particularly on large-scale benchmarks such as ImageNet. This shift in focus recognizes that sub-tasks at mid-range SNR contribute the most informative gradients, and reallocation of sampling density outperforms simple loss reweighting. Empirical ablations confirm superior FID at both $256^2$ and $512^2$ resolution under these importance-sampled schedules (Hang et al., 2024).
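An illustrative sketch of importance sampling in $\log\mathrm{SNR}$ space (the scale $b = 0.5$ below is a hypothetical value chosen for the example, not taken from the cited work): draw $\lambda = \log\mathrm{SNR}$ from a Laplace density centered at zero, then map it to a noise level via the identity $\bar\alpha = \sigma(\lambda)$.

```python
import numpy as np

def sample_alpha_bar_laplace(n, mu=0.0, b=0.5, rng=None):
    """Draw n training noise levels with logSNR ~ Laplace(mu, b).

    Uses SNR = alpha_bar / (1 - alpha_bar), i.e. alpha_bar =
    sigmoid(logSNR), so sampling logSNR peaked at mu = 0
    concentrates noise levels around alpha_bar = 0.5.
    """
    rng = rng or np.random.default_rng()
    lam = rng.laplace(mu, b, size=n)       # logSNR samples, peaked at mu
    return 1.0 / (1.0 + np.exp(-lam))      # alpha_bar = sigmoid(lam)
```

Compared with sweeping a fixed time grid, this concentrates training compute on the mid-SNR region that the ablations identify as most informative.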

In summary, the cosine noise schedule represents a theoretically justified, empirically robust, and computationally tractable protocol for noise control in diffusion models. Its analytic form, geometric optimality, and proven performance profile make it a standard in generative modeling, although recent variants such as Laplace and Cauchy schedules provide appealing improvements in constrained regimes.
