
COOL-SD: Annealed Relaxation in Speculative Decoding

Updated 21 January 2026
  • COOL-SD is a speculative decoding method that combines theoretical analysis, optimal resampling, and an annealing-based acceptance schedule to accelerate autoregressive image generation.
  • It utilizes an exponentially decaying acceptance function to control the trade-off between speed and fidelity, ensuring minimal bias through an optimal correction kernel.
  • Empirical evaluations demonstrate that COOL-SD reduces latency and outperforms existing methods like LANTERN++ by achieving higher throughput with comparable output quality.

COOL-SD ("Annealed Relaxation of Speculative Decoding") is a method for accelerating autoregressive (AR) image generation by unifying theoretical analysis, optimal resampling, and an annealing-based acceptance schedule in speculative decoding. COOL-SD arises in the context of AR decoders, where sequential sampling of image tokens leads to significant inference latency. The method is designed to maintain high sample fidelity while significantly increasing decoding throughput, specifically outperforming existing relaxed speculative decoding approaches in both speed and output quality (Li et al., 14 Jan 2026).

1. Problem Setting and Motivation

Autoregressive image generators factor the joint distribution of tokens $x_{1:T}$ as $P(x_{1:T}) = \prod_{t=1}^{T} P(x_t \mid x_{1:t-1})$. Each generation step entails a full forward pass, which incurs prohibitive latency when $T$ is large. Speculative decoding (SD) addresses this through a two-model paradigm: a slow, high-quality target model $P$ and a faster, lower-quality draft model $Q$ trained to mimic $P$. SD proceeds by drafting $L$ candidate tokens with $Q$, then verifying and potentially accepting them via $P$ in parallel. In lossless SD, a drafted token $x_i$ is accepted with probability $\min\{1, P(x_i \mid x_{1:i-1}) / Q(x_i \mid x_{1:i-1})\}$, a modified rejection-sampling criterion; acceptance rates are often low because image token distributions are heavy-tailed and ambiguous.
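The lossless verify step is just the ratio test $\min\{1, P(x)/Q(x)\}$ applied to each drafted token. A minimal sketch with toy categorical distributions (all names and numbers hypothetical, not the paper's models):

```python
import numpy as np

def accept_prob(p, q, x):
    """Lossless SD acceptance probability for draft token x: min(1, P(x)/Q(x))."""
    return min(1.0, p[x] / q[x])

rng = np.random.default_rng(0)

# Toy 4-token vocabulary; q is the draft model's imperfect guess at p.
p = np.array([0.50, 0.30, 0.15, 0.05])
q = np.array([0.40, 0.40, 0.10, 0.10])

x = int(rng.choice(4, p=q))                      # draft a token from Q
accepted = rng.random() < accept_prob(p, q, x)   # verify against P
```

Tokens the draft over-represents (here tokens 1 and 3, where $Q > P$) are exactly the ones at risk of rejection.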

Recent "relaxed" SD variants (e.g., Lantern/Lantern++) improve throughput by relaxing the acceptance criterion, but lack theoretical control over the induced distributional bias. COOL-SD addresses this gap directly by deriving an explicit bound on the total variation (TV) distance between the relaxed output and PP, characterizing the optimal correction distribution, and introducing an annealing-based schedule to further minimize bias at fixed throughput.

2. Theoretical Analysis and TV Distance Bound

COOL-SD formalizes relaxed speculative decoding by introducing tokenwise acceptance functions $f_i(x_{1:i}) \in [0,1]$ and tokenwise correction kernels $G_i(\cdot \mid x_{1:i-1})$. The fidelity loss is measured via the TV distance between the extended distribution generated by the relaxed SD protocol, denoted $\hat{Q}$, and the target's extended distribution $\hat{P}$:

$$\mathrm{TV}(\hat{Q}, \hat{P}) = \frac{1}{2} \sum_{x_{1:L+1}} \left| \hat{Q}(x_{1:L+1}) - \hat{P}(x_{1:L+1}) \right|.$$

Theorem 3.1 provides a nearly tight upper bound for general acceptance and resampling schedules, showing that the bound depends on the cumulative difference between the (possibly relaxed) draft and target distributions along with the corrective mass "re-injected" by $G_i$. This characterization enables, for the first time, quantitative control over the speed–fidelity trade-off in relaxed SD.
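For finite-support distributions the TV distance above is half the L1 distance; a minimal helper (the function name is ours):

```python
import numpy as np

def tv_distance(q_hat, p_hat):
    """Total variation distance: half the L1 distance between two
    probability vectors defined on the same (finite) support."""
    q_hat, p_hat = np.asarray(q_hat, float), np.asarray(p_hat, float)
    return 0.5 * np.abs(q_hat - p_hat).sum()
```

For the extended distributions $\hat{Q}, \hat{P}$ the sum runs over all length-$(L+1)$ token sequences, so in practice one works with the bound of Theorem 3.1 rather than exact enumeration.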

The key insight is that, for any given acceptance schedule, the optimal correction distribution at each token $i+1$ is

$$G^*_{i+1}(x \mid x_{1:i}) = \frac{\left[P(x \mid x_{1:i}) - Q(x \mid x_{1:i})\, f_{i+1}(x_{1:i}, x)\right]_+}{\sum_y \left[P(y \mid x_{1:i}) - Q(y \mid x_{1:i})\, f_{i+1}(x_{1:i}, y)\right]_+},$$

where $[\cdot]_+$ denotes the positive part and the normalization ensures $G^*$ is a valid distribution.
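Given next-token vectors for $P$ and $Q$ and the acceptance probabilities $f$, the optimal kernel is a clipped, renormalized residual. A sketch; the fallback to $P$ when the residual vanishes (i.e., $Q \cdot f$ already covers $P$) is our convention, not specified in the source:

```python
import numpy as np

def optimal_correction(p, q, f):
    """TV-optimal correction kernel G*: Norm([P - Q*f]_+).
    p, q: next-token distributions over the vocabulary;
    f: per-token acceptance probabilities f_{i+1}(x_{1:i}, .)."""
    residual = np.maximum(p - q * f, 0.0)
    z = residual.sum()
    if z == 0.0:   # Q*f already covers P; fall back to P (our convention)
        return np.asarray(p, float)
    return residual / z

# With f = 1 everywhere this reduces to the vanilla lossless-SD kernel
# [P - Q]_+ / Norm, as noted in Section 6.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.3, 0.5, 0.2])
g = optimal_correction(p, q, np.ones(3))
```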

3. Annealing Schedules via Perturbation Analysis

A perturbative analysis (Proposition 3.3) investigates the impact of perturbing the tokenwise acceptance functions $\{f_i\}$ while keeping the expected accepted block length fixed. The analysis demonstrates that increasing the acceptance rates early in the block (i.e., accepting initial tokens with higher probability) while decreasing them late leads to a smaller increase in TV distance than the reverse strategy. Extending this result, an optimally "annealed" acceptance schedule is derived, with token relaxation parameters $\omega_1 \geq \omega_2 \geq \cdots \geq \omega_L$ decreasing monotonically through the drafted block. An exponential schedule of this form minimizes bias for a given compute budget and naturally favors early, higher-confidence acceptances.

4. COOL-SD Algorithm Specification

COOL-SD implements relaxed speculative decoding as follows:

  • Acceptance: For each token $i$ in the candidate block,

$$f_i(x_{1:i}; \omega_i) = \min\left\{1,\ \frac{\omega_i\, P(x_i \mid x_{1:i-1})}{Q(x_i \mid x_{1:i-1})}\right\},$$

with $\omega_i$ decaying exponentially:

$$\omega_i = \delta \exp(-\nu i - \mu), \qquad \mu \ \text{set so that} \ \sum_{i=1}^{L} \exp(-\nu i - \mu) = L,$$

where $\delta \geq 1$ is a "relaxation budget" and $\nu$ is the decay rate.

  • Resampling: The optimal correction kernel is implemented as

$$G_i^*(x \mid x_{1:i-1}) = \mathrm{Norm}\!\left(\left[P(x \mid x_{1:i-1}) - Q(x \mid x_{1:i-1})\, f_i\right]_+\right).$$

  • Algorithmic Loop:
  1. Draft $L$ tokens sequentially from $Q$.
  2. Evaluate $P$ in parallel for all $L$ prefixes.
  3. For $i = 1, \dots, L$: accept $x_i$ with probability $f_i$; upon rejection, resample $x_i$ from $G_i^*$ and terminate the block.
  4. If all $L$ tokens are accepted, sample one additional "bonus" token from $P$.
  5. Append the accepted prefix to the output and repeat until completion.
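The five-step loop can be sketched end to end. `draft_probs` and `target_probs` are hypothetical stand-ins for the draft model $Q$ and target $P$; a real implementation would batch the verification forward passes and cache KV states:

```python
import numpy as np

def annealed_omegas(L, delta, nu):
    """omega_i = delta*exp(-nu*i - mu), with mu set so sum_i exp(-nu*i - mu) = L."""
    e = np.exp(-nu * np.arange(1, L + 1))
    mu = np.log(e.sum() / L)            # normalizer has a closed form
    return delta * e / np.exp(mu)

def cool_sd_block(draft_probs, target_probs, prefix, L, delta, nu, rng):
    """One draft/verify block of COOL-SD (toy sketch).
    draft_probs/target_probs map a token prefix to a next-token distribution."""
    omegas = annealed_omegas(L, delta, nu)
    drafted, q_dists = [], []
    for _ in range(L):                            # 1. draft L tokens from Q
        q = draft_probs(prefix + drafted)
        drafted.append(int(rng.choice(len(q), p=q)))
        q_dists.append(q)
    # 2. evaluate P for all L prefixes (parallel in a real system) + bonus
    p_dists = [target_probs(prefix + drafted[:i]) for i in range(L + 1)]
    out = []
    for i in range(L):                            # 3. annealed accept/reject
        p, q, x = p_dists[i], q_dists[i], drafted[i]
        if rng.random() < min(1.0, omegas[i] * p[x] / q[x]):
            out.append(x)
        else:                                     # resample from G*_i, end block
            f = np.minimum(1.0, omegas[i] * p / q)
            residual = np.maximum(p - q * f, 0.0)
            g = residual / residual.sum() if residual.sum() > 0 else p
            out.append(int(rng.choice(len(g), p=g)))
            return out
    # 4. all L tokens accepted: sample one bonus token from P
    out.append(int(rng.choice(len(p_dists[L]), p=p_dists[L])))
    return out
```

The normalization gives $\sum_i \omega_i = \delta L$, so $\delta$ directly scales the average relaxation while $\nu$ controls how steeply it is front-loaded.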

The parameter $\delta$ modulates aggressiveness: larger $\delta$ means stronger relaxation (lower fidelity, higher speedup). The exponential decay is motivated by the perturbation analysis. The draft model $Q$ is trained by minimizing a distillation loss against $P$ over paired caption–image data.

5. Empirical Evaluation and Comparative Results

COOL-SD was evaluated on two published AR image generators:

  • Lumina-mGPT (7B parameters)
  • LlamaGen-XL (775M parameters)

5,000 MS-COCO validation captions were used as prompts, with decoding executed on NVIDIA A100 GPUs. The draft model $Q$ is a trained Eagle-1 static tree. Metrics include FID (Fréchet Inception Distance), CLIP-based text–image score, ImageReward (a learned ranker for human preference), mean accepted block length, and latency.

Results for Lumina-mGPT (7B):

| Method | FID ↓ | CLIP ↑ | Accepted Len ↑ | Latency (s) ↓ | Speed-up ↑ |
|---|---|---|---|---|---|
| Target (no SD) | 28.99 | 0.3330 | 1.00 | 170.14 | 1.00× |
| Eagle-1 | 29.05 | 0.3330 | 2.76 | 71.66 | 2.37× |
| LANTERN++ (λ=2, k=10) | 30.31 | 0.3328 | 2.99 | 68.64 | 2.48× |
| COOL-SD (δ=1.1) | 30.30 | 0.3325 | 3.11 | 63.24 | 2.69× |

Results for LlamaGen-XL (775M):

| Method | FID ↓ | CLIP ↑ | Accepted Len ↑ | Latency (s) ↓ | Speed-up ↑ |
|---|---|---|---|---|---|
| Target (no SD) | 21.08 | 0.3162 | 1.00 | 10.11 | 1.00× |
| Eagle-1 | 20.97 | 0.3157 | 2.42 | 4.99 | 2.03× |
| LANTERN++ | 21.17 | 0.3157 | 2.67 | 4.70 | 2.15× |
| COOL-SD (δ=1.1) | 21.02 | 0.3167 | 2.73 | 4.46 | 2.27× |
| COOL-SD (δ=2.0) | 21.20 | 0.3154 | 3.34 | 3.72 | 2.72× |

Key observations:

  • For fixed FID, COOL-SD yields longer accepted sequences and lower latency than LANTERN++.
  • Varying $\delta$ enables precise control of the speed–fidelity frontier, with speedups up to 3.7× without catastrophic quality loss.
  • The exponential annealing schedule outperforms uniform relaxations, and substituting LANTERN++'s heuristic $G$ with the TV-optimal $G^*$ yields further improvements.
  • At comparable speedups, COOL-SD maintains sharp, semantically accurate outputs, whereas other relaxed methods begin to degrade.

6. Relationship to Prior Work and Implications

COOL-SD generalizes and subsumes existing relaxed speculative decoding approaches by providing tight theoretical guarantees on output bias (as measured by TV distance) and specifying optimal correction rules. When restricted to $f_i(x_{1:i}) = \min\{1, P/Q\}$ (i.e., $\omega_i = 1$), the vanilla (lossless) SD kernel $G_{i+1}^{\mathrm{van}}$ is recovered as a special case. The method provides a plug-and-play improvement for any system employing speculative decoding where model mimicry and acceptance-rate bottlenecks are encountered. A plausible implication is that the annealed acceptance technique could transfer to other sequence-modeling domains with similar compositional structure.

7. Practical Considerations and Future Outlook

COOL-SD's design allows simple control via two hyperparameters: $\delta$ (relaxation budget) and $\nu$ (annealing rate), with a fixed $\nu \approx 0.7$ and $\delta \in [1.1, 2.0]$ covering most practical settings. Selecting these parameters requires minimal tuning for deployment, and offline computation of the normalization constant $\mu$ further streamlines adaptation. The method's efficacy with both large (Lumina-mGPT) and modest-sized (LlamaGen-XL) models on standard benchmarks demonstrates its versatility. While developed for AR image generation, the underlying mathematics holds for AR token sequences generally, suggesting broad applicability wherever speculative decoding and fidelity/speed trade-offs are relevant (Li et al., 14 Jan 2026).

