
COOL-SD: Annealed Relaxation in Speculative Decoding

Updated 21 January 2026
  • COOL-SD is a speculative decoding method that combines theoretical analysis, optimal resampling, and an annealing-based acceptance schedule to accelerate autoregressive image generation.
  • It utilizes an exponentially decaying acceptance function to control the trade-off between speed and fidelity, ensuring minimal bias through an optimal correction kernel.
  • Empirical evaluations demonstrate that COOL-SD reduces latency and outperforms existing methods like LANTERN++ by achieving higher throughput with comparable output quality.

COOL-SD ("Annealed Relaxation of Speculative Decoding") is a method for accelerating autoregressive (AR) image generation by unifying theoretical analysis, optimal resampling, and an annealing-based acceptance schedule in speculative decoding. COOL-SD arises in the context of AR decoders, where sequential sampling of image tokens leads to significant inference latency. The method is designed to maintain high sample fidelity while significantly increasing decoding throughput, specifically outperforming existing relaxed speculative decoding approaches in both speed and output quality (Li et al., 14 Jan 2026).

1. Problem Setting and Motivation

Autoregressive image generators factor the joint distribution of tokens $x_{1:T}$ as $P(x_{1:T}) = \prod_{t=1}^{T} P(x_t \mid x_{1:t-1})$. Each generation step entails a full forward pass, which incurs prohibitive latency when $T$ is large. Speculative decoding (SD) addresses this through a two-model paradigm: a slow, high-quality target model $P$ and a faster, lower-quality draft model $Q$ trained to mimic $P$. SD proceeds by drafting $L$ candidate tokens with $Q$, then verifying and potentially accepting them via $P$ in parallel. In lossless SD, a drafted token $x_i$ is accepted with probability $\min\{1, P(x_i \mid x_{1:i-1}) / Q(x_i \mid x_{1:i-1})\}$, a modified rejection-sampling criterion; acceptance rates are often low because image token distributions are heavy-tailed and ambiguous.
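The lossless verify step is just the ratio test $\min\{1, P(x)/Q(x)\}$ applied to each drafted token. A minimal sketch with toy categorical distributions (all names and numbers hypothetical, not the paper's models):

```python
import numpy as np

def accept_prob(p, q, x):
    """Lossless SD acceptance probability for draft token x: min(1, P(x)/Q(x))."""
    return min(1.0, p[x] / q[x])

rng = np.random.default_rng(0)

# Toy 4-token vocabulary; q is the draft model's imperfect guess at p.
p = np.array([0.50, 0.30, 0.15, 0.05])
q = np.array([0.40, 0.40, 0.10, 0.10])

x = int(rng.choice(4, p=q))                      # draft a token from Q
accepted = rng.random() < accept_prob(p, q, x)   # verify against P
```

Tokens the draft over-represents (here tokens 1 and 3, where $Q > P$) are exactly the ones at risk of rejection.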

Recent "relaxed" SD variants (e.g., Lantern/Lantern++) improve throughput by relaxing the acceptance criterion, but lack theoretical control over the induced distributional bias. COOL-SD addresses this gap directly by deriving an explicit bound on the total variation (TV) distance between the relaxed output and PP, characterizing the optimal correction distribution, and introducing an annealing-based schedule to further minimize bias at fixed throughput.

2. Theoretical Analysis and TV Distance Bound

COOL-SD formalizes relaxed speculative decoding by introducing tokenwise acceptance functions $f_i(x_{1:i}) \in [0,1]$ and tokenwise correction kernels $G_i(\cdot \mid x_{1:i-1})$. The fidelity loss is measured via the TV distance between the extended distribution generated by the relaxed SD protocol, denoted $\hat{Q}$, and the target's extended distribution $\hat{P}$:

$$\mathrm{TV}(\hat{Q}, \hat{P}) = \frac{1}{2} \sum_{x_{1:L+1}} \left| \hat{Q}(x_{1:L+1}) - \hat{P}(x_{1:L+1}) \right|.$$

Theorem 3.1 provides a nearly tight upper bound for general acceptance and resampling schedules, showing that the bound depends on the cumulative difference between the (possibly relaxed) draft and target distributions along with the corrective mass "re-injected" by $G_i$. This characterization enables, for the first time, quantitative control over the speed–fidelity trade-off in relaxed SD.
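For finite-support distributions the TV distance above is half the L1 distance; a minimal helper (the function name is ours):

```python
import numpy as np

def tv_distance(q_hat, p_hat):
    """Total variation distance: half the L1 distance between two
    probability vectors defined on the same (finite) support."""
    q_hat, p_hat = np.asarray(q_hat, float), np.asarray(p_hat, float)
    return 0.5 * np.abs(q_hat - p_hat).sum()
```

For the extended distributions $\hat{Q}, \hat{P}$ the sum runs over all length-$(L+1)$ token sequences, so in practice one works with the bound of Theorem 3.1 rather than exact enumeration.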

The key insight is that, for any given acceptance schedule, the optimal correction distribution at each token $i+1$ is

$$G^*_{i+1}(x \mid x_{1:i}) = \frac{\left[P(x \mid x_{1:i}) - Q(x \mid x_{1:i})\, f_{i+1}(x_{1:i}, x)\right]_+}{\sum_y \left[P(y \mid x_{1:i}) - Q(y \mid x_{1:i})\, f_{i+1}(x_{1:i}, y)\right]_+},$$

where $[\cdot]_+$ denotes the positive part and the normalization ensures $G^*$ is a valid distribution.
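Given next-token vectors for $P$ and $Q$ and the acceptance probabilities $f$, the optimal kernel is a clipped, renormalized residual. A sketch; the fallback to $P$ when the residual vanishes (i.e., $Q \cdot f$ already covers $P$) is our convention, not specified in the source:

```python
import numpy as np

def optimal_correction(p, q, f):
    """TV-optimal correction kernel G*: Norm([P - Q*f]_+).
    p, q: next-token distributions over the vocabulary;
    f: per-token acceptance probabilities f_{i+1}(x_{1:i}, .)."""
    residual = np.maximum(p - q * f, 0.0)
    z = residual.sum()
    if z == 0.0:   # Q*f already covers P; fall back to P (our convention)
        return np.asarray(p, float)
    return residual / z

# With f = 1 everywhere this reduces to the vanilla lossless-SD kernel
# [P - Q]_+ / Norm, as noted in Section 6.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.3, 0.5, 0.2])
g = optimal_correction(p, q, np.ones(3))
```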

3. Annealing Schedules via Perturbation Analysis

A perturbative analysis (Proposition 3.3) investigates the impact of perturbing the tokenwise acceptance functions $\{f_i\}$ while keeping the expected accepted block length fixed. The analysis demonstrates that increasing the acceptance rates early in the block (i.e., accepting initial tokens with higher probability) while decreasing them late leads to a smaller increase in TV distance than the reverse strategy. Extending this result, an optimally "annealed" acceptance schedule is derived, with token relaxation parameters $\omega_1 \geq \omega_2 \geq \cdots \geq \omega_L$ decreasing monotonically through the drafted block. An exponential schedule of this form minimizes bias for a given compute budget and naturally favors early, higher-confidence acceptances.

4. COOL-SD Algorithm Specification

COOL-SD implements relaxed speculative decoding as follows:

  • Acceptance: For each token $i$ in the candidate block,

$$f_i(x_{1:i}; \omega_i) = \min\left\{1,\ \frac{\omega_i\, P(x_i \mid x_{1:i-1})}{Q(x_i \mid x_{1:i-1})}\right\},$$

with $\omega_i$ decaying exponentially:

$$\omega_i = \delta \exp(-\nu i - \mu), \qquad \mu \ \text{set so that} \ \sum_{i=1}^{L} \exp(-\nu i - \mu) = L,$$

where $\delta \geq 1$ is a "relaxation budget" and $\nu$ is the decay rate.

  • Resampling: The optimal correction kernel is implemented as

$$G_i^*(x \mid x_{1:i-1}) = \mathrm{Norm}\!\left(\left[P(x \mid x_{1:i-1}) - Q(x \mid x_{1:i-1})\, f_i\right]_+\right).$$

  • Algorithmic Loop:
  1. Draft $L$ tokens sequentially from $Q$.
  2. Evaluate $P$ in parallel for all $L$ prefixes.
  3. For $i = 1, \dots, L$: accept $x_i$ with probability $f_i$; upon rejection, resample $x_i$ from $G_i^*$ and terminate the block.
  4. If all $L$ tokens are accepted, sample one additional "bonus" token from $P$.
  5. Append the accepted prefix to the output and repeat until completion.
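The five-step loop can be sketched end to end. `draft_probs` and `target_probs` are hypothetical stand-ins for the draft model $Q$ and target $P$; a real implementation would batch the verification forward passes and cache KV states:

```python
import numpy as np

def annealed_omegas(L, delta, nu):
    """omega_i = delta*exp(-nu*i - mu), with mu set so sum_i exp(-nu*i - mu) = L."""
    e = np.exp(-nu * np.arange(1, L + 1))
    mu = np.log(e.sum() / L)            # normalizer has a closed form
    return delta * e / np.exp(mu)

def cool_sd_block(draft_probs, target_probs, prefix, L, delta, nu, rng):
    """One draft/verify block of COOL-SD (toy sketch).
    draft_probs/target_probs map a token prefix to a next-token distribution."""
    omegas = annealed_omegas(L, delta, nu)
    drafted, q_dists = [], []
    for _ in range(L):                            # 1. draft L tokens from Q
        q = draft_probs(prefix + drafted)
        drafted.append(int(rng.choice(len(q), p=q)))
        q_dists.append(q)
    # 2. evaluate P for all L prefixes (parallel in a real system) + bonus
    p_dists = [target_probs(prefix + drafted[:i]) for i in range(L + 1)]
    out = []
    for i in range(L):                            # 3. annealed accept/reject
        p, q, x = p_dists[i], q_dists[i], drafted[i]
        if rng.random() < min(1.0, omegas[i] * p[x] / q[x]):
            out.append(x)
        else:                                     # resample from G*_i, end block
            f = np.minimum(1.0, omegas[i] * p / q)
            residual = np.maximum(p - q * f, 0.0)
            g = residual / residual.sum() if residual.sum() > 0 else p
            out.append(int(rng.choice(len(g), p=g)))
            return out
    # 4. all L tokens accepted: sample one bonus token from P
    out.append(int(rng.choice(len(p_dists[L]), p=p_dists[L])))
    return out
```

The normalization gives $\sum_i \omega_i = \delta L$, so $\delta$ directly scales the average relaxation while $\nu$ controls how steeply it is front-loaded.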

The parameter $\delta$ modulates aggressiveness: larger $\delta$ means stronger relaxation (lower fidelity, higher speedup). The exponential decay is motivated by the perturbation analysis. The draft model $Q$ is trained by minimizing a distillation loss against $P$ over paired caption–image data.

5. Empirical Evaluation and Comparative Results

COOL-SD was evaluated on two published AR image generators:

  • Lumina-mGPT (7B parameters)
  • LlamaGen-XL (775M parameters)

5,000 MS-COCO validation captions were used as prompts, with decoding executed on NVIDIA A100 GPUs. The draft model $Q$ is a trained Eagle-1 static tree. Metrics include FID (Fréchet Inception Distance), CLIP-based text–image score, ImageReward (a learned ranker for human preference), mean accepted block length, and latency.

Results for Lumina-mGPT (7B):

| Method | FID ↓ | CLIP ↑ | Accepted Len ↑ | Latency (s) ↓ | Speed-up ↑ |
|---|---|---|---|---|---|
| Target (no SD) | 28.99 | 0.3330 | 1.00 | 170.14 | 1.00× |
| Eagle-1 | 29.05 | 0.3330 | 2.76 | 71.66 | 2.37× |
| LANTERN++ (λ=2, k=10) | 30.31 | 0.3328 | 2.99 | 68.64 | 2.48× |
| COOL-SD (δ=1.1) | 30.30 | 0.3325 | 3.11 | 63.24 | 2.69× |

Results for LlamaGen-XL (775M):

| Method | FID ↓ | CLIP ↑ | Accepted Len ↑ | Latency (s) ↓ | Speed-up ↑ |
|---|---|---|---|---|---|
| Target (no SD) | 21.08 | 0.3162 | 1.00 | 10.11 | 1.00× |
| Eagle-1 | 20.97 | 0.3157 | 2.42 | 4.99 | 2.03× |
| LANTERN++ | 21.17 | 0.3157 | 2.67 | 4.70 | 2.15× |
| COOL-SD (δ=1.1) | 21.02 | 0.3167 | 2.73 | 4.46 | 2.27× |
| COOL-SD (δ=2.0) | 21.20 | 0.3154 | 3.34 | 3.72 | 2.72× |

Key observations:

  • For fixed FID, COOL-SD yields longer accepted sequences and lower latency than LANTERN++.
  • Varying $\delta$ enables precise control of the speed–fidelity frontier, with speedups up to 3.7× without catastrophic quality loss.
  • The exponential annealing schedule outperforms uniform relaxations, and substituting LANTERN++'s heuristic $G$ with the TV-optimal $G^*$ yields further improvements.
  • At comparable speedups, COOL-SD maintains sharp, semantically accurate outputs, whereas other relaxed methods begin to degrade.

6. Relationship to Prior Work and Implications

COOL-SD generalizes and subsumes existing relaxed speculative decoding approaches by providing tight theoretical guarantees on output bias (as measured by TV distance) and specifying optimal correction rules. When restricted to $f_i(x_{1:i}) = \min\{1, P/Q\}$ (i.e., $\omega_i = 1$), the vanilla (lossless) SD kernel $G_{i+1}^{\mathrm{van}}$ is recovered as a special case. The method provides a plug-and-play improvement for any system employing speculative decoding where model mimicry and acceptance-rate bottlenecks are encountered. A plausible implication is that the annealed acceptance technique could transfer to other sequence-modeling domains with similar compositional structure.

7. Practical Considerations and Future Outlook

COOL-SD's design allows simple control via two hyperparameters: $\delta$ (relaxation budget) and $\nu$ (annealing rate), with a fixed $\nu \approx 0.7$ and $\delta \in [1.1, 2.0]$ covering most practical settings. Selecting these parameters requires minimal tuning for deployment, and offline computation of the normalization constant $\mu$ further streamlines adaptation. The method's efficacy with both large (Lumina-mGPT) and modest-sized (LlamaGen-XL) models on standard benchmarks demonstrates its versatility. While developed for AR image generation, the underlying mathematics holds for AR token sequences generally, suggesting broad applicability wherever speculative decoding and fidelity/speed trade-offs are relevant (Li et al., 14 Jan 2026).

