COOL-SD: Annealed Relaxation in Speculative Decoding
- COOL-SD is a speculative decoding method that combines theoretical analysis, optimal resampling, and an annealing-based acceptance schedule to accelerate autoregressive image generation.
- It utilizes an exponentially decaying acceptance function to control the trade-off between speed and fidelity, ensuring minimal bias through an optimal correction kernel.
- Empirical evaluations demonstrate that COOL-SD reduces latency and outperforms existing methods like LANTERN++ by achieving higher throughput with comparable output quality.
COOL-SD ("Annealed Relaxation of Speculative Decoding") is a method for accelerating autoregressive (AR) image generation by unifying theoretical analysis, optimal resampling, and an annealing-based acceptance schedule in speculative decoding. COOL-SD arises in the context of AR decoders, where sequential sampling of image tokens leads to significant inference latency. The method is designed to maintain high sample fidelity while significantly increasing decoding throughput, specifically outperforming existing relaxed speculative decoding approaches in both speed and output quality (Li et al., 14 Jan 2026).
1. Problem Setting and Motivation
Autoregressive image generators factor the joint distribution of tokens as $p(x_{1:N}) = \prod_{t=1}^{N} p(x_t \mid x_{<t})$. Each generation step entails a forward pass, which incurs prohibitive latency when $N$ is large. Speculative Decoding (SD) addresses this through a two-model paradigm: a slow, high-quality target model $p$ and a faster, lower-quality draft model $q$ trained to mimic $p$. SD proceeds by drafting candidate tokens with $q$, then verifying and potentially accepting them via $p$ in parallel. In lossless SD, only tokens that perfectly match $p$'s likely outputs under a Metropolis-Hastings-like criterion are accepted, often resulting in low acceptance rates due to the heavy-tailed, ambiguous nature of image token distributions.
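As a concrete baseline, the lossless accept/resample step can be sketched over a toy categorical vocabulary (a minimal NumPy sketch; `vanilla_sd_step` and the fixed seed are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def vanilla_sd_step(p, q):
    """One lossless speculative-decoding verify step for a single token.

    p, q: target and draft next-token distributions (1-D probability arrays).
    Returns a token index whose marginal distribution is exactly p.
    """
    x = rng.choice(len(q), p=q)               # draft proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):  # accept with prob min(1, p/q)
        return x
    residual = np.maximum(p - q, 0.0)         # rejected: resample from [p - q]_+ / Z
    return rng.choice(len(p), p=residual / residual.sum())
```

Over many calls the empirical output distribution matches `p` exactly; this losslessness is precisely what relaxed variants trade away for longer accepted blocks.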
Recent "relaxed" SD variants (e.g., LANTERN/LANTERN++) improve throughput by relaxing the acceptance criterion, but lack theoretical control over the induced distributional bias. COOL-SD addresses this gap directly by deriving an explicit bound on the total variation (TV) distance between the relaxed output and $p$, characterizing the optimal correction distribution, and introducing an annealing-based schedule to further minimize bias at fixed throughput.
2. Theoretical Analysis and TV Distance Bound
COOL-SD formalizes relaxed speculative decoding by introducing tokenwise acceptance functions $a_t$ and tokenwise correction kernels $r_t$. The fidelity loss is measured via the TV distance $d_{\mathrm{TV}}(\tilde{p}, p)$ between the extended distribution $\tilde{p}$ generated by the relaxed SD protocol and the target $p$. Theorem 3.1 provides a nearly tight upper bound on this distance for general acceptance and resampling schedules, showing that the bound depends on the cumulative difference between the (possibly relaxed) draft and target distributions, along with the corrective mass "re-injected" by $r_t$. This characterization enables, for the first time, quantitative control over the speed–fidelity trade-off in relaxed SD.
The key insight is that, for any given acceptance schedule, the optimal correction distribution at each token is

$$r_t^\star(x) = \frac{\big[\,p_t(x) - a_t(x)\,q_t(x)\,\big]_+}{Z_t},$$

where $[\cdot]_+$ denotes the positive part and the normalization constant $Z_t = \sum_{x'} \big[p_t(x') - a_t(x')\,q_t(x')\big]_+$ ensures $r_t^\star$ is a valid distribution.
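A minimal NumPy sketch of this optimal correction distribution and of the single-token output law a relaxed accept/resample step induces (function names are hypothetical; a multiplicative acceptance form is assumed for the demo):

```python
import numpy as np

def optimal_correction(p, q, a):
    """TV-optimal correction kernel: r*(x) proportional to [p(x) - a(x) q(x)]_+."""
    residual = np.maximum(p - a * q, 0.0)
    return residual / residual.sum()

def induced_distribution(p, q, a):
    """Single-token output law of relaxed SD: accepted mass a(x) q(x),
    plus the total rejected mass re-injected through the correction kernel."""
    accept_mass = a * q
    reject_prob = 1.0 - accept_mass.sum()
    return accept_mass + reject_prob * optimal_correction(p, q, a)
```

With the lossless acceptance `a = min(1, p/q)` the induced distribution equals `p` exactly; inflating the acceptance (e.g., `min(1, 1.5 * p/q)`) raises the accepted mass at the cost of a strictly positive TV gap.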
3. Annealing Schedules via Perturbation Analysis
A perturbative analysis (Proposition 3.3) investigates the impact of perturbing the tokenwise acceptance functions while keeping the expected accepted block length fixed. It shows that raising acceptance rates early in the block (i.e., accepting more initial tokens with higher probability) while lowering them late increases the TV distance less than the reverse strategy. Extending this result, an optimally "annealed" acceptance schedule is derived, with token relaxation parameters $\lambda_t$ decreasing monotonically through the drafted block. This exponential schedule is found to minimize bias for a given compute budget and naturally favors early, higher-confidence acceptances.
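A small sketch of such a monotonically decaying schedule, assuming the exponential form used by the algorithm in Section 4 (`annealed_lambdas` is an illustrative helper, not the paper's code):

```python
import numpy as np

def annealed_lambdas(K, delta, gamma):
    """Relaxation multipliers lambda_t = 1 + delta * exp(-gamma * (t - 1)) for
    t = 1..K: aggressive relaxation early in the block, near-lossless by the end."""
    t = np.arange(1, K + 1)
    return 1.0 + delta * np.exp(-gamma * (t - 1))
```

The multipliers start at $1 + \delta$ and decay toward $1$, so late tokens are verified almost losslessly, in line with the perturbation result above.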
4. COOL-SD Algorithm Specification
COOL-SD implements relaxed speculative decoding as follows:
- Acceptance: For each drafted token $x_t$ in the candidate block, accept with probability

$$a_t(x_t) = \min\!\Big(1,\; \lambda_t\,\frac{p_t(x_t)}{q_t(x_t)}\Big),$$

with $\lambda_t$ decaying exponentially:

$$\lambda_t = 1 + \delta\, e^{-\gamma\,(t-1)},$$

where $\delta$ is a "relaxation budget" and $\gamma$ is the decay rate.
- Resampling: The optimal correction kernel is implemented as $r_t^\star(x) \propto \big[p_t(x) - a_t(x)\,q_t(x)\big]_+$, normalized by the constant $Z_t$.
- Algorithmic Loop:
- Draft $K$ tokens $x_1, \dots, x_K$ sequentially from $q$.
- Evaluate $p$ in parallel for all drafted prefixes.
- For $t = 1, \dots, K$: accept $x_t$ with probability $a_t(x_t)$; upon rejection, resample from $r_t^\star$ and terminate the block.
- If all $K$ tokens are accepted, sample one additional "bonus" token from $p$.
- Append the accepted prefix to the output and repeat until completion.
The parameter $\delta$ modulates aggressiveness: a larger $\delta$ yields more relaxation (lower fidelity, higher speedup). The exponential decay of $\lambda_t$ is motivated by the perturbation analysis. The draft model $q$ is trained by minimizing a distillation loss against $p$ over paired caption–image data.
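Putting the pieces together, one draft/verify round can be sketched end-to-end on toy models (a hedged sketch: `p_fn`/`q_fn` and all names are illustrative, verification is written sequentially for clarity, and real implementations use batched tree drafting):

```python
import numpy as np

rng = np.random.default_rng(0)

def cool_sd_block(p_fn, q_fn, prefix, K, delta=1.1, gamma=0.5):
    """One COOL-SD round: draft K tokens from q, verify against p with an
    annealed relaxed acceptance, and correct rejections via the TV-optimal kernel.
    p_fn / q_fn map a token prefix to a next-token distribution."""
    # 1. Draft K candidate tokens sequentially from the draft model q.
    drafted, ctx = [], list(prefix)
    for _ in range(K):
        qd = q_fn(ctx)
        x = int(rng.choice(len(qd), p=qd))
        drafted.append(x)
        ctx.append(x)
    # 2. Verify token by token (done in parallel in a real system).
    ctx, out = list(prefix), []
    for t, x in enumerate(drafted, start=1):
        p, q = p_fn(ctx), q_fn(ctx)
        lam = 1.0 + delta * np.exp(-gamma * (t - 1))   # annealed relaxation
        a = np.minimum(1.0, lam * p / q)               # relaxed acceptance
        if rng.random() < a[x]:
            out.append(x)
            ctx.append(x)
            continue
        residual = np.maximum(p - a * q, 0.0)          # TV-optimal correction
        out.append(int(rng.choice(len(p), p=residual / residual.sum())))
        return out                                     # rejection ends the block
    # 3. All K accepted: one "bonus" token straight from the target model.
    pd = p_fn(ctx)
    out.append(int(rng.choice(len(pd), p=pd)))
    return out
```

Each round therefore emits between 1 token (immediate rejection plus corrective resample) and K + 1 tokens (full acceptance plus bonus token).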
5. Empirical Evaluation and Comparative Results
COOL-SD was evaluated on two published AR image generators:
- Lumina-mGPT (7B parameters)
- LlamaGen-XL (775M parameters)
5,000 MS-COCO validation captions were used as prompts, with decoding executed on NVIDIA A100 GPUs. The draft model is a trained Eagle-1 drafter with a static draft tree. Metrics include FID (Fréchet Inception Distance), CLIP text–image similarity, ImageReward (a learned ranker of human preference), mean accepted block length, and latency.
Results for Lumina-mGPT (7B):
| Method | FID ↓ | CLIP ↑ | Accepted Len ↑ | Latency/s ↓ | Speed-up × ↑ |
|---|---|---|---|---|---|
| Target (no SD) | 28.99 | 0.3330 | 1.00 | 170.14 | 1.00 |
| Eagle-1 | 29.05 | 0.3330 | 2.76 | 71.66 | 2.37 |
| LANTERN++ (λ=2, k=10) | 30.31 | 0.3328 | 2.99 | 68.64 | 2.48 |
| COOL-SD (δ=1.1) | 30.30 | 0.3325 | 3.11 | 63.24 | 2.69 |
Results for LlamaGen-XL (775M):
| Method | FID ↓ | CLIP ↑ | Accepted Len ↑ | Latency/s ↓ | Speed-up × ↑ |
|---|---|---|---|---|---|
| Target (no SD) | 21.08 | 0.3162 | 1.00 | 10.11 | 1.00 |
| Eagle-1 | 20.97 | 0.3157 | 2.42 | 4.99 | 2.03 |
| LANTERN++ | 21.17 | 0.3157 | 2.67 | 4.70 | 2.15 |
| COOL-SD (δ=1.1) | 21.02 | 0.3167 | 2.73 | 4.46 | 2.27 |
| COOL-SD (δ=2.0) | 21.20 | 0.3154 | 3.34 | 3.72 | 2.72 |
Key observations:
- For fixed FID, COOL-SD yields longer accepted sequences and lower latency than LANTERN++.
- Varying $\delta$ enables precise control of the speed–fidelity frontier, with speedups up to 3.7× without catastrophic quality loss.
- The exponential annealing schedule outperforms uniform relaxations, and substituting LANTERN++'s heuristic correction with the TV-optimal $r_t^\star$ yields further improvements.
- At comparable speedups, COOL-SD maintains sharp, semantically accurate outputs, whereas other relaxed methods begin to degrade.
6. Relationship to Prior Work and Implications
COOL-SD generalizes and subsumes existing relaxed speculative decoding approaches by providing tight theoretical guarantees on output bias (as measured by TV distance) and specifying optimal correction rules. When the relaxation is switched off ($\lambda_t \equiv 1$), the vanilla (lossless) SD kernel is recovered as a special case. The method provides a plug-and-play improvement for any system employing speculative decoding where model mimicry and acceptance-rate bottlenecks are encountered. A plausible implication is that the annealed acceptance technique could transfer to other sequence-modeling domains with similar compositional structure.
7. Practical Considerations and Future Outlook
COOL-SD's design allows for simple control via two hyperparameters: $\delta$ (relaxation budget) and $\gamma$ (annealing rate), with a fixed $\gamma$ and $\delta$ in a narrow range (the evaluations use $\delta = 1.1$ and $\delta = 2.0$) covering most practical settings. Selection of these parameters requires minimal tuning for deployment, and the offline computation of the normalization constants $Z_t$ further streamlines adaptation. The method's efficacy with both large (Lumina-mGPT) and modest-sized (LlamaGen-XL) models on standard benchmarks demonstrates its versatility. While developed for AR image generation, the underlying analysis holds for AR token sequences generally, suggesting broad applicability wherever speculative decoding and fidelity–speed trade-offs are relevant (Li et al., 14 Jan 2026).