MuLo-SD: Multi-Scale Local Speculative Decoding
- The paper introduces a hierarchical speculative decoding framework that leverages multi-scale drafting and localized verification to drastically reduce wall-clock latency in autoregressive models.
- MuLo-SD is a parallelized inference system that combines coarse-to-fine drafting with local token verification, ensuring both efficiency and high-fidelity outputs in chain-of-thought reasoning and image synthesis.
- Empirical benchmarks show that MuLo-SD achieves speedups of 1.5–2.1× while maintaining near-baseline performance metrics, demonstrating its potential for scalable generative inference.
Multi-Scale Local Speculative Decoding (MuLo-SD) is a parallelized inference framework designed to accelerate sequence generation in autoregressive (AR) models. By combining multi-scale drafting (coarse-to-fine for images, stepwise for text) with localized verification and resampling mechanisms, MuLo-SD achieves substantially lower wall-clock latency than conventional token-level or single-scale speculative decoding methods. The framework has demonstrated state-of-the-art performance for both chain-of-thought reasoning in LLMs and high-fidelity image synthesis, supported by both theoretical analysis and empirical benchmarks (Fu et al., 24 Jun 2025, Peruzzo et al., 8 Jan 2026).
1. Conceptual Foundations and Motivation
Traditional AR models, whether for text (sequences of tokens) or images (spatial grids of discrete tokens), generate outputs one unit at a time, resulting in $O(T)$ sequential operations for a sequence of length $T$, or $O(HW)$ for an image of size $H \times W$. Standard speculative decoding (SD) accelerates generation via a two-model setup: a lightweight drafter proposes several candidate tokens, and a full (target) model accepts or rejects these proposals in parallel via rejection sampling. However, token-level SD faces a fundamental acceptance-rate bottleneck: the probability of an entire draft being accepted decays exponentially with the draft length, which sharply limits attainable speedup, especially as the sequence grows.
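The acceptance-rate bottleneck can be made concrete with a short calculation. The i.i.d. per-token acceptance model and the names `alpha` (per-token acceptance rate) and `gamma` (draft length) below are illustrative assumptions, not the paper's exact analysis:

```python
def full_draft_accept_prob(alpha: float, gamma: int) -> float:
    """Probability that all gamma drafted tokens are accepted,
    assuming an i.i.d. per-token acceptance rate alpha (a toy model)."""
    return alpha ** gamma

def expected_accepted(alpha: float, gamma: int) -> float:
    """Expected tokens committed per verification round under standard
    SD accounting: geometric acceptance plus the one token the target
    model always contributes."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# Longer drafts quickly stop paying off: with alpha = 0.8, doubling
# gamma from 4 to 8 less than doubles the expected yield while the
# full-draft acceptance probability drops from ~0.41 to ~0.17.
for gamma in (4, 8):
    print(gamma,
          round(full_draft_accept_prob(0.8, gamma), 3),
          round(expected_accepted(0.8, gamma), 3))
```

This is precisely the decay that motivates moving acceptance decisions from exact token matches to coarser, semantically or locally verified units.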
MuLo-SD advances the SD paradigm by incorporating both multi-scale drafting and local (stepwise/spatial) verification. For text, this exploits the hierarchical structure of chain-of-thought reasoning, while for images it leverages spatial redundancy and locality, addressing inefficiencies that arise from raster-scan rejection or failure to utilize multi-resolution structure (Fu et al., 24 Jun 2025, Peruzzo et al., 8 Jan 2026).
2. Multi-Scale Drafting and Local Verification
Text (Chain-of-Thought Reasoning)
- Step-level Drafting: Instead of speculating individual tokens, MuLo-SD proposes full reasoning steps. Each step is generated autoregressively by a drafter $M_q$, then expanded in parallel by the target $M_p$.
- Inner Token-level SD: Within each reasoning step, conventional token-level SD operates, allowing additional fine-grained parallelism.
- Semantic Verification: Acceptance at the step level requires only that the draft step is semantically equivalent to the target step, not exactly token-matched. A verifier (e.g., LLM-as-Judge, similarity in embedding space) determines step acceptance.
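A minimal sketch of an embedding-similarity verifier. The toy bag-of-words `embed` stands in for a real sentence encoder (the source cites all-mpnet-base-v2), and the threshold value is illustrative:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real verifier would use a
    # sentence encoder such as all-mpnet-base-v2. Purely illustrative.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def accept_step(draft: str, target: str, threshold: float = 0.8) -> bool:
    """Step-level semantic acceptance: the draft step passes if it is
    close enough to the target step in embedding space, even when the
    token sequences differ."""
    return cosine(embed(draft), embed(target)) >= threshold

# Word order differs, content agrees -> accepted despite token mismatch.
print(accept_step("so x equals 4", "x equals 4 so"))
```

The key contrast with token-level SD is that acceptance here is a similarity test, not an exact-match or rejection-sampling test.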
Images (AR Synthesis)
- Low-to-High Resolution Drafting: MuLo-SD first generates image drafts at a reduced resolution using a low-res AR drafter $M_q$. Learned up-samplers expand these drafts to high-res candidates.
- Local Verification: Rather than requiring a global match, tokens are accepted if, within their codebook neighborhoods $B_k$, the sum of target probabilities exceeds a threshold $\tau$. Only small 2D spatial neighborhoods around rejected tokens are subject to resampling, drastically reducing unnecessary recomputation.
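The pooled-neighborhood acceptance test can be sketched as follows; the dictionary-based codebook, neighbor lists, and probability values are hypothetical stand-ins for the model's actual codebook and target distribution:

```python
def accept_token(p_target: dict, drafted: int,
                 neighbors: dict, tau: float) -> bool:
    """Accept a drafted image token if the target model's probability
    mass, pooled over the drafted code's k-nearest codebook neighbors
    (here a precomputed lookup table, an assumed representation),
    reaches the threshold tau."""
    pooled = sum(p_target.get(v, 0.0) for v in neighbors[drafted])
    return pooled >= tau

# Toy codebook of 4 codes; code 0's neighborhood is {0, 1}, etc.
neighbors = {0: [0, 1], 1: [1, 0], 2: [2, 3], 3: [3, 2]}
p = {0: 0.30, 1: 0.25, 2: 0.25, 3: 0.20}
print(accept_token(p, 0, neighbors, tau=0.5))  # pooled 0.55 -> accepted
print(accept_token(p, 2, neighbors, tau=0.5))  # pooled 0.45 -> rejected
```

Pooling over neighbors accepts drafts that land on a perceptually similar code even when the target's single most likely code differs.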
3. Formal Algorithmic Structure
MuLo-SD is best characterized as a two-stage hierarchical speculative inference method, summarized (for language) as:
- For each prefix, drafter $M_q$ generates candidate steps, each internally utilizing token-level drafts of length $\gamma$.
- Target $M_p$ expands all steps in batched parallel fashion.
- Verifier checks each draft–target pair for semantic equivalence; all accepted drafts are output, falling back to $M_p$'s expansion for the failed step.
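The two-stage loop above can be sketched as follows, with all model and verifier calls stubbed by hypothetical callables (`draft_step`, `expand_step`, and `verify` are stand-ins, not the paper's API):

```python
def mulo_sd_text(prefix, draft_step, expand_step, verify, max_rounds=8):
    """Two-stage loop sketch: the drafter proposes a full reasoning
    step, the target expands its own version (batched in a real
    system; sequential here), and a semantic verifier decides
    acceptance. On rejection we fall back to the target's step."""
    steps = []
    for _ in range(max_rounds):
        drafted = draft_step(prefix + steps)     # cheap drafter M_q
        expanded = expand_step(prefix + steps)   # target M_p
        chosen = drafted if verify(drafted, expanded) else expanded
        steps.append(chosen)
        if chosen.endswith("[EOS]"):
            break
    return steps

# Toy run: the verifier accepts iff both steps share their first word.
out = mulo_sd_text(
    [],
    draft_step=lambda ctx: f"step {len(ctx)} [EOS]" if len(ctx) == 1
                           else f"step {len(ctx)}",
    expand_step=lambda ctx: f"step {len(ctx)} exact",
    verify=lambda d, t: d.split()[0] == t.split()[0],
)
print(out)
```

In the real system each accepted step also runs inner token-level SD; that second level is omitted here for brevity.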
For images, the process may be summarized by:
- Down-sample the current prefix and draft a block of $L$ tokens at low resolution.
- Up-sample candidates to high-res.
- Allow acceptance for tokens where the pooled target-model probability within the $k$-nearest codebook neighbors $B_k$ exceeds $\tau$; for rejected positions, expand to a local 2D spatial neighborhood of radius $\ell$ and resample from $M_p$.
Pseudocode (image domain, draft–verify–reject–resample loop):
```text
Given draft block x̃_{n+1…n+L}, compute target distributions p_i(·) in parallel
Initialize rejection set R_T ← ∅
for t = 1…L:
    if Σ_{v ∈ B_k(x̃_{n+t})} p_{n+t}(v) ≥ τ:
        accept: x_{n+t} ← x̃_{n+t}
    else:
        reject: R_T ← R_T ∪ {n+t}
i₀ ← min R_T                              (first rejected position)
R_X ← ⋃_{i ∈ R_T} N((u_i, v_i); ℓ)        (union of local 2D neighborhoods)
for each j ∈ R_X in sorted (raster) order:
    resample x_j ∼ M_p(· | x_{<j}) sequentially
Append accepted and resampled tokens to the prefix
```
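The same draft–verify–reject–resample round as a runnable Python sketch, with the target model, codebook neighborhoods, and spatial neighborhood function all stubbed as hypothetical callables:

```python
def verify_and_resample(draft, p_target, neighbors, tau,
                        spatial_nbhd, sample_fn):
    """One round: accept tokens whose pooled codebook-neighborhood
    probability clears tau, then sequentially resample every position
    in the union of local neighborhoods around the rejections.
    All model access goes through the stub sample_fn."""
    rejected = [t for t, tok in enumerate(draft)
                if sum(p_target[t].get(v, 0.0) for v in neighbors[tok]) < tau]
    resample_set = set()
    for t in rejected:
        resample_set.update(spatial_nbhd(t))  # local neighborhood N(.; l)
    out = list(draft)
    for t in sorted(resample_set):            # sequential, raster order
        out[t] = sample_fn(out[:t])           # x_t ~ M_p(. | x_<t)
    return out, rejected

# Toy run: 4-position draft; the spatial neighborhood of position t
# is {t-1, t, t+1} clipped to the sequence.
draft = [0, 1, 2, 3]
neighbors = {0: [0, 1], 1: [1, 0], 2: [2, 3], 3: [3, 2]}
p_target = [{0: 0.6}, {1: 0.6}, {2: 0.1}, {3: 0.6}]  # position 2 fails
nbhd = lambda t: [i for i in (t - 1, t, t + 1) if 0 <= i < len(draft)]
out, rej = verify_and_resample(draft, p_target, neighbors, 0.5, nbhd,
                               sample_fn=lambda prefix: 9)
print(rej, out)  # positions 1-3 are resampled around the single rejection
```

Note how one rejection triggers resampling only in its local neighborhood, not from the rejection point to the end of the block as in raster-scan SD.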
4. Theoretical Speedup and Empirical Performance
Analytical Speedup
Let $\alpha$ be the per-token acceptance rate, $c$ the drafter/target cost ratio, $\gamma$ the number of speculative steps (text), $s$ the scale factor (images), and $\beta$ the acceptance rate after local verification.
- Text: The overall theoretical speedup is the product $S = S_{\text{step}} \cdot S_{\text{tok}}$, with $S_{\text{step}}$ and $S_{\text{tok}}$ denoting the step-level and token-level speedups. Under infinite concurrency the speedup grows with both the step lookahead and the token draft length $\gamma$; under hardware (batch-size) constraints, optimal settings involve nontrivial values of both (Fu et al., 24 Jun 2025).
- Images: If $N$ is the full-sequence length and $N/s^2$ the low-resolution draft length (for a scale ratio $s$), the ideal speedup is $S = \frac{N}{N/s^2 + (1-\beta)N} = \frac{1}{1/s^2 + (1-\beta)}$, assuming the rejected fraction $1-\beta$ must be resampled sequentially.
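Under a simple cost model (low-resolution drafting costs $N/s^2$ steps and the rejected fraction $1-\beta$ is resampled sequentially; an assumption for illustration, not necessarily the paper's exact formula), the ideal image-domain speedup can be computed directly:

```python
def ideal_speedup(s: float, beta: float) -> float:
    """Illustrative ideal speedup for low-to-high-res drafting:
    baseline cost of N sequential steps vs. N/s^2 low-res draft steps
    plus sequential resampling of the (1 - beta) rejected fraction.
    The cost model is an assumption made for this sketch."""
    return 1.0 / (1.0 / s**2 + (1.0 - beta))

# A 2x scale ratio with 75% local acceptance gives a 2.0x ideal
# speedup, near the empirically reported range.
print(round(ideal_speedup(2.0, 0.75), 2))
```

The formula makes the two levers explicit: a larger scale ratio $s$ shrinks the drafting cost quadratically, while the local-verification acceptance rate $\beta$ bounds the sequential resampling tail.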
Empirical Benchmarks
Text Domain:
- On GSM8K, token-only SD reaches a 1.4× speedup; adding step-level lookahead on top, the combined MuLo-SD reaches the upper end of the reported 1.5–2.1× range while preserving answer accuracy.
- Across AIME, AMC, QA, and code generation, observed speedups are 1.5–2.0× (Fu et al., 24 Jun 2025).
Image Domain:
- On MS-COCO 2017 (Tar-1.5B) at 1024p, MuLo-SD achieves a 1.68× speedup, with an FID increase of under 1.0 point and GenEval and DPG-Bench semantic alignment within 1% of baseline.
- Outperforms EAGLE-2 and LANTERN baselines, especially at high resolutions, maintaining negligible perceptual and semantic degradation (Peruzzo et al., 8 Jan 2026).
| Method | Speedup | GenEval (%) | DPG-Bench (%) | FID | HPSv2 |
|---|---|---|---|---|---|
| AR Sequential | 1.00× | 77.1 | 82.3 | 32.4 | 29.5 |
| LANTERN | 1.42× | 75.4 | 82.3 | 31.1 | 28.5 |
| MuLo-SD (4× up) | 1.68× | 76.3 | 82.0 | 32.8 | 28.4 |
5. Model Architecture and Verification Components
- Drafter/Target Models: For text, distilled models such as DeepSeek-R1-Distill (a 1.5B drafter paired with a 32B target) and Qwen3 were used. For images, Tar-1.5B serves as the AR backbone.
- Learned Up-samplers: Composed of stacked masked convolutional ResNet blocks with pixel-shuffle, trained with a mix of MSE, LPIPS, commitment, and PatchGAN losses. Removing the adversarial or perceptual losses degrades output detail (Peruzzo et al., 8 Jan 2026).
- Verifier: LLM-as-Judge (Qwen2.5-7B-Instruct) and embedding-based methods (all-mpnet-base-v2 with a similarity threshold) were evaluated, the former providing the highest alignment accuracy for text (Fu et al., 24 Jun 2025).
6. Ablation Studies and Design Choices
- Probabilistic Pooling: Pooling target probabilities over the $k$-nearest codebook neighbors raises acceptance rates at high speedup, with most of the gains realized via tuning of the threshold $\tau$.
- Local Neighborhood Radius ($\ell$): For images, an intermediate radius yields the most robust correction rate: larger radii lead to excessive resampling, while smaller radii provide insufficient context for correction.
- Multi-Branch Drafting: Drafting multiple candidate steps in parallel (width $w > 1$) in text increases per-step acceptance but introduces overhead that grows with $w$. In practice the optimal tradeoff lies at small widths (Fu et al., 24 Jun 2025).
7. Limitations and Prospects for Extension
- Step Segmentation: In text, step-level boundaries are detected via fixed delimiters (e.g., “\n\n”); adaptively learned segmentation could potentially improve draft–verifier alignment.
- Verifier Trade-offs: Embedding-based and target-scoring verifiers are faster, but risk higher misalignment rates; the design of lightweight, high-precision judges is an open question.
- Extension to Parallel Decoders: Application of MuLo-SD in combination with architectures such as ZipAR, or adaptation to 3D video or multi-modal decoding, remains an unexplored direction (Peruzzo et al., 8 Jan 2026).
- Adaptive Parameterization: Learning the draft length $\gamma$, the step lookahead, or the local radius $\ell$ online as a function of observed acceptance rates is an active area for further research.
MuLo-SD constitutes a unified, multi-scale, and locality-aware paradigm for AR model acceleration, establishing new state-of-the-art results in both language and vision, and providing a principled path forward for scalable, high-fidelity generative inference (Fu et al., 24 Jun 2025, Peruzzo et al., 8 Jan 2026).