
MuLo-SD: Multi-Scale Local Speculative Decoding

Updated 15 January 2026
  • The paper introduces a hierarchical speculative decoding framework that leverages multi-scale drafting and localized verification to drastically reduce wall-clock latency in autoregressive models.
  • MuLo-SD is a parallelized inference system that combines coarse-to-fine drafting with local token verification, ensuring both efficiency and high-fidelity outputs in chain-of-thought reasoning and image synthesis.
  • Empirical benchmarks show that MuLo-SD achieves speedups of 1.5–2.1× while maintaining near-baseline performance metrics, demonstrating its potential for scalable generative inference.

Multi-Scale Local Speculative Decoding (MuLo-SD) is a parallelized inference framework designed to accelerate the generation of sequences in autoregressive (AR) models. By combining multi-scale drafting—coarse-to-fine for images or stepwise for text—with localized verification and resampling mechanisms, MuLo-SD offers substantially lower wall-clock latency than conventional token-level or single-scale speculative decoding methods. The framework has demonstrated state-of-the-art performance for both chain-of-thought reasoning in LLMs and high-fidelity image synthesis, with rigorous support across theoretical analysis and empirical benchmarks (Fu et al., 24 Jun 2025, Peruzzo et al., 8 Jan 2026).

1. Conceptual Foundations and Motivation

Traditional AR models, whether for text (sequences of tokens) or images (spatial grids of discrete tokens), generate outputs one unit at a time, resulting in O(T) or O(S²) sequential operations for a sequence of length T or an image of size S×S. Standard speculative decoding (SD) accelerates generation via a two-model setup: a lightweight drafter proposes several candidate tokens, and a full (target) model accepts or rejects these proposals in parallel via rejection sampling. However, token-level SD faces a fundamental acceptance rate bottleneck: the probability of an entire draft being accepted decays exponentially with the draft length, which sharply limits attainable speedup, especially as the sequence grows.
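The acceptance-rate bottleneck can be made concrete with the standard speculative-decoding expectation. The numbers below are illustrative, not from the paper: with an i.i.d. per-token acceptance rate α, a full draft of length γ survives with probability α^γ, while the expected number of tokens emitted per round is (1 − α^(γ+1))/(1 − α).

```python
# Illustration (not from the paper): with an i.i.d. per-token acceptance
# rate alpha, the probability that an entire draft of length gamma is
# accepted is alpha**gamma -- the exponential decay noted above -- while
# the expected number of tokens emitted per round still grows with gamma.

def full_draft_accept_prob(alpha: float, gamma: int) -> float:
    return alpha ** gamma

def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    # Standard SD expectation: 1 + alpha + alpha**2 + ... + alpha**gamma
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for gamma in (2, 4, 8, 16):
    print(f"gamma={gamma:2d}  "
          f"P(all accepted)={full_draft_accept_prob(0.8, gamma):.3f}  "
          f"E[tokens/round]={expected_tokens_per_round(0.8, gamma):.2f}")
```

At α = 0.8, a 16-token draft is accepted in full under 3% of the time, while the marginal gain in expected tokens per round has already saturated: the motivation for moving beyond single-scale token drafts.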

MuLo-SD advances the SD paradigm by incorporating both multi-scale drafting and local (stepwise/spatial) verification. For text, this exploits the hierarchical structure of chain-of-thought reasoning, while for images it leverages spatial redundancy and locality, addressing inefficiencies that arise from raster-scan rejection or failure to exploit multi-resolution structure (Fu et al., 24 Jun 2025, Peruzzo et al., 8 Jan 2026).

2. Multi-Scale Drafting and Local Verification

Text (Chain-of-Thought Reasoning)

  • Step-level Drafting: Instead of speculating γ tokens, MuLo-SD proposes k full reasoning steps. Each step is generated autoregressively by a drafter q, then expanded in parallel by the target p.
  • Inner Token-level SD: Within each reasoning step, conventional token-level SD operates, allowing additional fine-grained parallelism.
  • Semantic Verification: Acceptance at the step level requires only that the draft step is semantically equivalent to the target step, not exactly token-matched. A verifier V (e.g., LLM-as-Judge, similarity in embedding space) determines step acceptance.
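As a minimal sketch of the semantic-verification idea, the toy check below accepts a drafted step when it is close to the target step under a similarity threshold θ. The bag-of-words embedding and helper names are stand-ins for illustration only; the paper's actual verifiers are an LLM judge or all-mpnet-base-v2 sentence embeddings.

```python
# Sketch of step-level semantic verification (hypothetical helper names).
# A toy bag-of-words embedding stands in for the real sentence encoder.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def accept_step(draft_step: str, target_step: str, theta: float = 0.7) -> bool:
    # Accept when draft and target steps are semantically close,
    # even if their token sequences differ.
    return cosine(embed(draft_step), embed(target_step)) >= theta

print(accept_step("so x equals 4", "thus x equals 4"))  # 0.75 >= 0.7 -> True
```

The key contrast with token-level SD is that "so x equals 4" and "thus x equals 4" would fail an exact-match check at the first token but pass a semantic one.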

Images (AR Synthesis)

  • Low-to-High Resolution Drafting: MuLo-SD first generates image drafts at a reduced resolution using a low-res AR drafter M_q. Learned up-samplers U expand these drafts to high-res candidates.
  • Local Verification: Rather than requiring a global match, tokens are accepted if, within their codebook neighborhoods B_k(·), the sum of target probabilities exceeds a threshold τ. Only small 2D spatial neighborhoods around rejected tokens are subject to resampling, drastically reducing unnecessary recomputation.
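A minimal sketch of this acceptance rule (interfaces assumed, not the authors' code): pool the target distribution over the draft token's codebook neighborhood and compare against τ.

```python
# Toy sketch of the neighborhood acceptance rule (assumed interfaces, not
# the authors' code): a draft token is accepted when the target model's
# probability mass, pooled over the token's codebook neighborhood B_k,
# reaches the threshold tau.

def pooled_accept(target_probs: dict, neighborhood: set, tau: float) -> bool:
    return sum(target_probs.get(v, 0.0) for v in neighborhood) >= tau

# target distribution over a tiny 4-entry codebook
p = {0: 0.05, 1: 0.30, 2: 0.40, 3: 0.25}
print(pooled_accept(p, {1, 2}, tau=0.6))  # 0.70 >= 0.6 -> True
print(pooled_accept(p, {0}, tau=0.6))     # 0.05 -> False
```

Pooling is what tolerates near-miss drafts: token 1 alone carries only 0.30 mass, below τ, but together with its neighbor it clears the threshold.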

3. Formal Algorithmic Structure

MuLo-SD is best characterized as a two-stage hierarchical speculative inference method, summarized (for language) as:

  • For each prefix, drafter q generates k candidate steps, each internally utilizing token-level drafts of length γ.
  • Target p expands all steps in batched parallel fashion.
  • Verifier V checks each draft–target pair (s_j, ŝ_j) for semantic equivalence; all accepted drafts are output, falling back to p for the failed step.
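The control flow above can be sketched as follows. The callables `drafter`, `target_expand`, and `verifier` are hypothetical interfaces, and real implementations batch the target expansion across all drafted steps; this shows only the accept/fall-back logic.

```python
# Skeleton of the step-level loop (hypothetical callables; not the
# authors' implementation -- a sketch of the control flow only).
from typing import Callable, List

def mulo_sd_text_round(prefix: List[str],
                       drafter: Callable[[List[str], int], List[str]],
                       target_expand: Callable[[List[str], int], List[str]],
                       verifier: Callable[[str, str], bool],
                       k: int) -> List[str]:
    drafts = drafter(prefix, k)           # k candidate reasoning steps from q
    targets = target_expand(prefix, k)    # batched parallel expansion by p
    out = list(prefix)
    for s_hat, s in zip(drafts, targets):
        if verifier(s, s_hat):            # semantic equivalence check
            out.append(s_hat)             # keep the cheap draft step
        else:
            out.append(s)                 # fall back to the target's step
            break                         # stop at the first failure
    return out

# tiny demo with string equality standing in for the semantic check
out = mulo_sd_text_round(
    [], lambda pre, k: ["a", "b"], lambda pre, k: ["a", "c"],
    lambda s, s_hat: s == s_hat, k=2)
print(out)  # ['a', 'c']: first draft accepted, fallback on the second
```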

For images, the process may be summarized by:

  • Down-sample current prefix, draft block of tokens at low-res.
  • Up-sample candidates to high-res.
  • Allow acceptance for tokens where the pooled target-model probability within the k-nearest codebook neighbors exceeds τ; for rejected positions, expand to a local 2D spatial neighborhood and resample from M_p.
Pseudocode (image domain, draft–verify–reject–resample loop):

```
Given draft block x̃_{n+1..n+L}, compute target probabilities p_i(·) in parallel
Initialize rejection set R_T ← ∅
For t = 1..L:
    if Σ_{v ∈ B_k(x̃_{n+t})} p_{n+t}(v) ≥ τ:
        accept: x_{n+t} ← x̃_{n+t}
    else:
        reject: R_T ← R_T ∪ {n+t}
Let i₀ ← min R_T
Define R_X ← ⋃_{i ∈ R_T} N((u_i, v_i); ℓ)
For each j in sorted R_X:
    sequentially resample x_j ~ M_p(· | x_{<j})
Append accepted and resampled tokens to the prefix
```

(Peruzzo et al., 8 Jan 2026)
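For concreteness, here is a runnable toy version of this loop on a 1-D token strip (the paper operates on 2-D grids). The codebook, target distribution, and greedy resampler are mocks, not the paper's models.

```python
# Runnable toy of the draft-verify-reject-resample loop above, on a 1-D
# token strip for brevity (the paper uses 2-D spatial neighborhoods).
# The codebook, target distribution, and greedy resampler are mocks.

CODEBOOK = list(range(8))

def target_dist(pos: int) -> dict:
    # mock target model M_p: distribution peaked at (pos % 8)
    peak = pos % 8
    return {v: (0.65 if v == peak else 0.05) for v in CODEBOOK}

def neighborhood(v: int, k: int = 1) -> set:
    # codebook neighborhood B_k: entries within distance k of v
    return {u for u in CODEBOOK if abs(u - v) <= k}

def verify_and_resample(draft, tau=0.6, ell=1):
    n = len(draft)
    dists = [target_dist(t) for t in range(n)]        # computed "in parallel"
    rejected = {t for t in range(n)
                if sum(dists[t][v] for v in neighborhood(draft[t])) < tau}
    # expand each rejection to its local neighborhood of radius ell
    resample_at = {j for t in rejected
                   for j in range(max(0, t - ell), min(n, t + ell + 1))}
    out = list(draft)
    for j in sorted(resample_at):                     # sequential resampling
        out[j] = max(dists[j], key=dists[j].get)      # greedy stand-in
    return out, rejected

final, rejected = verify_and_resample([0, 1, 7, 3])
print(rejected, final)  # {2} [0, 1, 2, 3]
```

Position 2 drafts token 7, far from the target's peak at 2, so only that position (and its radius-1 neighbors) gets resampled while the rest of the block is accepted in parallel.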

4. Theoretical Speedup and Empirical Performance

Analytical Speedup

Let α be the per-token acceptance rate, c the drafter/target cost ratio, k the number of speculative steps (text), r the scale factor (images), and a the acceptance rate after local verification.

  • Text: The overall theoretical speedup is

h(k, γ) = f(k) × g(γ)

with f(k) and g(γ) denoting the step-level and token-level speedups. Under infinite concurrency the two factors compound freely; under hardware constraints (k · γ ≤ M), optimal settings involve both k ≥ 2 and γ ≥ 2 (Fu et al., 24 Jun 2025).
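The constrained choice of (k, γ) can be illustrated with a small exhaustive search; the saturating curves below are assumed shapes for illustration, not the paper's exact f and g.

```python
# Illustration of the hardware-constrained choice of (k, gamma): exhaustive
# search maximizing f(k) * g(gamma) subject to k * gamma <= M. The
# saturating curves below are assumed shapes, not the paper's exact f and g.

def best_config(f, g, M):
    best = (1, 1, f(1) * g(1))
    for k in range(1, M + 1):
        for gamma in range(1, M // k + 1):
            h = f(k) * g(gamma)
            if h > best[2]:
                best = (k, gamma, h)
    return best

f = lambda k: 2.0 - 1.0 / k          # step-level speedup, diminishing returns
g = lambda gamma: 1.5 - 0.5 / gamma  # token-level speedup, diminishing returns

k, gamma, h = best_config(f, g, M=8)
print(k, gamma, round(h, 3))  # optimum has both k >= 2 and gamma >= 2
```

Under any pair of curves with diminishing returns, spending the whole budget M on one axis is suboptimal, which is why the optimum splits it across both k and γ.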

  • Images: If T_p is the full-sequence token count and T_q = T_p / r² (for a scale ratio r), the ideal speedup is

S = T_p / ((1 − a) · T_p + T_q)
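A worked instance of this formula with illustrative numbers (not figures from the paper):

```python
# Worked instance of the formula above with illustrative numbers (not
# figures from the paper): S = T_p / ((1 - a) * T_p + T_q), T_q = T_p / r**2.

def ideal_speedup(T_p: float, r: int, a: float) -> float:
    T_q = T_p / r ** 2
    return T_p / ((1 - a) * T_p + T_q)

# e.g. a 4096-token grid drafted at 1/4 scale with 80% local acceptance
print(round(ideal_speedup(T_p=4096, r=4, a=0.8), 2))  # 3.81
```

The draft cost T_q shrinks quadratically with r, so at larger scale ratios the speedup is dominated by the (1 − a) fraction of tokens that must be resampled.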

Empirical Benchmarks

Text Domain:

  • On GSM8K, token-only SD reaches ≈1.4× speedup, step-level lookahead alone gives f(6) ≈ 1.7×, and combined MuLo-SD achieves 2.11×, preserving answer accuracy within ±2%.
  • Across AIME, AMC, QA, and code generation, observed speedups are 1.5–2.0× (Fu et al., 24 Jun 2025).

Image Domain:

  • On MS-COCO 2017 (Tar-1.5B) at 1024p, MuLo-SD achieves a 1.68× speedup, with an FID increase of under 1.0 point and GenEval and DPG-Bench semantic alignment within 1% of baseline.
  • Outperforms EAGLE-2 and LANTERN baselines, especially at high resolutions, maintaining negligible perceptual and semantic degradation (Peruzzo et al., 8 Jan 2026).
| Method          | Speedup | GenEval (%) | DPG-Bench (%) | FID  | HPSv2 |
|-----------------|---------|-------------|---------------|------|-------|
| AR Sequential   | 1.00×   | 77.1        | 82.3          | 32.4 | 29.5  |
| LANTERN         | 1.42×   | 75.4        | 82.3          | 31.1 | 28.5  |
| MuLo-SD (4× up) | 1.68×   | 76.3        | 82.0          | 32.8 | 28.4  |

5. Model Architecture and Verification Components

  • Drafter/Target Models: For text, distilled pairs such as DeepSeek-R1-Distill (1.5B → 32B) and Qwen3 were used. For images, Tar-1.5B serves as the AR backbone.
  • Learnt Up-samplers: Composed of stacked masked convolutional ResNet blocks with pixel-shuffle, trained with mixed MSE, LPIPS, commitment and PatchGAN losses. Removal of adversarial or perceptual losses degrades output detail (Peruzzo et al., 8 Jan 2026).
  • Verifier: LLM-as-Judge (Qwen2.5-7B-Instruct) and embedding-based methods (all-mpnet-base-v2, threshold θ) were evaluated, the former providing the highest alignment accuracy for text (Fu et al., 24 Jun 2025).

6. Ablation Studies and Design Choices

  • Probabilistic Pooling: Pooling target probabilities over k-nearest codebook neighbors raises acceptance rates at high speedup, with most gains realized via the threshold τ.
  • Local Neighborhood Radius (ℓ): For images, ℓ = 3 yields the most robust correction rate for 512→1024p up-sampling, with larger radii leading to excessive resampling and smaller radii proving insufficient for correcting context.
  • Multi-Branch Drafting: Exploring width W > 1 in text increases per-step acceptance but introduces overhead growing as W^k. The optimal tradeoff in practice is at W ≈ 2 (Fu et al., 24 Jun 2025).

7. Limitations and Prospects for Extension

  • Step Segmentation: In text, step-level boundaries are detected via fixed delimiters (e.g., “\n\n”); adaptively learned segmentation could potentially improve draft–verifier alignment.
  • Verifier Trade-offs: Embedding-based and target-scoring verifiers are faster, but risk higher misalignment rates; the design of lightweight, high-precision judges is an open question.
  • Extension to Parallel Decoders: Application of MuLo-SD in combination with architectures such as ZipAR, or adaptation to 3D video or multi-modal decoding, remains an unexplored direction (Peruzzo et al., 8 Jan 2026).
  • Adaptive Parameterization: The possibility of learning the draft length γ, lookahead k, or local radius ℓ in an online fashion as a function of acceptance rates is an active area for further research.

MuLo-SD constitutes a unified, multi-scale, and locality-aware paradigm for AR model acceleration, establishing new state-of-the-art results in both language and vision, and providing a principled path forward for scalable, high-fidelity generative inference (Fu et al., 24 Jun 2025, Peruzzo et al., 8 Jan 2026).
