
MuLo-SD: Multi-Scale Local Speculative Decoding

Updated 15 January 2026
  • The paper introduces a hierarchical speculative decoding framework that leverages multi-scale drafting and localized verification to drastically reduce wall-clock latency in autoregressive models.
  • MuLo-SD is a parallelized inference system that combines coarse-to-fine drafting with local token verification, ensuring both efficiency and high-fidelity outputs in chain-of-thought reasoning and image synthesis.
  • Empirical benchmarks show that MuLo-SD achieves speedups of 1.5–2.1× while maintaining near-baseline performance metrics, demonstrating its potential for scalable generative inference.

Multi-Scale Local Speculative Decoding (MuLo-SD) is a parallelized inference framework designed to accelerate the generation of sequences in autoregressive (AR) models. By combining multi-scale drafting—coarse-to-fine for images or stepwise for text—with localized verification and resampling mechanisms, MuLo-SD offers substantially lower wall-clock latency than conventional token-level or single-scale speculative decoding methods. The framework has demonstrated state-of-the-art performance for both chain-of-thought reasoning in LLMs and high-fidelity image synthesis, with rigorous support across theoretical analysis and empirical benchmarks (Fu et al., 24 Jun 2025, Peruzzo et al., 8 Jan 2026).

1. Conceptual Foundations and Motivation

Traditional AR models, whether for text (sequences of tokens) or images (spatial grids of discrete tokens), generate outputs one unit at a time, resulting in O(T) or O(S²) sequential operations for a sequence of length T or an image of size S×S. Standard speculative decoding (SD) accelerates generation via a two-model setup: a lightweight drafter proposes several candidate tokens, and a full (target) model accepts or rejects these proposals in parallel via rejection sampling. However, token-level SD faces a fundamental acceptance rate bottleneck: the probability of an entire draft being accepted decays exponentially with the draft length, which sharply limits attainable speedup, especially as the sequence grows.
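The acceptance-rate bottleneck can be made concrete with the standard speculative-decoding expectation. The numbers below are illustrative, not from the paper: with an i.i.d. per-token acceptance rate α, a full draft of length γ survives with probability α^γ, while the expected number of tokens emitted per round is (1 − α^(γ+1))/(1 − α).

```python
# Illustration (not from the paper): with an i.i.d. per-token acceptance
# rate alpha, the probability that an entire draft of length gamma is
# accepted is alpha**gamma -- the exponential decay noted above -- while
# the expected number of tokens emitted per round still grows with gamma.

def full_draft_accept_prob(alpha: float, gamma: int) -> float:
    return alpha ** gamma

def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    # Standard SD expectation: 1 + alpha + alpha**2 + ... + alpha**gamma
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for gamma in (2, 4, 8, 16):
    print(f"gamma={gamma:2d}  "
          f"P(all accepted)={full_draft_accept_prob(0.8, gamma):.3f}  "
          f"E[tokens/round]={expected_tokens_per_round(0.8, gamma):.2f}")
```

At α = 0.8, a 16-token draft is accepted in full under 3% of the time, while the marginal gain in expected tokens per round has already saturated: the motivation for moving beyond single-scale token drafts.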

MuLo-SD advances the SD paradigm by incorporating both multi-scale drafting and local (stepwise/spatial) verification. For text, this exploits the hierarchical structure of chain-of-thought reasoning, while for images it leverages spatial redundancy and locality, addressing inefficiencies that arise from raster-scan rejection or failure to exploit multi-resolution structure (Fu et al., 24 Jun 2025, Peruzzo et al., 8 Jan 2026).

2. Multi-Scale Drafting and Local Verification

Text (Chain-of-Thought Reasoning)

  • Step-level Drafting: Instead of speculating γ tokens, MuLo-SD proposes k full reasoning steps. Each step is generated autoregressively by a drafter q, then expanded in parallel by the target p.
  • Inner Token-level SD: Within each reasoning step, conventional token-level SD operates, allowing additional fine-grained parallelism.
  • Semantic Verification: Acceptance at the step level requires only that the draft step is semantically equivalent to the target step, not exactly token-matched. A verifier V (e.g., LLM-as-Judge, similarity in embedding space) determines step acceptance.
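As a minimal sketch of the semantic-verification idea, the toy check below accepts a drafted step when it is close to the target step under a similarity threshold θ. The bag-of-words embedding and helper names are stand-ins for illustration only; the paper's actual verifiers are an LLM judge or all-mpnet-base-v2 sentence embeddings.

```python
# Sketch of step-level semantic verification (hypothetical helper names).
# A toy bag-of-words embedding stands in for the real sentence encoder.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def accept_step(draft_step: str, target_step: str, theta: float = 0.7) -> bool:
    # Accept when draft and target steps are semantically close,
    # even if their token sequences differ.
    return cosine(embed(draft_step), embed(target_step)) >= theta

print(accept_step("so x equals 4", "thus x equals 4"))  # 0.75 >= 0.7 -> True
```

The key contrast with token-level SD is that "so x equals 4" and "thus x equals 4" would fail an exact-match check at the first token but pass a semantic one.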

Images (AR Synthesis)

  • Low-to-High Resolution Drafting: MuLo-SD first generates image drafts at a reduced resolution using a low-res AR drafter M_q. Learned up-samplers U expand these drafts to high-res candidates.
  • Local Verification: Rather than requiring a global match, tokens are accepted if, within their codebook neighborhoods B_k(·), the sum of target probabilities exceeds a threshold τ. Only small 2D spatial neighborhoods around rejected tokens are subject to resampling, drastically reducing unnecessary recomputation.
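A minimal sketch of this acceptance rule (interfaces assumed, not the authors' code): pool the target distribution over the draft token's codebook neighborhood and compare against τ.

```python
# Toy sketch of the neighborhood acceptance rule (assumed interfaces, not
# the authors' code): a draft token is accepted when the target model's
# probability mass, pooled over the token's codebook neighborhood B_k,
# reaches the threshold tau.

def pooled_accept(target_probs: dict, neighborhood: set, tau: float) -> bool:
    return sum(target_probs.get(v, 0.0) for v in neighborhood) >= tau

# target distribution over a tiny 4-entry codebook
p = {0: 0.05, 1: 0.30, 2: 0.40, 3: 0.25}
print(pooled_accept(p, {1, 2}, tau=0.6))  # 0.70 >= 0.6 -> True
print(pooled_accept(p, {0}, tau=0.6))     # 0.05 -> False
```

Pooling is what tolerates near-miss drafts: token 1 alone carries only 0.30 mass, below τ, but together with its neighbor it clears the threshold.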

3. Formal Algorithmic Structure

MuLo-SD is best characterized as a two-stage hierarchical speculative inference method, summarized (for language) as:

  • For each prefix, drafter q generates k candidate steps, each internally utilizing token-level drafts of length γ.
  • Target p expands all steps in batched parallel fashion.
  • Verifier V checks each draft–target pair (s_j, ŝ_j) for semantic equivalence; all accepted drafts are output, falling back to p for the failed step.
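The control flow above can be sketched as follows. The callables `drafter`, `target_expand`, and `verifier` are hypothetical interfaces, and real implementations batch the target expansion across all drafted steps; this shows only the accept/fall-back logic.

```python
# Skeleton of the step-level loop (hypothetical callables; not the
# authors' implementation -- a sketch of the control flow only).
from typing import Callable, List

def mulo_sd_text_round(prefix: List[str],
                       drafter: Callable[[List[str], int], List[str]],
                       target_expand: Callable[[List[str], int], List[str]],
                       verifier: Callable[[str, str], bool],
                       k: int) -> List[str]:
    drafts = drafter(prefix, k)           # k candidate reasoning steps from q
    targets = target_expand(prefix, k)    # batched parallel expansion by p
    out = list(prefix)
    for s_hat, s in zip(drafts, targets):
        if verifier(s, s_hat):            # semantic equivalence check
            out.append(s_hat)             # keep the cheap draft step
        else:
            out.append(s)                 # fall back to the target's step
            break                         # stop at the first failure
    return out

# tiny demo with string equality standing in for the semantic check
out = mulo_sd_text_round(
    [], lambda pre, k: ["a", "b"], lambda pre, k: ["a", "c"],
    lambda s, s_hat: s == s_hat, k=2)
print(out)  # ['a', 'c']: first draft accepted, fallback on the second
```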

For images, the process may be summarized by:

  • Down-sample current prefix, draft block of tokens at low-res.
  • Up-sample candidates to high-res.
  • Allow acceptance for tokens where the pooled target-model probability within the k-nearest codebook neighbors exceeds τ; for rejected positions, expand to a local 2D spatial neighborhood and resample from M_p.
Pseudocode (image domain, draft–verify–reject–resample loop):

```
Given draft block x̃_{n+1..n+L}, compute target probabilities p_i(·) in parallel
Initialize rejection set R_T ← ∅
For t = 1..L:
    if Σ_{v ∈ B_k(x̃_{n+t})} p_{n+t}(v) ≥ τ:
        accept: x_{n+t} ← x̃_{n+t}
    else:
        reject: R_T ← R_T ∪ {n+t}
Let i₀ ← min R_T
Define R_X ← ⋃_{i ∈ R_T} N((u_i, v_i); ℓ)
For each j in sorted R_X:
    sequentially resample x_j ~ M_p(· | x_{<j})
Append accepted and resampled tokens to the prefix
```

(Peruzzo et al., 8 Jan 2026)
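For concreteness, here is a runnable toy version of this loop on a 1-D token strip (the paper operates on 2-D grids). The codebook, target distribution, and greedy resampler are mocks, not the paper's models.

```python
# Runnable toy of the draft-verify-reject-resample loop above, on a 1-D
# token strip for brevity (the paper uses 2-D spatial neighborhoods).
# The codebook, target distribution, and greedy resampler are mocks.

CODEBOOK = list(range(8))

def target_dist(pos: int) -> dict:
    # mock target model M_p: distribution peaked at (pos % 8)
    peak = pos % 8
    return {v: (0.65 if v == peak else 0.05) for v in CODEBOOK}

def neighborhood(v: int, k: int = 1) -> set:
    # codebook neighborhood B_k: entries within distance k of v
    return {u for u in CODEBOOK if abs(u - v) <= k}

def verify_and_resample(draft, tau=0.6, ell=1):
    n = len(draft)
    dists = [target_dist(t) for t in range(n)]        # computed "in parallel"
    rejected = {t for t in range(n)
                if sum(dists[t][v] for v in neighborhood(draft[t])) < tau}
    # expand each rejection to its local neighborhood of radius ell
    resample_at = {j for t in rejected
                   for j in range(max(0, t - ell), min(n, t + ell + 1))}
    out = list(draft)
    for j in sorted(resample_at):                     # sequential resampling
        out[j] = max(dists[j], key=dists[j].get)      # greedy stand-in
    return out, rejected

final, rejected = verify_and_resample([0, 1, 7, 3])
print(rejected, final)  # {2} [0, 1, 2, 3]
```

Position 2 drafts token 7, far from the target's peak at 2, so only that position (and its radius-1 neighbors) gets resampled while the rest of the block is accepted in parallel.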

4. Theoretical Speedup and Empirical Performance

Analytical Speedup

Let α be the per-token acceptance rate, c the drafter/target cost ratio, k the number of speculative steps (text), r the scale factor (images), and a the acceptance rate after local verification.

  • Text: The overall theoretical speedup is

h(k, γ) = f(k) × g(γ)

with f(k) and g(γ) denoting the step-level and token-level speedups. Under infinite concurrency the two factors compound freely; under hardware constraints (k · γ ≤ M), optimal settings involve both k ≥ 2 and γ ≥ 2 (Fu et al., 24 Jun 2025).
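The constrained choice of (k, γ) can be illustrated with a small exhaustive search; the saturating curves below are assumed shapes for illustration, not the paper's exact f and g.

```python
# Illustration of the hardware-constrained choice of (k, gamma): exhaustive
# search maximizing f(k) * g(gamma) subject to k * gamma <= M. The
# saturating curves below are assumed shapes, not the paper's exact f and g.

def best_config(f, g, M):
    best = (1, 1, f(1) * g(1))
    for k in range(1, M + 1):
        for gamma in range(1, M // k + 1):
            h = f(k) * g(gamma)
            if h > best[2]:
                best = (k, gamma, h)
    return best

f = lambda k: 2.0 - 1.0 / k          # step-level speedup, diminishing returns
g = lambda gamma: 1.5 - 0.5 / gamma  # token-level speedup, diminishing returns

k, gamma, h = best_config(f, g, M=8)
print(k, gamma, round(h, 3))  # optimum has both k >= 2 and gamma >= 2
```

Under any pair of curves with diminishing returns, spending the whole budget M on one axis is suboptimal, which is why the optimum splits it across both k and γ.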

  • Images: If T_p is the full-sequence token count and T_q = T_p / r² (for a scale ratio r), the ideal speedup is

S = T_p / ((1 − a) · T_p + T_q)
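A worked instance of this formula with illustrative numbers (not figures from the paper):

```python
# Worked instance of the formula above with illustrative numbers (not
# figures from the paper): S = T_p / ((1 - a) * T_p + T_q), T_q = T_p / r**2.

def ideal_speedup(T_p: float, r: int, a: float) -> float:
    T_q = T_p / r ** 2
    return T_p / ((1 - a) * T_p + T_q)

# e.g. a 4096-token grid drafted at 1/4 scale with 80% local acceptance
print(round(ideal_speedup(T_p=4096, r=4, a=0.8), 2))  # 3.81
```

The draft cost T_q shrinks quadratically with r, so at larger scale ratios the speedup is dominated by the (1 − a) fraction of tokens that must be resampled.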

Empirical Benchmarks

Text Domain:

  • On GSM8K, token-only SD reaches ≈1.4× speedup, step-level lookahead alone gives f(6) ≈ 1.7×, and combined MuLo-SD achieves 2.11×, preserving answer accuracy within ±2%.
  • Across AIME, AMC, QA, and code generation, observed speedups are 1.5–2.0× (Fu et al., 24 Jun 2025).

Image Domain:

  • On MS-COCO 2017 (Tar-1.5B) at 1024p, MuLo-SD achieves a 1.68× speedup, with an FID increase of under 1.0 point and GenEval and DPG-Bench semantic alignment within 1% of baseline.
  • Outperforms EAGLE-2 and LANTERN baselines, especially at high resolutions, maintaining negligible perceptual and semantic degradation (Peruzzo et al., 8 Jan 2026).
| Method          | Speedup | GenEval (%) | DPG-Bench (%) | FID  | HPSv2 |
|-----------------|---------|-------------|---------------|------|-------|
| AR Sequential   | 1.00×   | 77.1        | 82.3          | 32.4 | 29.5  |
| LANTERN         | 1.42×   | 75.4        | 82.3          | 31.1 | 28.5  |
| MuLo-SD (4× up) | 1.68×   | 76.3        | 82.0          | 32.8 | 28.4  |

5. Model Architecture and Verification Components

  • Drafter/Target Models: For text, distilled pairs such as DeepSeek-R1-Distill (1.5B → 32B) and Qwen3 were used. For images, Tar-1.5B serves as the AR backbone.
  • Learnt Up-samplers: Composed of stacked masked convolutional ResNet blocks with pixel-shuffle, trained with mixed MSE, LPIPS, commitment and PatchGAN losses. Removal of adversarial or perceptual losses degrades output detail (Peruzzo et al., 8 Jan 2026).
  • Verifier: LLM-as-Judge (Qwen2.5-7B-Instruct) and embedding-based methods (all-mpnet-base-v2, threshold θ) were evaluated, the former providing the highest alignment accuracy for text (Fu et al., 24 Jun 2025).

6. Ablation Studies and Design Choices

  • Probabilistic Pooling: Pooling target probabilities over k-nearest codebook neighbors raises acceptance rates at high speedup, with most gains realized via the threshold τ.
  • Local Neighborhood Radius (ℓ): For images, ℓ = 3 yields the most robust correction rate for 512→1024p up-sampling, with larger radii leading to excessive resampling and smaller radii proving insufficient for correcting context.
  • Multi-Branch Drafting: Exploring width W > 1 in text increases per-step acceptance but introduces overhead growing as W^k. The optimal tradeoff in practice is at W ≈ 2 (Fu et al., 24 Jun 2025).

7. Limitations and Prospects for Extension

  • Step Segmentation: In text, step-level boundaries are detected via fixed delimiters (e.g., “\n\n”); adaptively learned segmentation could potentially improve draft–verifier alignment.
  • Verifier Trade-offs: Embedding-based and target-scoring verifiers are faster, but risk higher misalignment rates; the design of lightweight, high-precision judges is an open question.
  • Extension to Parallel Decoders: Application of MuLo-SD in combination with architectures such as ZipAR, or adaptation to 3D video or multi-modal decoding, remains an unexplored direction (Peruzzo et al., 8 Jan 2026).
  • Adaptive Parameterization: The possibility of learning the draft length γ, lookahead k, or local radius ℓ in an online fashion as a function of acceptance rates is an active area for further research.

MuLo-SD constitutes a unified, multi-scale, and locality-aware paradigm for AR model acceleration, establishing new state-of-the-art results in both language and vision, and providing a principled path forward for scalable, high-fidelity generative inference (Fu et al., 24 Jun 2025, Peruzzo et al., 8 Jan 2026).
