- The paper introduces DSD, a distributed speculative decoding framework that decouples token drafting on edge devices from verification on cloud LLMs, enhancing scalability.
- It presents an Adaptive Window Control (AWC) module, a learned residual MLP that balances throughput and latency across heterogeneous deployments.
- Experimental results show up to 9.7% throughput gains and notable TPOT reductions, confirming DSD’s practical impact for scalable LLM serving.
DSD: A Distributed Speculative Decoding Solution for Edge-Cloud Model Serving
Motivation and Problem Setting
The latency and scalability challenges inherent in LLM inference are particularly acute in heterogeneous edge-cloud deployments, where model size and device limitations frequently make centralized or node-localized serving inefficient. While speculative decoding (SD) enables partial speedup for token generation by parallelizing the candidate proposal and verification stages, existing SD approaches are limited to single-node or, at best, two-node setups, and fail to leverage the resource pool available in realistic distributed environments. The deployment of LLMs in interactive and concurrent-use cases demands scalable solutions that can orchestrate draft and target model execution across numerous edge and cloud nodes.
DSD Framework Overview
The paper introduces DSD—a distributed speculative decoding framework that generalizes conventional SD to multi-device edge-cloud topologies. In DSD, lightweight draft models (typically smaller LLMs) deployed on edge devices generate candidate tokens for a given prompt, whereas larger target models on cloud servers execute the verification stage. The system coordinates a many-to-many topology: N edge draft LLMs transmit draft tokens to M cloud-hosted target LLMs, which validate the sequences and provide acceptance signals for subsequent draft iterations.
This architecture explicitly decouples the compute-heavy verification stage from resource-constrained edge devices and flexibly enables speculative execution at scale. Compared to single-node SD, DSD introduces new performance bottlenecks—primarily due to network round-trip latency, distributed queueing, and batching effects—that profoundly impact throughput and responsiveness.
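The draft-verify contract described above can be sketched in a few lines. Here `draft_tokens`, `verify`, and the Bernoulli acceptance process are illustrative placeholders, not the paper's actual interfaces:

```python
import random

def draft_tokens(prompt, k):
    """Edge side: propose k candidate tokens (placeholder draft model)."""
    return [f"tok{i}" for i in range(k)]

def verify(prompt, candidates, accept_rate=0.7):
    """Cloud side: accept a prefix of the candidates, then emit one
    corrected token (the standard speculative-decoding contract).
    Acceptance is modeled as an independent coin flip per token."""
    accepted = []
    for tok in candidates:
        if random.random() < accept_rate:
            accepted.append(tok)
        else:
            break
    return accepted + ["corrected"]

def speculative_step(prompt, window=4):
    """One DSD iteration: edge drafts a window, cloud verifies it."""
    candidates = draft_tokens(prompt, window)
    return verify(prompt, candidates)

random.seed(0)
committed = speculative_step("Q: 2+2?", window=4)
```

Each iteration commits between 1 and window+1 tokens, which is where the speedup over token-by-token decoding comes from; in DSD the `verify` call additionally crosses the network, which motivates the RTT-aware policies discussed below.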
DSD-Sim: Simulation Infrastructure
Recognizing the lack of tools for quantitative evaluation of distributed SD, the authors propose DSD-Sim, a discrete-event simulation framework tailored for modeling and optimizing DSD deployments. Built on the validated VIDUR inference simulator, DSD-Sim supports:
- Per-request tracing capturing realistic acceptance sequences;
- Parameterization of device configurations, network RTT, and workload characteristics from benchmarks such as GSM8K, CNN/DailyMail, and HumanEval;
- Pluggable policy modules for routing, batching, and dynamic speculation window (token batch) sizing;
- Integration with empirical hardware latency profiles for both edge and cloud GPUs.
By supporting both fused (co-located) and distributed execution, DSD-Sim allows ablation and policy sweeps covering vast design spaces, with mean latency prediction errors under 8% against real hardware traces.
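A minimal discrete-event loop of the kind DSD-Sim builds on might look like the following sketch, assuming a single cloud target with FIFO queueing and fixed draft, verification, and network times; all names and constants are illustrative, not DSD-Sim's API:

```python
import heapq

def simulate(num_requests=5, rtt=0.05, verify_time=0.02, draft_time=0.01):
    """Minimal discrete-event loop: each request is drafted on an edge
    node, travels over the network (rtt/2 each way), and is verified on
    a single cloud target that serves one request at a time, FIFO."""
    events = []  # min-heap of (time, seq, kind, req_id)
    seq = 0
    for r in range(num_requests):
        # drafts finish at staggered times and head for the cloud
        heapq.heappush(events, (r * draft_time, seq, "arrive_cloud", r))
        seq += 1
    target_free_at = 0.0
    finish = {}
    while events:
        t, _, kind, req = heapq.heappop(events)
        if kind == "arrive_cloud":
            # uplink latency, then queue behind earlier verifications
            start = max(t + rtt / 2, target_free_at)
            target_free_at = start + verify_time
            heapq.heappush(events, (target_free_at + rtt / 2, seq, "done", req))
            seq += 1
        else:
            finish[req] = t  # downlink complete; record end-to-end time
    return finish

latencies = simulate()
```

Even this toy version reproduces the qualitative effect the paper highlights: once verification time exceeds the draft interval, queueing delay at the target, not compute alone, dominates per-request latency.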
Scheduling and Adaptive Window Control (AWC)
An essential optimization in DSD is dynamic speculation window sizing: the number of draft tokens sent per iteration must balance increased parallelism (large windows) against the risk of wasted computation due to sequence rejections and increased network costs.
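This trade-off can be made concrete with the classic speculative-decoding analysis: if each draft token is accepted independently with probability alpha, a window of k tokens commits (1 - alpha^(k+1)) / (1 - alpha) tokens per verification round in expectation. The sketch below uses that standard model as an assumption; the paper's own cost model may differ:

```python
def expected_tokens(alpha, k):
    """Expected tokens committed per verification round under the
    classic geometric-acceptance model of speculative decoding
    (an assumption here, not necessarily the paper's exact model)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def tokens_per_second(alpha, k, rtt, verify_time):
    """Throughput proxy: committed tokens divided by one network
    round trip plus verification time. Larger k helps only while
    alpha is high and rtt does not dominate the denominator."""
    return expected_tokens(alpha, k) / (rtt + verify_time)

# Diminishing returns: at alpha = 0.7, quadrupling the window from
# k = 4 to k = 16 adds less than 0.6 expected tokens per round.
gain = expected_tokens(0.7, 16) - expected_tokens(0.7, 4)
```

The saturation of `expected_tokens` in k, against network and compute costs that grow with window size, is exactly the balance that AWC learns to strike dynamically.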
The paper introduces Adaptive Window Control (AWC), a learned residual-MLP module that takes recent system metrics as input, including target queue utilization, observed acceptance rates, RTT statistics, and TPOT (time per output token). The model is trained offline on more than 2,000 configuration sweeps, targeting throughput and latency SLOs.
AWC incorporates three stabilization strategies during inference:
- Output clamping to a valid range;
- Exponential smoothing over successive window sizes;
- Hysteresis between distributed and fused execution modes to avoid oscillatory switching.
Importantly, AWC enables near-optimal throughput/latency trade-off under time-varying system conditions, outperforming fixed or threshold-based window policies with negligible computational burden.
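A minimal sketch of the three stabilizers, with the residual MLP replaced by a placeholder `raw_predict` input and all thresholds and constants chosen for illustration rather than taken from the paper:

```python
class WindowController:
    """Wraps a learned window predictor with the three stabilizers the
    paper describes: output clamping, exponential smoothing, and
    hysteresis on the distributed/fused mode switch."""

    def __init__(self, k_min=1, k_max=16, alpha=0.3,
                 to_fused=0.8, to_distributed=0.5):
        self.k_min, self.k_max = k_min, k_max
        self.alpha = alpha        # EMA smoothing factor
        self.ema = None
        self.mode = "distributed"
        # separated thresholds on the rtt/compute ratio form the
        # hysteresis band that prevents oscillatory mode switching
        self.to_fused, self.to_distributed = to_fused, to_distributed

    def step(self, raw_predict, rtt_ratio):
        # 1) clamp the raw model output to the valid window range
        k = max(self.k_min, min(self.k_max, raw_predict))
        # 2) exponentially smooth successive window sizes
        self.ema = k if self.ema is None else (
            self.alpha * k + (1 - self.alpha) * self.ema)
        # 3) hysteresis: switch modes only past separated thresholds
        if self.mode == "distributed" and rtt_ratio > self.to_fused:
            self.mode = "fused"
        elif self.mode == "fused" and rtt_ratio < self.to_distributed:
            self.mode = "distributed"
        return round(self.ema), self.mode
```

A noisy predictor output of 40 gets clamped to 16, subsequent window sizes change gradually rather than jumping, and a brief RTT spike into the 0.5-0.8 band leaves the current mode unchanged.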
Experimental Results
Comprehensive evaluations across synthetic and trace-based multi-node deployments demonstrate that DSD, combined with AWC and length-aware batching, yields:
- Throughput increases up to 9.7% (GSM8K, 600 edge drafts, low RTT) over static window policies;
- TPOT reductions exceeding 10% in several workloads and configurations;
- TTFT improvements of up to 4% without sacrificing output quality or incurring additional manual tuning;
- Robustness to network RTT variation, with policy switches from distributed to fused mode when communication delay becomes dominant.
JSQ (Join-Shortest-Queue) routing outperforms round-robin and random strategies under moderate load but saturates at high utilization, making adaptive routing an important component of scalable serving.
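JSQ itself is simple to state. A minimal sketch, assuming `queues[i]` holds the pending-verification count of target i (a representation chosen here for illustration):

```python
def route_jsq(queues):
    """Join-Shortest-Queue: route to the target with the fewest
    pending verifications, breaking ties by lowest index."""
    return min(range(len(queues)), key=lambda i: queues[i])

def route_round_robin(counter, n):
    """Round-robin baseline: cycle through targets by request count."""
    return counter % n
```

The contrast with round-robin is that JSQ reacts to instantaneous load imbalance, which is why it wins under moderate load; at high utilization all queues are long and roughly equal, so the advantage vanishes.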
Ablation studies reveal that length-aware batching yields consistent TPOT reductions (1–2 ms), but the throughput ceiling remains dictated by aggregate compute resources.
Theoretical and Practical Implications
DSD's architecture is the first SD system to operationalize distributed speculative decoding for LLM inference with decentralized draft generation. By formalizing the scheduling, window control, and queueing challenges inherent in this setting, the work expands the system design space for LLM serving. DSD demonstrates that, with appropriate simulation, policy learning, and orchestration, edge devices can collaboratively accelerate LLM inference without centralized bottlenecks.
Practically, DSD is applicable to both latency-sensitive applications (e.g., tutoring, chatbots, code generation) and high-concurrency workloads, where static SD fails to scale. DSD-Sim provides a platform for future research on queue-based, network-aware, and adaptive load balancing schemes for model serving.
Future Directions
Potential research extensions include:
- Incorporation of multi-hop or user-to-edge-to-cloud topologies to further amortize verification delay;
- Joint optimization with quantization, early-exit, or Mixture-of-Experts draft mechanisms for maximizing acceptance rates;
- Integration into production-grade inference platforms for benchmarking at scale;
- Co-optimization of SD with novel LLM architectures to maximize useful speculation under non-Markovian acceptance patterns.
Conclusion
DSD establishes a foundational framework for distributed speculative decoding in LLM serving, coupling a principled simulation-based design with learned adaptive control of token speculation. The system consistently improves throughput and responsiveness over single-node SD and heuristic scheduling schemes, effectively enabling agile large-model serving across heterogeneous edge-cloud infrastructure. The work opens new avenues for both theoretical refinement of speculative methods and practical deployment of efficient, scalable LLM APIs in production contexts.
Reference: "DSD: A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving" (2511.21669)