
Distributed Speculative Inference (DSI): Speculation Parallelism for Provably Faster Lossless Language Model Inference

Published 23 May 2024 in cs.DC, cs.AI, cs.CL, and cs.LG | (2405.14105v5)

Abstract: This paper introduces distributed speculative inference (DSI), a novel inference algorithm that is provably faster than speculative inference (SI) [leviathan2023, chen2023, miao2024, sun2025, timor2025] and standard autoregressive inference (non-SI). Like other SI algorithms, DSI operates on frozen LMs, requiring no training or architectural modifications, and it preserves the target distribution. Prior studies on SI have demonstrated empirical speedups over non-SI, but rely on sufficiently fast and accurate drafters, which are often unavailable in practice. We identify a gap where SI can be slower than non-SI if drafters are too slow or inaccurate. We close this gap by proving that DSI is faster than both SI and non-SI, given any drafters. DSI is therefore not only faster than SI, but also unlocks the acceleration of LMs for which SI fails. DSI leverages speculation parallelism (SP), a novel type of task parallelism, to orchestrate target and drafter instances that overlap in time, establishing a new foundational tradeoff between computational resources and latency. Our simulations show that DSI is 1.29-1.92x faster than SI in single-node setups for various off-the-shelf LMs and tasks. We open-source all our code.


Summary

  • The paper introduces Distributed Speculative Inference (DSI), a method that parallelizes drafting and verification across multiple processors to speed up LLM inference by 1.29-1.92×.
  • The approach leverages concurrent verification: faster, less accurate drafters propose tokens while the target model checks them, so output quality is preserved.
  • The method demonstrates practical benefits in real-time applications such as algorithmic trading and autonomous driving.

Distributed Speculative Inference

Introduction

In the ongoing challenge of accelerating LLMs, researchers have introduced a method known as Distributed Speculative Inference (DSI). This technique is provably faster than traditional speculative inference (SI) and standard autoregressive inference. Let's break down what this means and how it can potentially improve the performance of LLMs in data-heavy, real-time applications.

The Problem with Existing Methods

LLMs like GPT-4, while incredibly powerful, are often too slow to be used efficiently in settings where rapid response times are essential, such as algorithmic trading or autonomous driving. Efforts to speed up these models generally fall into two camps:

  1. Algorithmic innovations: These involve compressing LLMs through methods like pruning, quantization, and knowledge distillation. However, these usually degrade the quality of the model outputs.
  2. System optimizations: These include improving hardware utilization through tensor parallelism and kernel optimizations.

While previous research on speculative inference has shown promise by using faster, approximating models (called "drafters") to predict parts of the outputs, there's a significant catch: it only works well when the drafter is both extremely fast and highly accurate. This limitation makes it less reliable in many real-world scenarios.
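To make the baseline concrete, here is a minimal sketch of the classic draft-then-verify SI loop. It is illustrative only, not any paper's implementation: `drafter` and `target` are hypothetical toy lookup tables standing in for a fast drafter LM and the exact target LM, and exact-match acceptance stands in for the probabilistic acceptance rule that real SI uses to preserve the target distribution.

```python
# Toy stand-ins (hypothetical) for a fast-but-imperfect drafter LM
# and a slow-but-exact target LM: each maps a context to a next token.
def drafter(ctx):
    return {(): "the", ("the",): "cat", ("the", "cat"): "sat"}.get(tuple(ctx), "<eos>")

def target(ctx):
    return {(): "the", ("the",): "cat", ("the", "cat"): "ran"}.get(tuple(ctx), "<eos>")

def speculative_step(ctx, k=3):
    """One SI step: draft k tokens cheaply, then verify them with the target.

    Returns the accepted tokens: the drafted prefix that matches the target,
    plus one token from the target itself."""
    drafts = []
    for _ in range(k):
        drafts.append(drafter(ctx + drafts))
    accepted = []
    for tok in drafts:
        expected = target(ctx + accepted)
        if tok == expected:
            accepted.append(tok)       # draft matches: keep it for free
        else:
            accepted.append(expected)  # mismatch: take the target's token, stop
            break
    else:
        accepted.append(target(ctx + accepted))  # all drafts accepted: bonus token
    return accepted

print(speculative_step([]))  # → ['the', 'cat', 'ran']
```

Note the catch the paragraph above describes: if the drafter rarely matches the target, nearly every step degenerates into one target call per token, so SI can be no faster, or even slower, than plain autoregressive decoding.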

Enter Distributed Speculative Inference

Distributed Speculative Inference (DSI) addresses these limitations by using multiple processors to divide the workload. By orchestrating several instances of both the target LLM and the drafters, DSI ensures faster inference times, even when the drafter models are not exceptionally fast or accurate.

How Does DSI Work?

DSI enhances the traditional SI method by distributing the task across multiple processors (like GPUs). Here’s a simplified breakdown:

  1. Initialization: Multiple threads (acting as drafters) predict possible next tokens based on the current input.
  2. Concurrent Verification: The target model verifies these predicted tokens. If a thread's prediction matches the target model's output, that thread continues predicting subsequent tokens; threads whose predictions are rejected are terminated.
  3. Efficiency through Parallelism: By running multiple processes in parallel, DSI efficiently narrows down to the correct sequence of tokens, verified by the target model.
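The steps above can be sketched in Python. This is a minimal, illustrative sketch of the speculation-parallelism idea, not the paper's implementation: `drafter` and `target` are hypothetical toy lookup tables, greedy token matching stands in for the lossless acceptance rule, and a thread pool stands in for separate target instances on multiple GPUs. The key point shown is that the target verifications of all drafted prefixes are launched concurrently rather than one after another.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy stand-ins for a fast drafter LM and the exact target LM.
def drafter(ctx):
    return {(): "the", ("the",): "cat", ("the", "cat"): "sat"}.get(tuple(ctx), "<eos>")

def target(ctx):
    return {(): "the", ("the",): "cat", ("the", "cat"): "ran"}.get(tuple(ctx), "<eos>")

def dsi_step(ctx, k=3, workers=4):
    """One DSI-style step: draft k tokens, then verify every drafted
    prefix with the target concurrently (speculation parallelism)."""
    drafts = []
    for _ in range(k):
        drafts.append(drafter(ctx + drafts))
    # Each worker plays the role of a target instance verifying one prefix;
    # in real DSI these verifications overlap in time across processors.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        expected = list(pool.map(target, [ctx + drafts[:i] for i in range(k)]))
    accepted = []
    for tok, exp in zip(drafts, expected):
        if tok == exp:
            accepted.append(tok)   # draft verified: keep it
        else:
            accepted.append(exp)   # first mismatch: target's token wins, stop
            break
    else:
        accepted.append(target(ctx + accepted))  # all drafts verified: bonus token
    return accepted

print(dsi_step([]))  # → ['the', 'cat', 'ran']
```

Because all verifications run concurrently, the latency of a step is governed by a single target forward pass rather than a chain of them, which is why DSI stays fast even when the drafter is slow or inaccurate, at the cost of extra processors.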

Key Findings

Researchers conducted extensive simulations and experiments using various target/drafter pairs, demonstrating significant speedups. Here are some notable results:

  • DSI showed a 1.29-1.92x speedup over conventional SI across various tasks and models.
  • Even with slower and less accurate drafters, DSI maintained better performance than both SI and non-SI methods.
  • Using multiple GPUs, DSI consistently outperformed SI, making it a more robust choice for real-time applications.

Implications and Future Developments

Practical Implications: DSI offers a practical way to accelerate LLMs in real-time environments. It allows for the use of less optimized drafter models and still achieves significant speedups, making it a versatile and robust choice for various industrial applications, from finance to transportation.

Theoretical Implications: On a theoretical level, DSI challenges the assumption that drafter models need to be remarkably fast and accurate. By effectively utilizing multiple processors, it opens new avenues for more efficient computational methods in AI.

Final Thoughts

Distributed Speculative Inference represents a significant improvement in the efficient use of LLMs. By leveraging multiple processors, DSI not only surpasses traditional SI methods in speed but also broadens the scope of practical applications where LLMs can be deployed. Future research might explore optimizing the number of processors needed or investigating even more efficient parallel processing algorithms.

While DSI does require careful consideration of the computational resources, its promise for significantly reducing inference times without compromising on the output quality makes it a compelling advancement in the field of AI.
