
SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications

Published 7 Nov 2024 in cs.CL, cs.AI, cs.DC, and cs.LG | arXiv:2411.04975v2

Abstract: Speculative decoding is widely adopted to reduce latency in LLM inference by leveraging smaller draft models capable of handling diverse user tasks. However, emerging AI applications, such as LLM-based agents, present unique workload characteristics: instead of diverse independent requests, agentic frameworks typically submit repetitive inference requests, such as multi-agent pipelines performing similar subtasks or self-refinement loops iteratively enhancing outputs. These workloads result in long and highly predictable sequences, which current speculative decoding methods do not effectively exploit. To address this gap, we introduce \emph{SuffixDecoding}, a novel method that utilizes efficient suffix trees to cache long token sequences from prompts and previous outputs. By adaptively speculating more tokens when acceptance likelihood is high and fewer when it is low, SuffixDecoding effectively exploits opportunities for longer speculations while conserving computation when those opportunities are limited. Evaluations on agentic benchmarks, including SWE-Bench and Text-to-SQL, demonstrate that SuffixDecoding achieves speedups of up to 5.3$\times$, outperforming state-of-the-art methods -- 2.8$\times$ faster than model-based approaches like EAGLE-2/3 and 1.9$\times$ faster than model-free approaches such as Token Recycling. SuffixDecoding is open-sourced at https://github.com/snowflakedb/ArcticInference.

Summary

  • The paper introduces SuffixDecoding, which uses suffix trees for speculative decoding to accelerate LLM inference without model-based overhead.
  • It demonstrates up to 2.9× throughput improvement and 3× lower latency in tasks like code generation and text-to-SQL compared to existing methods.
  • The approach leverages CPU-efficient data structures, making it a practical alternative in environments where GPU resources are limited.

SuffixDecoding: A Model-Free Approach to Speeding Up LLM Inference

The paper presents SuffixDecoding, a model-free approach that accelerates LLM inference through speculative decoding. Unlike traditional methods that rely on draft models or additional decoding heads, SuffixDecoding builds efficient data structures, specifically suffix trees, over prompts and previously generated outputs, and uses them to predict candidate sequences. By recognizing patterns in prior text, it constructs speculation trees and applies an empirically grounded scoring procedure to decide which token sequences to propose for verification by the LLM.
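The verification step follows the standard greedy speculative-decoding recipe, which can be sketched as below. This is a generic illustration, not SuffixDecoding-specific code: `verified_argmax` is assumed to hold the target model's greedy next-token choice at each speculated position, obtained from a single batched forward pass.

```python
def accept_speculated(speculated, verified_argmax):
    """Greedy acceptance rule for speculative decoding (generic sketch).
    The target LLM scores every speculated position in one forward pass;
    `verified_argmax[i]` is its greedy choice at position i. We keep the
    longest agreeing prefix, plus the one corrected token the model
    yields at the first mismatch."""
    accepted = []
    for spec_tok, model_tok in zip(speculated, verified_argmax):
        if spec_tok != model_tok:
            accepted.append(model_tok)  # "free" token from the target model
            break
        accepted.append(spec_tok)
    return accepted
```

Because verification is a single parallel pass, every accepted token beyond the first is effectively free latency-wise, which is why longer correct speculations translate directly into speedup.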

Technical Approach

SuffixDecoding constructs and incrementally updates suffix trees over generated token sequences to model the likelihood of future continuations. The approach runs entirely on CPU rather than GPU, which is advantageous because CPU resources are typically underutilized on LLM serving nodes. The suffix trees store tokens from previously generated sequences, capturing shared prefixes in a compact structure: each node represents a token, and a path through the tree represents a possible continuation during LLM inference.
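As a rough illustration, the structure can be sketched as a token-level trie over all suffixes of past outputs, with per-node frequency counts. This is a simplification for exposition; the paper's implementation is more compact and efficient, and the `max_depth` cap here is an assumption of this sketch.

```python
class SuffixTreeNode:
    """One node per token; `count` tracks how often this path was seen."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}  # token -> SuffixTreeNode
        self.count = 0


class SuffixTree:
    """Token-level suffix trie with frequency counts (minimal sketch,
    not the paper's optimized implementation)."""

    def __init__(self, max_depth=64):
        self.root = SuffixTreeNode()
        self.max_depth = max_depth

    def insert(self, tokens):
        # Index every suffix of the sequence (capped at max_depth tokens)
        # so any recent context can later be matched as a prefix.
        for start in range(len(tokens)):
            node = self.root
            for tok in tokens[start:start + self.max_depth]:
                node = node.children.setdefault(tok, SuffixTreeNode())
                node.count += 1

    def lookup(self, pattern):
        # Walk the trie along `pattern`; the returned node's subtree holds
        # every continuation observed after that pattern, or None if unseen.
        node = self.root
        for tok in pattern:
            node = node.children.get(tok)
            if node is None:
                return None
        return node
```

Matching the tail of the current context with `lookup` then yields a subtree whose branches are candidate continuations, weighted by how often each occurred in past generations.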

Once the suffix tree is built, SuffixDecoding uses a greedy algorithm to expand a speculation tree, scoring candidate continuations by their empirical frequency and selecting the most promising ones. The resulting tree structure enables efficient speculation, with all candidate sequences verified by the LLM in parallel.
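The greedy expansion step can be sketched as follows. The nested-dict representation of the matched subtree and the use of raw counts as the score are illustrative assumptions of this sketch, standing in for the paper's actual data structures and scoring function.

```python
import heapq

def build_speculation_tree(continuations, budget=8):
    """Greedily grow a speculation tree from the highest-frequency
    continuations (illustrative sketch). `continuations` maps
    token -> {"count": int, "children": {...}} for the subtree found
    under the matched prefix. Returns a list of (parent_index, token)
    pairs, where parent index -1 denotes the matched prefix itself."""
    tie = 0        # unique tiebreaker so the heap never compares dicts
    frontier = []  # max-heap over candidate nodes, keyed by -count
    for tok, info in continuations.items():
        frontier.append((-info["count"], tie, -1, tok, info))
        tie += 1
    heapq.heapify(frontier)

    tree = []
    while frontier and len(tree) < budget:
        _, _, parent, tok, info = heapq.heappop(frontier)
        tree.append((parent, tok))  # accept this token into the tree
        idx = len(tree) - 1
        for ntok, ninfo in info["children"].items():
            tie += 1
            heapq.heappush(frontier, (-ninfo["count"], tie, idx, ntok, ninfo))
    return tree
```

Because the budget bounds the tree size, speculation stays cheap when the tree offers few high-frequency continuations, and expands aggressively when past outputs make long continuations likely, matching the adaptive behavior described in the abstract.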

Evaluation

Empirical evaluations demonstrate that SuffixDecoding achieves competitive performance with state-of-the-art model-based speculative methods across multiple workloads, including open-domain chat, code generation, and text-to-SQL. The improvement is particularly notable in multi-agent pipeline applications: on a proprietary multi-LLM text-to-SQL pipeline, AgenticSQL, SuffixDecoding achieves up to 2.9× higher throughput and up to 3× lower latency than existing speculative decoding methods.

Furthermore, on datasets such as Magicoder and WildChat, SuffixDecoding matches tree-based speculative decoding techniques and in some cases surpasses them, without the overhead associated with draft models. Notably, it achieves these results with only a few thousand examples in its reference corpus, underscoring its efficiency and practicality.

Implications and Future Work

The implications of SuffixDecoding's introduction are multifaceted. Practically, it offers a scalable and more resource-efficient option for speeding up inference, which is particularly beneficial in environments where GPU resources are constrained or where rapid model updates are frequent. Theoretically, it paves the way for further exploration into model-free speculative inference techniques, potentially leading to algorithms that adapt even more dynamically to real-world applications.

Future developments might focus on enhancing SuffixDecoding's pattern-matching and scoring mechanisms. Although the method has been shown to adapt well to distributional shifts in its inputs, there is still room to optimize how it prioritizes and scores speculative paths, for example by incorporating richer statistical models.

In conclusion, SuffixDecoding marks a significant advance in leveraging data-driven techniques for LLM inference, minimizing dependence on resource-intensive draft models while providing robust performance across a diverse array of tasks. Continued research in model-free approaches may yield further innovations, expanding the applicability and efficiency of LLMs in practical settings.
