Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass
Abstract: Many applications today provide users with multiple auto-complete drafts as they type, including GitHub's code completion, Gmail's smart compose, and Apple's messaging auto-suggestions. Under the hood, LLMs support this by running an autoregressive inference pass to provide a draft. Consequently, providing $k$ drafts to the user requires running an expensive LLM $k$ times. To alleviate the computational cost of running $k$ inference passes, we propose Superposed Decoding, a new decoding algorithm that generates $k$ drafts at the computational cost of one autoregressive inference pass. We achieve this by feeding a superposition of the most recent token embeddings from the $k$ drafts as input to the next decoding step of the LLM. At every inference step we combine the $k$ drafts with the top-$k$ tokens to get $k^2$ new drafts and cache the $k$ most likely options, using an n-gram interpolation with minimal compute overhead to filter out incoherent generations. Our experiments show that $k$ drafts from Superposed Decoding are at least as coherent and factual as Nucleus Sampling and Greedy Decoding respectively, while being at least $2.44\times$ faster for $k\ge3$. In a compute-normalized setting, user evaluations demonstrably favor text generated by Superposed Decoding over Nucleus Sampling. Superposed Decoding can also be combined with other decoding strategies, resulting in universal coverage gains when scaling inference-time compute. Code and more examples are open-sourced at https://github.com/RAIVNLab/SuperposedDecoding.
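The core loop described above can be sketched in a few lines. The following is a minimal, illustrative implementation, not the authors' code: it omits the n-gram interpolation filter, uses uniform superposition weights, and stands in for the LLM with any function mapping an embedding sequence to next-token logits (the `toy_lm` below is a random linear head used purely for demonstration). Each step feeds a single superposed embedding forward, expands the $k$ drafts with the top-$k$ tokens into $k^2$ candidates, and keeps the $k$ highest-scoring ones.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def superposed_decode(embed, lm_logits, prompt_ids, k=3, steps=5):
    """Sketch of Superposed Decoding: k drafts at ~one inference pass per step.

    embed:     (vocab, d) token-embedding table
    lm_logits: maps an embedding sequence (T, d) -> (vocab,) next-token logits
    """
    prefix = embed[prompt_ids]                       # shared prompt embeddings
    logp = log_softmax(lm_logits(prefix))
    top = np.argsort(logp)[-k:][::-1]                # initialize k drafts with top-k tokens
    drafts = [list(prompt_ids) + [int(t)] for t in top]
    scores = np.array([logp[t] for t in top])        # cumulative log-probabilities
    weights = np.full(k, 1.0 / k)                    # uniform superposition weights

    for _ in range(steps - 1):
        # Superpose the k drafts' most recent token embeddings into ONE input token.
        last = np.stack([embed[d[-1]] for d in drafts])   # (k, d)
        sup = weights @ last                              # (d,)
        prefix = np.vstack([prefix, sup[None]])           # one forward input per step
        logp = log_softmax(lm_logits(prefix))

        # Expand each draft with the top-k tokens -> k*k candidates, keep the best k.
        top = np.argsort(logp)[-k:]
        cand = sorted(((scores[i] + logp[t], i, int(t))
                       for i in range(k) for t in top), reverse=True)
        drafts = [drafts[i] + [t] for _, i, t in cand[:k]]
        scores = np.array([s for s, _, _ in cand[:k]])
    return drafts, scores

# Toy demonstration with a random embedding table and linear "language model".
rng = np.random.default_rng(0)
V, d = 50, 8
E = rng.standard_normal((V, d))
W = rng.standard_normal((d, V))
toy_lm = lambda x: x[-1] @ W                         # logits from the last position
drafts, scores = superposed_decode(E, toy_lm, prompt_ids=[1, 2], k=3, steps=4)
```

Because every step runs the model once on a single superposed input rather than $k$ times, the per-step cost matches one autoregressive pass; the paper's $2.44\times$ speedup for $k\ge3$ follows from amortizing this shared forward pass.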