
Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass

Published 28 May 2024 in cs.CL and cs.LG | (2405.18400v6)

Abstract: Many applications today provide users with multiple auto-complete drafts as they type, including GitHub's code completion, Gmail's smart compose, and Apple's messaging auto-suggestions. Under the hood, LLMs support this by running an autoregressive inference pass to provide a draft. Consequently, providing $k$ drafts to the user requires running an expensive LLM $k$ times. To alleviate the computation cost of running $k$ inference passes, we propose Superposed Decoding, a new decoding algorithm that generates $k$ drafts at the computation cost of one autoregressive inference pass. We achieve this by feeding a superposition of the most recent token embeddings from the $k$ drafts as input to the next decoding step of the LLM. At every inference step we combine the $k$ drafts with the top-$k$ tokens to get $k^2$ new drafts and cache the $k$ most likely options, using an n-gram interpolation with minimal compute overhead to filter out incoherent generations. Our experiments show that $k$ drafts from Superposed Decoding are at least as coherent and factual as Nucleus Sampling and Greedy Decoding respectively, while being at least $2.44\times$ faster for $k\ge3$. In a compute-normalized setting, user evaluations demonstrably favor text generated by Superposed Decoding over Nucleus Sampling. Superposed Decoding can also be combined with other decoding strategies, resulting in universal coverage gains when scaling inference time compute. Code and more examples open-sourced at https://github.com/RAIVNLab/SuperposedDecoding.


Summary

  • The paper demonstrates SPD, which produces k drafts in a single autoregressive pass, significantly lowering computational costs.
  • SPD leverages token superposition and n-gram interpolation to ensure coherent and factually accurate text generation.
  • Experimental results show SPD is at least 2.44 times faster than Nucleus Sampling for $k\ge3$ and is preferred by users in compute-normalized evaluations.

The paper "Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass" presents a decoding algorithm that addresses the computational inefficiency of generating multiple drafts with autoregressive LMs. The proposed method, Superposed Decoding (SPD), generates multiple ($k$) drafts at the computational expense of a single inference pass.

Introduction and Motivation

Applications like GitHub's code completion and Gmail's Smart Compose rely heavily on providing multiple auto-complete suggestions, typically driven by LMs. Each additional suggestion often means additional computation, making these systems resource-intensive. This paper introduces SPD as a way to generate $k$ drafts with a single autoregressive inference pass by leveraging a superposition of token embeddings.

Methodology

SPD functions by feeding the LM a superposition of the most recent token embeddings from the $k$ drafts as input during each decoding step, so only one autoregressive inference pass is required, significantly reducing the computational load. At each timestep, SPD combines the $k$ drafts with the top-$k$ next tokens to form $k^2$ candidate drafts, caches the $k$ most likely, and filters out incoherent continuations with an n-gram interpolation mechanism, thereby enhancing coherence and factual accuracy.
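The per-step procedure can be sketched in NumPy. This is an illustrative simplification, not the authors' implementation: the `lm_logits_fn` interface, the probability-weighted superposition, and the use of one shared next-token distribution for all drafts are assumptions of this sketch.

```python
import numpy as np

def superposed_step(embed, lm_logits_fn, drafts, scores, k=3):
    """One Superposed Decoding step (illustrative sketch).

    embed        : (vocab, dim) token-embedding matrix
    lm_logits_fn : maps one input embedding -> next-token logits
    drafts       : k drafts, each a list of token ids
    scores       : cumulative log-probability of each draft (shape (k,))
    """
    # Superpose the most recent token embeddings of the k drafts,
    # weighting each draft by its normalized probability.
    w = np.exp(scores - scores.max())
    w /= w.sum()
    x = sum(wi * embed[d[-1]] for wi, d in zip(w, drafts))

    # A single forward pass on the superposed input embedding.
    logits = lm_logits_fn(x)
    logprobs = logits - (np.log(np.exp(logits - logits.max()).sum()) + logits.max())

    # Expand each draft with the top-k next tokens (k^2 candidates),
    # then cache only the k most likely drafts.
    top = np.argsort(logprobs)[-k:]
    cands = [(scores[i] + logprobs[t], drafts[i] + [int(t)])
             for i in range(k) for t in top]
    cands.sort(key=lambda c: c[0], reverse=True)
    return [d for _, d in cands[:k]], np.array([s for s, _ in cands[:k]])
```

With a toy linear stand-in for `lm_logits_fn`, each call extends all $k$ drafts by one token while invoking the model only once, which is the source of SPD's cost savings.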

Key components of the SPD methodology include:

  • Token Superposition: The embeddings of the most recent tokens from the $k$ drafts are superposed (a weighted combination) to form a single input embedding for the next decoding step.
  • N-Gram Interpolation: Draft-specific next-token distributions are smoothed with a set of interpolated n-gram models, keeping the generated text coherent and contextually appropriate.
  • Linearity in Token Embeddings: SPD exploits the approximate linearity of token representations in LMs, which lets a single superposed embedding stand in for the $k$ separate drafts and approximate beam-search dynamics.
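The n-gram interpolation component can be sketched as follows. The mixing weights `lambdas` and the uniform fallback for unseen contexts are illustrative choices, not the paper's exact smoothing scheme:

```python
import numpy as np
from collections import defaultdict

def build_ngram_counts(corpus, n):
    """Count n-gram continuations: (n-1)-token context -> next-token counts."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(corpus) - n + 1):
        ctx, nxt = tuple(corpus[i:i + n - 1]), corpus[i + n - 1]
        counts[ctx][nxt] += 1
    return counts

def ngram_probs(counts, context, n, vocab):
    """Next-token distribution from an n-gram table (uniform if unseen context)."""
    ctx = tuple(context[-(n - 1):])
    if ctx not in counts:
        return np.full(vocab, 1.0 / vocab)
    total = sum(counts[ctx].values())
    dist = np.zeros(vocab)
    for t, c in counts[ctx].items():
        dist[t] = c / total
    return dist

def interpolate(lm_dist, draft, tables, lambdas, vocab):
    """Smooth the LM's next-token distribution with n-gram models of several orders."""
    mix = (1.0 - sum(lambdas)) * lm_dist
    for (n, counts), lam in zip(tables, lambdas):
        mix += lam * ngram_probs(counts, draft, n, vocab)
    return mix
```

Higher-order tables sharpen the distribution when their context has been seen in the reference corpus, while the LM term keeps probability mass on tokens the n-gram tables miss.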

Experimental Results

Generation Quality: The authors evaluate SPD on the OpenWebText dataset, showing that SPD generates coherent drafts with an average best perplexity approximately 5% lower than Nucleus Sampling when normalized for computation. This indicates that at least one SPD draft is as coherent as those produced by traditional methods, while the remaining drafts come at no additional computational cost.

Factuality: On fact-based benchmarks such as TriviaQA and Natural Questions, SPD demonstrates higher precision (P@$k$) compared to Nucleus Sampling and Beam Search, with improvements of up to 2.72% and 1.69% respectively when generating multiple drafts. This highlights the enhanced factual accuracy and coverage provided by SPD.

Latency: The study shows SPD's computational efficiency, being at least 2.44 times faster than Nucleus Sampling when generating three or more drafts, and scaling more favorably as the number of drafts increases. This significant reduction in latency underscores the practical advantages of SPD in real-time applications.

Human Evaluation

A user study conducted via Amazon Mechanical Turk shows that human evaluators prefer SPD drafts over those generated by Nucleus Sampling approximately 63.6% of the time. This preference holds across varied settings, including 1v1 and 2v3 comparisons, indicating that SPD's diversity and coherence are well received by users.

Discussion

While the paper establishes SPD as an efficient method for draft generation, it also acknowledges limitations related to the quality of the n-gram models used and the semantic diversity of the drafts. Future work could explore integrating orthogonality in token embeddings to address the diversity aspect more comprehensively.

Conclusion

The introduction of Superposed Decoding represents a significant advancement in decoding methodologies for autoregressive LMs. By generating multiple drafts from a single inference pass, SPD offers substantial improvements in computational efficiency, coherence, and factual accuracy. The results presented in the paper are promising for applications where generating multiple high-quality suggestions is critical, such as messaging and code-completion tools. Future developments could further enhance SPD's applicability, making it a versatile tool in the ongoing evolution of LLMs.
