Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass
Abstract: Many applications today provide users with multiple auto-complete drafts as they type, including GitHub's code completion, Gmail's smart compose, and Apple's messaging auto-suggestions. Under the hood, LLMs support this by running an autoregressive inference pass to provide a draft. Consequently, providing $k$ drafts to the user requires running an expensive LLM $k$ times. To alleviate the computational cost of running $k$ inference passes, we propose Superposed Decoding, a new decoding algorithm that generates $k$ drafts at the computational cost of one autoregressive inference pass. We achieve this by feeding a superposition of the most recent token embeddings from the $k$ drafts as input to the next decoding step of the LLM. At every inference step we combine the $k$ drafts with the top-$k$ tokens to get $k^2$ new drafts and cache the $k$ most likely options, using an n-gram interpolation with minimal compute overhead to filter out incoherent generations. Our experiments show that $k$ drafts from Superposed Decoding are at least as coherent and factual as Nucleus Sampling and Greedy Decoding respectively, while being at least $2.44\times$ faster for $k\ge3$. In a compute-normalized setting, user evaluations demonstrably favor text generated by Superposed Decoding over Nucleus Sampling. Superposed Decoding can also be combined with other decoding strategies, resulting in universal coverage gains when scaling inference-time compute. Code and more examples are open-sourced at https://github.com/RAIVNLab/SuperposedDecoding.
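The core loop described above can be sketched in a few lines. The following is a minimal, illustrative implementation, not the authors' code: it omits the n-gram interpolation filter, uses uniform superposition weights, and stands in for the LLM with any function mapping an embedding sequence to next-token logits (the `toy_lm` below is a random linear head used purely for demonstration). Each step feeds a single superposed embedding forward, expands the $k$ drafts with the top-$k$ tokens into $k^2$ candidates, and keeps the $k$ highest-scoring ones.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def superposed_decode(embed, lm_logits, prompt_ids, k=3, steps=5):
    """Sketch of Superposed Decoding: k drafts at ~one inference pass per step.

    embed:     (vocab, d) token-embedding table
    lm_logits: maps an embedding sequence (T, d) -> (vocab,) next-token logits
    """
    prefix = embed[prompt_ids]                       # shared prompt embeddings
    logp = log_softmax(lm_logits(prefix))
    top = np.argsort(logp)[-k:][::-1]                # initialize k drafts with top-k tokens
    drafts = [list(prompt_ids) + [int(t)] for t in top]
    scores = np.array([logp[t] for t in top])        # cumulative log-probabilities
    weights = np.full(k, 1.0 / k)                    # uniform superposition weights

    for _ in range(steps - 1):
        # Superpose the k drafts' most recent token embeddings into ONE input token.
        last = np.stack([embed[d[-1]] for d in drafts])   # (k, d)
        sup = weights @ last                              # (d,)
        prefix = np.vstack([prefix, sup[None]])           # one forward input per step
        logp = log_softmax(lm_logits(prefix))

        # Expand each draft with the top-k tokens -> k*k candidates, keep the best k.
        top = np.argsort(logp)[-k:]
        cand = sorted(((scores[i] + logp[t], i, int(t))
                       for i in range(k) for t in top), reverse=True)
        drafts = [drafts[i] + [t] for _, i, t in cand[:k]]
        scores = np.array([s for s, _, _ in cand[:k]])
    return drafts, scores

# Toy demonstration with a random embedding table and linear "language model".
rng = np.random.default_rng(0)
V, d = 50, 8
E = rng.standard_normal((V, d))
W = rng.standard_normal((d, V))
toy_lm = lambda x: x[-1] @ W                         # logits from the last position
drafts, scores = superposed_decode(E, toy_lm, prompt_ids=[1, 2], k=3, steps=4)
```

Because every step runs the model once on a single superposed input rather than $k$ times, the per-step cost matches one autoregressive pass; the paper's $2.44\times$ speedup for $k\ge3$ follows from amortizing this shared forward pass.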