Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
Abstract: Large language models (LLMs) demonstrate substantial potential across a diverse array of domains and are typically deployed through request-serving systems. However, as trends continue to push toward ever larger context sizes, the autoregressive nature of LLMs makes the behavior of the attention layers highly dynamic, with computational characteristics and memory requirements that differ significantly from those of the non-attention layers. This poses substantial challenges for resource management and performance optimization in serving systems, and existing static model-parallelism and resource-allocation strategies fall short when dealing with this dynamicity. To address these issues, we propose Infinite-LLM, a novel LLM serving system designed to handle dynamic context lengths efficiently. Infinite-LLM disaggregates the attention layers from the rest of the LLM's inference process, enabling flexible and independent resource scheduling that jointly optimizes computational performance and memory utilization. By pooling GPU memory across a cluster, Infinite-LLM not only significantly boosts system throughput but also supports extremely long contexts. Evaluated on a dataset with context lengths ranging from a few to 2000K tokens on a cluster with 32 A100 GPUs, Infinite-LLM achieves 1.35-3.4x higher throughput than state-of-the-art methods, enabling efficient and elastic LLM deployment.
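To make the core mechanism concrete, below is a minimal NumPy sketch of the kind of distributed attention computation the abstract describes: the KV cache is partitioned into blocks that may reside on different GPUs, each block produces partial attention statistics locally, and the partials are merged with the standard online-softmax rule. The function names, block layout, and single-query setting are illustrative assumptions, not the paper's actual DistAttention implementation.

```python
import numpy as np

def partial_attention(q, k_block, v_block):
    """Attention over one KV block, returning partial statistics
    (local max, local exp-sum, unnormalized weighted values)
    instead of a final output. In a real system this would run
    on whichever GPU holds the block."""
    scores = q @ k_block.T / np.sqrt(q.shape[-1])   # (1, block_len)
    m = scores.max(axis=-1, keepdims=True)          # local max for numerical stability
    p = np.exp(scores - m)
    s = p.sum(axis=-1, keepdims=True)               # local softmax denominator
    o = p @ v_block                                 # unnormalized weighted values
    return m, s, o

def merge_partials(parts):
    """Merge per-block partials with the online-softmax rule, so that
    blocks stored on different GPUs can be reduced in any order."""
    m, s, o = parts[0]
    for m2, s2, o2 in parts[1:]:
        m_new = np.maximum(m, m2)
        a, b = np.exp(m - m_new), np.exp(m2 - m_new)
        s = a * s + b * s2
        o = a * o + b * o2
        m = m_new
    return o / s  # normalize once, after all blocks are merged

# Example: one query head attending over a KV cache split into 4 blocks,
# which in practice would live on different GPUs in the memory pool.
d = 64
q = np.random.randn(1, d)
kv_blocks = [(np.random.randn(128, d), np.random.randn(128, d)) for _ in range(4)]
parts = [partial_attention(q, k, v) for k, v in kv_blocks]
out = merge_partials(parts)  # identical to attention over the full concatenated cache
```

The key property exploited here is that the merge is associative and order-independent, so KV blocks can be placed wherever memory is available across the cluster and reduced in any order; this is what makes a pooled, cluster-wide KV cache compatible with exact attention.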