RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation
Abstract: Retrieval-Augmented Generation (RAG) has shown significant improvements in various natural language processing tasks by integrating the strengths of LLMs and external knowledge databases. However, RAG introduces long generation sequences, which lead to high computation and memory costs. We propose RAGCache, a novel multilevel dynamic caching system tailored for RAG. Our analysis benchmarks current RAG systems, pinpointing the performance bottleneck (i.e., long sequences due to knowledge injection) and optimization opportunities (i.e., caching the intermediate states of retrieved knowledge). Based on these insights, we design RAGCache, which organizes the intermediate states of retrieved knowledge in a knowledge tree and caches them in the GPU and host memory hierarchy. RAGCache proposes a replacement policy that is aware of LLM inference characteristics and RAG retrieval patterns. It also dynamically overlaps the retrieval and inference steps to minimize the end-to-end latency. We implement RAGCache and evaluate it on vLLM, a state-of-the-art LLM inference system, and Faiss, a state-of-the-art vector database. The experimental results show that RAGCache reduces the time to first token (TTFT) by up to 4x and improves the throughput by up to 2.1x compared to vLLM integrated with Faiss.
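The knowledge-tree idea in the abstract can be illustrated with a minimal sketch: retrieved documents form a prefix tree, each node holding the (cached) KV-state footprint of one document, so requests that share a document prefix reuse cached states. This is not the paper's implementation; the class names, the single-level capacity model, and the greedy-dual-size-frequency-style eviction score are illustrative assumptions (the actual system spans GPU and host memory and uses its own policy).

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class KnowledgeNode:
    """One retrieved document; tracks its cached KV-state size and reuse stats."""
    doc_id: str
    size: int                       # KV-cache footprint of this document's tokens
    freq: int = 0                   # access count, feeds the replacement priority
    clock: float = 0.0              # GDSF-style aging term
    children: Dict[str, "KnowledgeNode"] = field(default_factory=dict)

class KnowledgeTreeCache:
    """Prefix tree over retrieved-document sequences (illustrative sketch).

    A request's documents [d1, d2, ...] map to a root-to-node path, so two
    requests sharing a document prefix can reuse the same cached KV states.
    Eviction removes the leaf with the lowest clock + freq/size score:
    frequently reused, small entries survive; stale ones age out.
    """

    def __init__(self, capacity: int):
        self.root = KnowledgeNode("<root>", 0)
        self.capacity = capacity
        self.used = 0
        self.global_clock = 0.0

    def lookup(self, doc_ids: List[str]) -> int:
        """Return how many leading documents already have cached states."""
        node, hits = self.root, 0
        for d in doc_ids:
            if d not in node.children:
                break
            node = node.children[d]
            node.freq += 1
            node.clock = self.global_clock
            hits += 1
        return hits

    def insert(self, doc_ids: List[str], sizes: List[int]) -> None:
        """Cache the KV states along the path, evicting leaves to make room."""
        node = self.root
        for d, s in zip(doc_ids, sizes):
            if d not in node.children:
                while self.used + s > self.capacity:
                    if not self._evict_one():
                        return  # cannot make room; stop caching this path
                node.children[d] = KnowledgeNode(d, s, clock=self.global_clock)
                self.used += s
            node = node.children[d]

    def _evict_one(self) -> bool:
        """Evict the lowest-priority leaf; evicting leaves keeps prefixes valid."""
        parent, key, best = None, None, float("inf")
        stack = [self.root]
        while stack:
            n = stack.pop()
            for k, c in n.children.items():
                if c.children:
                    stack.append(c)
                else:
                    score = c.clock + c.freq / max(c.size, 1)
                    if score < best:
                        best, parent, key = score, n, k
        if parent is None:
            return False
        self.used -= parent.children[key].size
        self.global_clock = best  # age the cache toward the evicted priority
        del parent.children[key]
        return True
```

Because the tree is keyed by document *order*, a request retrieving the same documents in a different order gets no prefix hit; this mirrors why placing hot documents early in the prompt matters for reuse.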