RAGPulse: An Open-Source RAG Workload Trace to Optimize RAG Serving Systems

Published 17 Nov 2025 in cs.LG and cs.DB | (2511.12979v1)

Abstract: Retrieval-Augmented Generation (RAG) is a critical paradigm for building reliable, knowledge-intensive LLM applications. However, the multi-stage pipeline (retrieve, generate) and unique workload characteristics (e.g., knowledge dependency) of RAG systems pose significant challenges for serving performance optimization. Existing generic LLM inference traces fail to capture these RAG-specific dynamics, creating a significant performance gap between academic research and real-world deployment. To bridge this gap, this paper introduces RAGPulse, an open-source RAG workload trace dataset. This dataset was collected from a university-wide Q&A system that has served more than 40,000 students and faculty members since April 2024. We detail RAGPulse's system architecture and its privacy-preserving hash-based data format, and provide an in-depth statistical analysis. Our analysis reveals that real-world RAG workloads exhibit significant temporal locality and a highly skewed hot document access pattern. RAGPulse provides a high-fidelity foundation for researchers to develop and validate novel optimization strategies for RAG systems, such as content-aware batching and retrieval caching, ultimately enhancing the efficiency and reliability of RAG services. The code is available at https://github.com/flashserve/RAGPulse.


Summary

The paper presents RAGPulse as a high-fidelity workload trace dataset that captures real-world RAG request dynamics and performance bottlenecks.
It leverages detailed analyses of temporal locality, skewed document access, and dynamic input compositions to inform caching and scheduling strategies.
The study also describes the offline and online architecture of the system that produced the trace, connecting academic insights with practical deployment.

RAGPulse: Optimizing RAG Serving Systems with Open-Source Workload Traces

Introduction to Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) enhances LLMs by integrating external knowledge bases to address challenges such as knowledge cutoff and hallucination. RAG systems utilize a multi-stage pipeline consisting of retrieval, augmentation, and generation phases. This workflow synergizes the reasoning capabilities of LLMs with the factual accuracy of external databases, significantly improving the reliability and timeliness of LLM applications.
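The three phases named above can be illustrated with a toy pipeline. This is a minimal sketch, not the system described in the paper: the `Passage` type, the word-overlap ranking (a stand-in for vector search), and the placeholder `generate` function are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    doc_id: str
    text: str


def tokenize(text: str) -> set[str]:
    # Lowercased word set with basic punctuation stripped.
    return {w.strip(".,?!").lower() for w in text.split() if w.strip(".,?!")}


def retrieve(query: str, corpus: list[Passage], top_k: int = 2) -> list[Passage]:
    # Retrieval: rank passages by word overlap with the query.
    # A real system would use embeddings and a vector index instead.
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda p: len(q & tokenize(p.text)), reverse=True)
    return ranked[:top_k]


def augment(query: str, passages: list[Passage], system_prompt: str) -> str:
    # Augmentation: combine system prompt, retrieved context, and the question.
    context = "\n".join(f"[{p.doc_id}] {p.text}" for p in passages)
    return f"{system_prompt}\n\nContext:\n{context}\n\nQuestion: {query}"


def generate(prompt: str) -> str:
    # Generation: placeholder for an LLM call (hypothetical, no model involved).
    return f"(model answer based on {len(prompt)} prompt characters)"


corpus = [
    Passage("doc-1", "The library opens at 8 am on weekdays."),
    Passage("doc-2", "Course registration closes in the second week of term."),
    Passage("doc-3", "The library is closed on public holidays."),
]
query = "When does the library open?"
hits = retrieve(query, corpus)
prompt = augment(query, hits, "Answer using only the context.")
answer = generate(prompt)
print(prompt)
print(answer)
```

Because the prompt grows with the number and size of retrieved passages, the retrieval step directly shapes the input length seen by the generation step, which is one reason RAG workloads differ from generic LLM inference workloads.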

Despite these advantages, existing generic LLM inference traces do not capture the unique workload characteristics of RAG systems, leading to performance gaps between academic research and real-world deployment. RAGPulse aims to bridge this gap by providing an open-source RAG workload trace dataset that offers high-fidelity insight into the real-world dynamics of RAG workloads.

Figure 1: Workflow of RAG.

RAGPulse Dataset Overview

RAGPulse was compiled from a university-wide Q&A system that serves over 40,000 students and faculty members. The dataset records system-level runtime information for various RAG requests, and its analysis reveals pronounced temporal locality and a highly skewed hot document access pattern.

Key Dataset Features

Temporal Locality and Skewed Access Patterns: RAG workloads exhibit a highly skewed access pattern where a small subset of documents is frequently referenced, indicating significant potential for optimization through retrieval caching.
Dynamic Input Composition: The share of the input taken up by each component, such as the system prompt and the retrieved passages, changes with total input length. This variability suggests that processing overhead differs by request type.
Periodic Workload Fluctuations: Periodic peaks and troughs in system throughput are consistent with diurnal human activity patterns, providing critical insights for resource scheduling and load balancing.

Figure 2: CDF of Input and Output Token Lengths in RAGPulse.
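To see why a skewed access pattern favors retrieval caching, one can replay a sequence of document accesses through a small cache and measure the hit rate. The sketch below is illustrative only: it uses a synthetic trace with Zipf-like weights (an assumption, not RAGPulse data) and an LRU policy chosen for simplicity.

```python
import random
from collections import OrderedDict


def lru_hit_rate(accesses: list[str], capacity: int) -> float:
    """Replay document accesses through an LRU cache and return the hit rate."""
    cache = OrderedDict()
    hits = 0
    for doc in accesses:
        if doc in cache:
            hits += 1
            cache.move_to_end(doc)  # mark as most recently used
        else:
            cache[doc] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / len(accesses) if accesses else 0.0


# Synthetic trace (assumption): 200 documents, where the document at
# popularity rank r is drawn with weight 1/r.
docs = [f"doc-{i}" for i in range(1, 201)]
weights = [1 / rank for rank in range(1, 201)]
trace = random.Random(0).choices(docs, weights=weights, k=5000)

for capacity in (5, 20, 80):
    print(capacity, round(lru_hit_rate(trace, capacity), 3))
```

Running the same function over the hashed document identifiers of a real trace would indicate how much cache capacity is needed to serve most retrievals from memory.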

Figure 3: Throughput over time in RAGPulse.
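A throughput-over-time view like the one in Figure 3 can be approximated by bucketing request timestamps. The following sketch assumes Unix timestamps in seconds and groups them by hour of day in UTC; the function names and the input format are hypothetical and may not match the RAGPulse schema.

```python
from collections import Counter
from datetime import datetime, timezone


def requests_per_hour(timestamps: list[float]) -> dict[int, int]:
    """Count requests by hour of day (UTC), given Unix timestamps in seconds."""
    counts = Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).hour for ts in timestamps
    )
    # Include hours with no traffic so troughs are visible.
    return {hour: counts.get(hour, 0) for hour in range(24)}


def peak_and_trough(hourly: dict[int, int]) -> tuple[int, int]:
    """Return the busiest and quietest hour; ties resolve to the earliest hour."""
    peak = max(hourly, key=hourly.get)
    trough = min(hourly, key=hourly.get)
    return peak, trough


sample = [0, 10, 3600, 7205, 7210, 7220]  # toy timestamps, not real data
hourly = requests_per_hour(sample)
print(peak_and_trough(hourly))
```

A scheduler could use such hourly counts to provision more serving capacity ahead of the daily peak and scale down during the trough.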

Applications and Implications of RAGPulse

RAGPulse serves as a foundation for several optimization strategies in RAG systems:

Precise Performance Bottleneck Analysis: Enables analysis of latency contributions across retrieval, reranking, and generation stages to identify bottlenecks.
Informed Scheduling and Caching: Facilitates the design of sophisticated strategies such as content-aware batching and efficient KV cache reuse under real-world inter-request dependency patterns.
High-Fidelity Benchmarking: Provides a validated basis for establishing realistic workload models, simulators, and benchmarks for academic and industrial research.
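As one illustration of content-aware batching, requests that share retrieved documents can be grouped so that cached state for those documents is more likely to be reused within a batch. The greedy policy below is a simplified sketch under stated assumptions, not the method proposed in the paper; `Request`, `Batch`, and `content_aware_batches` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    request_id: str
    doc_hashes: frozenset[str]  # hashed IDs of the documents retrieved for this request


@dataclass
class Batch:
    requests: list[Request] = field(default_factory=list)
    docs: set[str] = field(default_factory=set)

    def add(self, req: Request) -> None:
        self.requests.append(req)
        self.docs |= req.doc_hashes


def content_aware_batches(requests: list[Request], max_batch_size: int) -> list[Batch]:
    """Greedily place each request in the open batch sharing the most documents."""
    batches: list[Batch] = []
    for req in requests:
        best, best_overlap = None, 0
        for batch in batches:
            if len(batch.requests) >= max_batch_size:
                continue  # batch is full
            overlap = len(batch.docs & req.doc_hashes)
            if overlap > best_overlap:
                best, best_overlap = batch, overlap
        if best is None:
            # No open batch shares a document with this request: start a new one.
            best = Batch()
            batches.append(best)
        best.add(req)
    return batches


reqs = [
    Request("r1", frozenset({"a", "b"})),
    Request("r2", frozenset({"c"})),
    Request("r3", frozenset({"b", "d"})),
    Request("r4", frozenset({"c", "e"})),
    Request("r5", frozenset({"x"})),
]
for batch in content_aware_batches(reqs, max_batch_size=2):
    print([r.request_id for r in batch.requests], sorted(batch.docs))
```

This sketch ignores arrival time and latency deadlines; a production scheduler would need to bound how long a request waits for a matching batch.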

Given these capabilities, RAGPulse is a valuable resource for advancing both the theoretical understanding and practical deployment of RAG systems, empowering researchers to address contemporary challenges in LLM-serving environments.

Figure 4: Proportion of Input Components Across Different Input Lengths in RAGPulse.
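A breakdown like the one in Figure 4 can be computed by grouping requests into input-length buckets and measuring each component's share of the tokens in every bucket. The sketch below assumes each record is a mapping from component name to token count; the component names (`system_prompt`, `passages`, `query`) and the bucket size are illustrative assumptions rather than the dataset's actual fields.

```python
from collections import defaultdict


def component_shares(records: list[dict[str, int]], bucket_size: int = 1000) -> dict[int, dict[str, float]]:
    """Token-weighted share of each input component, grouped by total input length."""
    totals = defaultdict(lambda: defaultdict(int))
    for rec in records:
        total = sum(rec.values())
        if total == 0:
            continue  # skip empty inputs
        bucket = (total // bucket_size) * bucket_size  # lower bound of the bucket
        for name, tokens in rec.items():
            totals[bucket][name] += tokens
    shares = {}
    for bucket, comps in sorted(totals.items()):
        bucket_total = sum(comps.values())
        shares[bucket] = {name: tokens / bucket_total for name, tokens in comps.items()}
    return shares


records = [  # toy values, not taken from RAGPulse
    {"system_prompt": 200, "passages": 300, "query": 100},
    {"system_prompt": 200, "passages": 1700, "query": 100},
    {"system_prompt": 200, "passages": 100, "query": 100},
]
for bucket, parts in component_shares(records).items():
    print(bucket, {name: round(share, 2) for name, share in parts.items()})
```

In this toy data the fixed-size system prompt dominates short inputs while retrieved passages dominate long ones, which is the kind of shift that makes prefix caching more or less effective depending on request length.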

Twen System Architecture

The data for RAGPulse is derived from Twen, a RAG system built on a microservice architecture that separates retrieval and generation tasks. The architecture relies on a suite of tools, including an LLM for agent interactions and a vector database for document indexing.

Offline and Online Stages

Offline Stage: Involves constructing a high-quality vector knowledge base from multiple data sources, employing LLMs for tasks such as OCR and text normalization.
Online Stage: Handles real-time user queries through an Agent LLM that dynamically orchestrates processing tools and generates the RAGPulse trace records.

Figure 5: Twen's System Architecture.

Figure 6: Offline Architecture in Twen.

Figure 7: Online Architecture in Twen.
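The abstract mentions a privacy-preserving, hash-based data format for the trace records produced in the online stage. The sketch below shows one way such a record could be built, assuming salted SHA-256 digests truncated to 16 hex characters; the field names, the salt handling, and the truncation length are assumptions for illustration and do not describe the actual RAGPulse schema.

```python
import hashlib
import json


def hash_id(text: str, salt: str) -> str:
    """Return a short, stable identifier that does not contain the text itself."""
    digest = hashlib.sha256((salt + text).encode("utf-8")).hexdigest()
    return digest[:16]


def make_trace_record(timestamp: float, query: str, passages: list[str],
                      output_tokens: int, salt: str) -> dict:
    # Hypothetical record layout: content is replaced by hashes, while
    # sizes and timing are kept so workload analysis remains possible.
    return {
        "timestamp": timestamp,
        "query_hash": hash_id(query, salt),
        "passage_hashes": [hash_id(p, salt) for p in passages],
        "query_length": len(query.split()),
        "output_tokens": output_tokens,
    }


salt = "example-salt"  # assumption: a secret value kept out of the released trace
shared = "The library opens at 8 am on weekdays."
rec_a = make_trace_record(1700000000.0, "When does the library open?", [shared], 42, salt)
rec_b = make_trace_record(1700000060.0, "Library opening hours?", [shared, "Another passage."], 37, salt)
print(json.dumps(rec_a, indent=2))
```

Because identical passages map to identical hashes, repeated accesses to the same document remain visible across records, which preserves the locality and skew signals that caching studies depend on.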

Conclusion

RAGPulse helps close the gap between academic RAG research and real-world implementations. By offering a comprehensive, high-fidelity snapshot of RAG-specific workload dynamics, it paves the way for more efficient and reliable RAG serving systems. Building on these insights, the research community can develop novel optimization techniques that address contemporary challenges in RAG systems, ultimately driving advances in AI-assisted technologies and LLM-serving infrastructure.
