- The paper introduces Zep, a memory layer service using a bi-temporal, hierarchical knowledge graph that enhances multi-hop and temporal reasoning.
- It employs a triple-layered architecture and multi-stage reranking to achieve up to 18.5% accuracy gains and 90% latency reduction over full-context methods.
- The system enables precise provenance tracing and low-latency context retrieval, making it ideal for enterprise-level, long-term interactive agent applications.
Zep: A Temporal Knowledge Graph Architecture for Agent Memory
Introduction
This paper introduces Zep, a memory layer service for LLM-based agents that centers agent memory and retrieval on a temporally aware, hierarchical knowledge graph architecture. Zep is optimized for enterprise deployments that must integrate continuously evolving conversational and structured business data, and it places strong emphasis on data fidelity, temporal consistency, search accuracy, and low-latency context retrieval. The system advances the state of the art in contextual memory for agents, outperforming MemGPT on both the Deep Memory Retrieval (DMR) and LongMemEval (LME) benchmarks, with the most pronounced gains on complex, real-world tasks requiring long-term and temporal reasoning.
Motivation and System Design
Current RAG frameworks for LLM agents typically restrict themselves to static corpora and do not address the continuous, multi-source synthesis required for real-world agent deployment. Zep addresses these limitations by framing agent memory as a dynamic, bi-temporal knowledge graph with explicit support for:
- Layered memory construction over episodic (raw interaction) and semantic (extracted entity/fact) subgraphs.
- Hierarchical clustering of knowledge into communities, supporting high-level summarization and efficient, localized search.
- Temporal extraction and explicit edge invalidation, tracking both fact validity and transactional system state.
- Non-lossy, bidirectional data indexing, allowing precise provenance tracing and historical context access.
This design aligns with cognitive memory models and extends the knowledge graph-based RAG paradigm, most notably by operationalizing the bi-temporal model throughout the ingestion, storage, and retrieval pipeline.
Knowledge Graph Construction
Zep’s Graphiti engine instantiates the knowledge graph G=(N,E,ϕ) as a triple-layered architecture comprising:
- Episode Subgraph: Stores raw episodic data (messages or structured events) with reference timestamps. Indexes to semantic nodes serve as the interface for incremental knowledge extraction.
- Semantic Entity Subgraph: Encodes entities and inter-entity relations as nodes and edges, supports entity deduplication (via embedding- and LLM-based matching), and leverages a reflection-inspired validation process to minimize extraction hallucinations.
- Community Subgraph: Implements dynamic, label propagation-based clustering over entity relations, enables maintenance of high-level summaries with reduced cost and supports efficient context expansion as new data is ingested.
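The community subgraph's label propagation-based clustering can be sketched in a few lines. This is a minimal, illustrative implementation of the generic algorithm, not Graphiti's actual code; the function name and deterministic tie-breaking (smallest label wins) are choices made here for reproducibility.

```python
from collections import Counter, defaultdict

def label_propagation(edges, max_iters=20):
    """Minimal label propagation: each node repeatedly adopts the most
    common label among its neighbours until labels stabilize. Nodes that
    end up sharing a label form one community."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    labels = {n: n for n in adj}  # start with one community per node
    for _ in range(max_iters):
        changed = False
        for n in sorted(adj):  # fixed order keeps the result deterministic
            counts = Counter(labels[m] for m in adj[n])
            top = max(counts.values())
            best = min(l for l, c in counts.items() if c == top)  # tie-break
            if labels[n] != best:
                labels[n] = best
                changed = True
        if not changed:
            break
    return labels
```

On two disjoint triangles, each triangle collapses to a single shared label, i.e. two communities. The appeal for an incrementally growing graph is that propagation is cheap and local, so community summaries can be refreshed without global recomputation.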
Temporal modeling records both event-based validity windows (when a fact was true in the world) and ingestion-based transactional lifetimes (when the fact was known to the agent), supporting robust historical query, edge invalidation, and provenance.
Memory Retrieval Pipeline
Zep’s retrieval system is functionally partitioned as f(α)=χ(ρ(φ(α))), where:
- φ: Hybrid search (cosine similarity, full-text BM25, and knowledge graph BFS) collecting candidate entities, facts, and communities.
- ρ: Multi-stage reranking (RRF, MMR, episode-mentions, node distance, and LLM-based cross-encoders) to prioritize informativeness and query-relevance.
- χ: Context constructor assembling a templated summary with temporal ranges, interconnected entities, and cluster overviews.
This composite operator allows highly configurable recall/precision trade-offs, latency tuning, and adaptation to specific agent context requirements.
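As a concrete instance of the reranking stage ρ, Reciprocal Rank Fusion merges the ranked lists produced by the parallel searches in φ. The sketch below is the standard RRF formula under assumed inputs (three illustrative candidate lists); the function name is not Zep's API:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: merge several ranked candidate lists
    (e.g. cosine-similarity, BM25, and graph-BFS results) into one.
    Each item scores sum(1 / (k + rank_i)) over the lists it appears in;
    k=60 is the conventional smoothing constant."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical candidate lists from the three search methods in φ:
semantic = ["fact_1", "fact_2", "fact_3"]   # cosine similarity
bm25     = ["fact_2", "fact_4", "fact_1"]   # full-text
bfs      = ["fact_2", "fact_3"]             # graph traversal
fused = rrf_fuse([semantic, bm25, bfs])     # fact_2 ranks first
```

Items surfaced by multiple search methods rise to the top, which is why fusion-style rerankers are a natural fit for hybrid retrieval; later stages (MMR, cross-encoders) can then refine this fused list.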
Experimental Results
Deep Memory Retrieval (DMR)
On the DMR benchmark (multi-session QA over 60-message conversations), Zep slightly outperforms both MemGPT and a full-conversation baseline (gpt-4-turbo: 94.8% vs. 94.4%/93.4%; gpt-4o-mini: 98.2% vs. 98.0%). However, the authors argue that DMR is not sufficiently challenging given current context window sizes and real-world requirements, as performance of baseline full-context methods with modern LLMs is already near-optimal.
LongMemEval (LME)
On LongMemEval (conversations averaging ~115k tokens), Zep yields:
- Accuracy: Improvements up to 18.5% over full-context baselines (gpt-4o, 71.2% vs. 60.2%; gpt-4o-mini, 63.8% vs. 55.4%).
- Latency: Prompt size reduction (~1.6K tokens vs. 115K tokens) results in a 90% decrease in response time (e.g., 2.58 s vs. 28.9 s).
- Question Types: Most notable gains in multi-session, temporal reasoning, preference, and knowledge-update categories. Minor regressions observed in assistant-centric queries, indicating potential areas for future specialization.
The architecture's temporal and bidirectional indexing are strongly implicated in improved multi-hop and contextual synthesis capabilities under long, cross-session memory scenarios.
Theoretical and Practical Implications
The explicit separation of episodic and semantic memory, implementation of bi-temporal edges, and multi-tiered community abstraction provide a robust foundation for next-generation agentic systems operating on open-ended, rapidly evolving domains. Zep’s attention to non-lossy data handling and explicit provenance support has strong implications for explainability and source attribution—critical in regulated or high-stakes domains.
Practically, Zep demonstrates that sophisticated temporal memory architectures can substantially reduce operational cost (token usage, latency) while maintaining or improving accuracy, enabling persistent, long-term interactive agents in business and enterprise applications where naïve full-context approaches are infeasible.
Future Directions
The paper identifies several research trajectories:
- Fine-tuned LLMs and Extraction Models: Custom models for entity and fact extraction may further reduce hallucination and improve extraction recall in open-domain settings (Choubey et al., 2024).
- Ontology Integration: Domain-specific ontologies remain underexplored in current LLM-graph literature and could enhance both node/edge resolution and semantic coherence.
- Broader Benchmarks: There is a critical need for benchmarks that test memory synthesis across structured business data and lifelike conversational histories.
- Production Metrics: Benchmarks and reporting for real deployment constraints (latency, compute, cost) remain underemphasized in research discourse.
- Hybrid Search and Summarization: Graphiti’s integration of BFS and hybrid reranking suggests further gains may be possible by combining graph traversal with hierarchical or map-reduce-style community summarization strategies.
Conclusion
Zep establishes a new high-water mark for agentic memory architectures by operationalizing bi-temporal, hierarchical knowledge graphs as the foundation for low-latency, high-accuracy context retrieval in LLM agents. Its architecture consistently outperforms prior art on benchmarks that approximate real-world memory demands encountered in enterprise applications. The demonstrated reductions in latency and resource use, together with enhanced performance on temporally and contextually complex queries, confirm the utility of temporally aware knowledge graphs for robust agentic memory. The research opens multiple avenues for theoretical refinement, practical benchmarking, and system deployment in persistent agent applications.