SwiftMem: Query-Aware Memory System
- SwiftMem is a query-aware agentic memory system that delivers real-time, scalable retrieval for LLM agents using a multi-tiered indexing approach.
- It leverages temporal indexing and a semantic DAG-tag structure to drastically reduce retrieval complexity and improve cache locality through embedding co-consolidation.
- Performance evaluations on benchmarks like LoCoMo and LongMemEval_S show SwiftMem achieves orders-of-magnitude speedup while maintaining competitive retrieval accuracy.
SwiftMem is a query-aware agentic memory system designed for LLM agents to enable real-time, scalable retrieval of relevant past context and episodic information. It addresses the core bottleneck of existing agentic memory frameworks—namely, exhaustive filtering or full similarity search across all stored memory, which incurs latency as the memory store grows. By introducing a multi-tiered index architecture that exploits both temporal and semantic locality, SwiftMem achieves provably sub-linear retrieval complexity while maintaining competitive retrieval accuracy on established benchmarks. The system also incorporates co-consolidation of embeddings based on tag-driven clustering to enhance hardware cache locality during similarity search. Performance evaluations demonstrate orders-of-magnitude speedup versus previous state-of-the-art systems, with minimal sacrifice in retrieval quality (Tian et al., 13 Jan 2026).
1. System Architecture and Motivation
The primary constraint in prior agentic memory systems is linear search complexity across the memory corpus, leading to prohibitive search times (800 ms to multiple seconds as history grows) that are unsuited to real-time LLM agent interaction. Empirical observations show that queries exhibit strong temporal and semantic locality: most queries refer to recent or topically clustered episodes.
SwiftMem is engineered to leverage these observations with three principal design goals:
- Sub-linear retrieval complexity via index structures over both temporal and semantic axes.
- High retrieval quality, measured by LLM-judged semantic relevance and lexical overlap.
- Robust support for dynamism and growth via periodic memory reorganization.
The core indexing pipeline is organized into three tiers:
- Temporal Index: Restricts queries to relevant time intervals in O(log n) time via binary search.
- Semantic DAG-Tag Index: Routes queries through a hierarchical tag structure, exploiting semantic locality so that only tag-relevant episodes are considered.
- Embedding Index with Co-consolidation: Performs similarity search only within semantically and temporally filtered candidates, improving cache locality.
2. Temporal Indexing
The temporal index consists of two structures: a per-user list of (timestamp, episode-ID) tuples kept sorted by timestamp, and a reverse map from episode identifiers to their user and timestamp metadata.
Temporal range queries are answered in O(log n) time via binary search:
- Locate the lower and upper bounds for the query interval.
- Return the set of episodes within the interval.
Insertion and maintenance are similarly efficient, supporting both single and multi-interval queries. This temporal layer ensures that time-sensitive queries are resolved without inspecting the entire memory, and it provides the first and often most significant reduction in candidate set size.
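A minimal sketch of such a temporal index in Python (class and field names are illustrative, not SwiftMem's actual API): parallel timestamp and episode-ID arrays are kept sorted per user, so a range query is two binary searches plus a slice.

```python
import bisect
from collections import defaultdict

class TemporalIndex:
    """Per-user timestamp-sorted timelines plus an episode metadata map.

    `times[user]` and `ids[user]` are parallel arrays kept sorted by
    timestamp, so a range query costs two binary searches: O(log n)."""

    def __init__(self):
        self.times = defaultdict(list)  # user -> sorted timestamps
        self.ids = defaultdict(list)    # user -> episode IDs, parallel to times
        self.meta = {}                  # episode_id -> (user, timestamp)

    def insert(self, user, ts, episode_id):
        i = bisect.bisect_right(self.times[user], ts)
        self.times[user].insert(i, ts)
        self.ids[user].insert(i, episode_id)
        self.meta[episode_id] = (user, ts)

    def range_query(self, user, t_start, t_end):
        # Locate lower and upper bounds for [t_start, t_end], then slice.
        lo = bisect.bisect_left(self.times[user], t_start)
        hi = bisect.bisect_right(self.times[user], t_end)
        return self.ids[user][lo:hi]
```

Note that a plain Python list makes `insert` itself linear; a B-tree or skip list would keep writes logarithmic too, but the lookup logic is the same.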
3. Semantic DAG-Tag Index
3.1 LLM-based Tag Generation
For each memory episode, an LLM is prompted to extract 3–8 normalized tags (multi-word, lowercase, with underscore separation) and parent-child relations, forming a directed acyclic graph (DAG) over the tags. If the LLM pipeline fails, fallback embedding-based keyword extraction is employed.
3.2 DAG-Tag Data Structure
Each tag node in the DAG carries four attributes:
- the normalized tag text
- the set of associated episodes
- links to its parent and child tags
- an embedding vector
A specificity monotonicity theorem holds: along any path in the DAG, specificity strictly increases with depth.
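The node layout above can be sketched as a small dataclass; field names and the depth-based specificity proxy are illustrative assumptions, not SwiftMem's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class TagNode:
    """One node of a semantic DAG-tag index (field names are illustrative)."""
    text: str                                   # normalized tag, e.g. "machine_learning"
    episodes: set = field(default_factory=set)  # IDs of episodes carrying this tag
    parents: set = field(default_factory=set)   # texts of more general tags
    children: set = field(default_factory=set)  # texts of more specific tags
    embedding: tuple = ()                       # tag embedding vector

def depth(node, dag):
    """Depth in the DAG, a simple specificity proxy: under the monotonicity
    property, it strictly increases along any path from a root."""
    if not node.parents:
        return 0
    return 1 + max(depth(dag[p], dag) for p in node.parents)
```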
3.3 Query-Tag Routing
Query processing follows these steps:
- Embed the query to obtain a query vector.
- Compute cosine similarity between the query vector and all tag embeddings.
- Select the top-k tags by similarity.
- Expand each selected tag through the DAG up to a fixed depth d to retrieve related tags and their associated episodes.
The routing cost is linear in the number of tags T for the similarity scan, plus the cost of expanding k tags to depth d, which is far cheaper than scanning all n episodes.
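The routing steps above can be sketched as follows; the dictionary-based DAG representation and parameter defaults are assumptions for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def route_query(query_emb, dag, k=3, depth=2):
    """dag: tag -> {"embedding": [...], "episodes": set, "children": [...]}.
    Scores all T tags (the linear-in-T scan), keeps the top-k, then expands
    each through its DAG children up to `depth` hops, unioning episodes."""
    ranked = sorted(dag, key=lambda t: cosine(query_emb, dag[t]["embedding"]),
                    reverse=True)
    candidates, seen = set(), set()
    stack = [(t, 0) for t in ranked[:k]]
    while stack:
        tag, d = stack.pop()
        if tag in seen:
            continue
        seen.add(tag)
        candidates |= dag[tag]["episodes"]
        if d < depth:
            stack.extend((c, d + 1) for c in dag[tag]["children"])
    return candidates
```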
3.4 DAG Management
Tag insertion updates the DAG for every new episode; query retrieval aggregates the episodes associated with each relevant tag.
4. Embedding-Tag Co-consolidation
4.1 Semantic Tag Clustering
To improve hardware cache locality in the final similarity search, SwiftMem periodically clusters the tag DAG based on connection patterns (DAG connectivity, episode co-occurrence, and connected components). Each resulting cluster carries:
- a cluster ID
- its member tags
- a centroid tag
- a cohesion score in [0, 1]
4.2 Co-consolidation Procedure
A physical layout map records memory offsets for embedding blocks so that tag-clustered embeddings are stored contiguously. Consolidation is triggered when measured fragmentation or low cohesion indicates suboptimal cache use; each pass is linear in index size, O(n).
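A minimal sketch of the consolidation pass, assuming clusters are given as lists of episode IDs (the function and variable names are illustrative):

```python
def consolidate(embeddings, clusters):
    """Rebuild the physical embedding layout so each tag cluster's episode
    embeddings are contiguous. One linear pass over the index.

    embeddings: episode_id -> vector
    clusters:   list of episode-ID lists, one per tag cluster
    Returns (flat, layout): `flat` holds the consolidated blocks and
    `layout` maps episode_id -> offset into `flat`."""
    flat, layout = [], {}
    for members in clusters:
        for eid in members:               # cluster members land in adjacent
            layout[eid] = len(flat)       # slots, so a within-cluster scan
            flat.append(embeddings[eid])  # walks one contiguous region
    return flat, layout
```

In a real system `flat` would be a dense matrix (e.g. one NumPy array) rather than a list of lists, so that a cluster's rows share cache lines.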
5. Indexing and Retrieval Algorithms
5.1 New Episode Indexing
Upon receipt of a new episode:
- Insert it into the temporal index: O(log n) for the sorted-timeline lookup.
- Generate tags and parent-child relations via the LLM: cost dominated by the LLM call, independent of corpus size.
- Update the DAG for each extracted tag.
- Insert the embedding: amortized constant time, rising to linear when a consolidation pass is triggered.
Total indexing cost is therefore dominated by the LLM call plus logarithmic index maintenance.
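The write path can be sketched end to end; `embed` and `extract_tags` are caller-supplied stand-ins (in SwiftMem, tag extraction is an LLM call with an embedding-based fallback), and the flat-dict `memory` layout is an assumption for illustration.

```python
import bisect

def index_episode(memory, user, ts, episode_id, text, embed, extract_tags):
    """Index one new episode across all three tiers."""
    # 1. Temporal index: binary-search insert into the user's sorted timeline.
    bisect.insort(memory["timelines"].setdefault(user, []), (ts, episode_id))
    # 2. Tag generation: cost dominated by the LLM call.
    tags = extract_tags(text)
    # 3. DAG update: attach the episode to each tag's node.
    for tag in tags:
        node = memory["dag"].setdefault(tag, {"episodes": set(), "children": set()})
        node["episodes"].add(episode_id)
    # 4. Embedding store: amortized-constant append; co-consolidation
    #    reorders the physical layout in a separate periodic pass.
    memory["embeddings"][episode_id] = embed(text)
```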
5.2 Query Retrieval
For a query and a desired number of results:
- Extract an explicit time window if one is specified, restricting candidates to the matching interval in O(log n).
- Route the query through the DAG-tag index to obtain semantic candidates.
- Intersect the temporal and semantic candidate sets to form the final candidate episodes.
- Run similarity search over the candidates, linear in the (small) candidate-set size.
- Return the top-ranked results.
Overall retrieval complexity is the sum of these stages: logarithmic temporal filtering, tag routing over T tags, and similarity search over the reduced candidate set. This is sub-linear in corpus size due to aggressive candidate reduction.
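The final stage can be sketched as below, assuming (as one reading of the design) that the candidate set is the intersection of the temporal and semantic filters when a time window is present; names and signatures are illustrative.

```python
import math

def final_rank(query_emb, temporal_candidates, semantic_candidates,
               embeddings, k_results):
    """Intersect both filters, then run similarity search only on the
    (small) surviving candidate set and return the top results."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0
    candidates = temporal_candidates & semantic_candidates
    ranked = sorted(candidates,
                    key=lambda e: cos(query_emb, embeddings[e]), reverse=True)
    return ranked[:k_results]
```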
6. Empirical Evaluation
SwiftMem's efficacy is demonstrated on the LoCoMo (10 dialogues, 24K tokens/dialogue, 1,540 queries) and LongMemEval_S (500 dialogues, 105K tokens/dialogue) benchmarks. Key metrics include LLM-Judge Score (GPT-4.1-mini), F1, BLEU-1, and search latency.
Performance Comparisons (LoCoMo, GPT-4.1-mini)
| Method | LLM-Score | Search latency (ms) | Total (ms) |
|---|---|---|---|
| FullContext | 0.723 | – | 5,806 |
| LangMem | 0.513 | 19,829 | 22,082 |
| Mem0 | 0.613 | 784 | 3,539 |
| RAG-4096 | 0.302 | 544 | 2,884 |
| Zep | 0.585 | 522 | 3,255 |
| Nemori | 0.721 | 835 | 3,448 |
| SwiftMem | 0.652 | 11 | 1,289 |
SwiftMem achieves a roughly 47x search speedup over Zep (522 ms → 11 ms) and roughly 76x over Nemori (835 ms → 11 ms), with about 2.2x and 4.5x lower total latency compared to RAG-4096 and FullContext, respectively.
Retrieval Quality
| Method | LLM-Score | F1 | BLEU-1 |
|---|---|---|---|
| Nemori | 0.792 | 0.519 | 0.445 |
| SwiftMem | 0.704 | 0.429 | 0.467 |
SwiftMem shows a modest reduction in semantic alignment (LLM-Score, F1) relative to Nemori, but a higher BLEU-1, indicating better lexical precision.
7. Trade-offs, Limitations, and Prospective Extensions
Trade-offs in SwiftMem's design include increased index maintenance overhead: LLM-driven tag generation and DAG updates add per-write cost, which amortizes as writes accrue. The space requirement of the multi-dimensional indices (tag embeddings, pointers, timelines) is also higher than that of brute-force baselines.
Limitations arise from dependence on LLM-generated tag quality: insufficiently granular or inaccurate tags diminish semantic routing efficacy. Fixed k and d parameters may not be optimal across all query distributions. Co-consolidation relies on effective scheduling; poorly timed passes can yield suboptimal clustering.
Potential extensions include:
- Adaptive k and d per query via uncertainty or importance estimation.
- Approximate nearest-neighbor structures over tags to further reduce the dependence on tag count T.
- Hierarchical temporal trees for finer-grained time-based queries.
- Online DAG pruning/merging to control tag explosion at scale.
- Unified indices supporting heterogeneous memory types (procedural, resource).
SwiftMem's three-tier indexing (temporal, semantic, embedding) with periodic co-consolidation produces provable sub-linear retrieval and large practical speedups, with competitive accuracy on long-context evaluation settings (Tian et al., 13 Jan 2026).