SwiftMem: Query-Aware Memory System
- SwiftMem is a query-aware agentic memory system that delivers real-time, scalable retrieval for LLM agents using a multi-tiered indexing approach.
- It leverages temporal indexing and a semantic DAG-tag structure to drastically reduce retrieval complexity and improve cache locality through embedding co-consolidation.
- Performance evaluations on benchmarks like LoCoMo and LongMemEval_S show SwiftMem achieves orders-of-magnitude speedup while maintaining competitive retrieval accuracy.
SwiftMem is a query-aware agentic memory system designed for LLM agents to enable real-time, scalable retrieval of relevant past context and episodic information. It addresses the core bottleneck of existing agentic memory frameworks—namely, exhaustive filtering or full similarity search across all stored memory, which incurs latency as the memory store grows. By introducing a multi-tiered index architecture that exploits both temporal and semantic locality, SwiftMem achieves provably sub-linear retrieval complexity while maintaining competitive retrieval accuracy on established benchmarks. The system also incorporates co-consolidation of embeddings based on tag-driven clustering to enhance hardware cache locality during similarity search. Performance evaluations demonstrate orders-of-magnitude speedup versus previous state-of-the-art systems, with minimal sacrifice in retrieval quality (Tian et al., 13 Jan 2026).
1. System Architecture and Motivation
The primary constraint in prior agentic memory systems is linear search complexity across the memory corpus, leading to prohibitive search times (800 ms to multiple seconds as history grows) that are unsuited to real-time LLM agent interaction. Empirical observations show that queries exhibit strong temporal and semantic locality: most queries refer to recent or topically clustered episodes.
SwiftMem is engineered to leverage these observations with three principal design goals:
- Sub-linear retrieval complexity via index structures over both temporal and semantic axes.
- High retrieval quality, measured by LLM-judged semantic relevance and lexical overlap.
- Robust support for dynamism and growth via periodic memory reorganization.
The core indexing pipeline is organized into three tiers:
- Temporal Index: Restricts queries to relevant time intervals in O(log n) time via binary search.
- Semantic DAG-Tag Index: Routes queries through a hierarchical tag structure, exploiting semantic locality so that only tag-relevant episodes are considered.
- Embedding Index with Co-consolidation: Performs similarity search only within semantically and temporally filtered candidates, improving cache locality.
2. Temporal Indexing
The temporal index consists of two structures: a per-user list of (timestamp, episode-ID) tuples kept sorted by timestamp, and a reverse map from episode identifiers to their user and timestamp metadata.
Temporal range queries are answered in O(log n) time via binary search:
- Locate the lower and upper bounds for the query interval.
- Return the set of episodes within the interval.
Insertion and maintenance are similarly efficient, supporting both single and multi-interval queries. This temporal layer ensures that time-sensitive queries are resolved without inspecting the entire memory, and it provides the first and often most significant reduction in candidate set size.
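A minimal sketch of such a temporal index in Python (class and field names are illustrative, not SwiftMem's actual API): parallel timestamp and episode-ID arrays are kept sorted per user, so a range query is two binary searches plus a slice.

```python
import bisect
from collections import defaultdict

class TemporalIndex:
    """Per-user timestamp-sorted timelines plus an episode metadata map.

    `times[user]` and `ids[user]` are parallel arrays kept sorted by
    timestamp, so a range query costs two binary searches: O(log n)."""

    def __init__(self):
        self.times = defaultdict(list)  # user -> sorted timestamps
        self.ids = defaultdict(list)    # user -> episode IDs, parallel to times
        self.meta = {}                  # episode_id -> (user, timestamp)

    def insert(self, user, ts, episode_id):
        i = bisect.bisect_right(self.times[user], ts)
        self.times[user].insert(i, ts)
        self.ids[user].insert(i, episode_id)
        self.meta[episode_id] = (user, ts)

    def range_query(self, user, t_start, t_end):
        # Locate lower and upper bounds for [t_start, t_end], then slice.
        lo = bisect.bisect_left(self.times[user], t_start)
        hi = bisect.bisect_right(self.times[user], t_end)
        return self.ids[user][lo:hi]
```

Note that a plain Python list makes `insert` itself linear; a B-tree or skip list would keep writes logarithmic too, but the lookup logic is the same.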
3. Semantic DAG-Tag Index
3.1 LLM-based Tag Generation
For each memory episode, an LLM is prompted to extract 3–8 normalized tags (multi-word, lowercase, with underscore separation) and parent-child relations, forming a directed acyclic graph (DAG) over the tags. If the LLM pipeline fails, fallback embedding-based keyword extraction is employed.
3.2 DAG-Tag Data Structure
Each tag node in the DAG carries four attributes:
- the normalized tag text
- the set of associated episodes
- links to its parent and child tags
- an embedding vector
A specificity monotonicity theorem holds: along any path in the DAG, specificity strictly increases with depth.
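The node layout above can be sketched as a small dataclass; field names and the depth-based specificity proxy are illustrative assumptions, not SwiftMem's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class TagNode:
    """One node of a semantic DAG-tag index (field names are illustrative)."""
    text: str                                   # normalized tag, e.g. "machine_learning"
    episodes: set = field(default_factory=set)  # IDs of episodes carrying this tag
    parents: set = field(default_factory=set)   # texts of more general tags
    children: set = field(default_factory=set)  # texts of more specific tags
    embedding: tuple = ()                       # tag embedding vector

def depth(node, dag):
    """Depth in the DAG, a simple specificity proxy: under the monotonicity
    property, it strictly increases along any path from a root."""
    if not node.parents:
        return 0
    return 1 + max(depth(dag[p], dag) for p in node.parents)
```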
3.3 Query-Tag Routing
Query processing follows these steps:
- Embed the query to obtain a query vector.
- Compute cosine similarity between the query vector and all tag embeddings.
- Select the top-k tags by similarity.
- Expand each selected tag through the DAG up to a fixed depth d to retrieve related tags and their associated episodes.
The routing cost is linear in the number of tags T for the similarity scan, plus the cost of expanding k tags to depth d, which is far cheaper than scanning all n episodes.
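The routing steps above can be sketched as follows; the dictionary-based DAG representation and parameter defaults are assumptions for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def route_query(query_emb, dag, k=3, depth=2):
    """dag: tag -> {"embedding": [...], "episodes": set, "children": [...]}.
    Scores all T tags (the linear-in-T scan), keeps the top-k, then expands
    each through its DAG children up to `depth` hops, unioning episodes."""
    ranked = sorted(dag, key=lambda t: cosine(query_emb, dag[t]["embedding"]),
                    reverse=True)
    candidates, seen = set(), set()
    stack = [(t, 0) for t in ranked[:k]]
    while stack:
        tag, d = stack.pop()
        if tag in seen:
            continue
        seen.add(tag)
        candidates |= dag[tag]["episodes"]
        if d < depth:
            stack.extend((c, d + 1) for c in dag[tag]["children"])
    return candidates
```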
3.4 DAG Management
Tag insertion updates the DAG for every new episode; query retrieval aggregates the episodes associated with each relevant tag.
4. Embedding-Tag Co-consolidation
4.1 Semantic Tag Clustering
To improve hardware cache locality in the final similarity search, SwiftMem periodically clusters the tag DAG based on connection patterns (DAG connectivity, episode co-occurrence, and connected components). Each resulting cluster carries:
- a cluster ID
- its member tags
- a centroid tag
- a cohesion score in [0, 1]
4.2 Co-consolidation Procedure
A physical layout map records memory offsets for embedding blocks so that tag-clustered embeddings are stored contiguously. Consolidation is triggered when measured fragmentation or low cohesion indicates suboptimal cache use; each pass is linear in index size, O(n).
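A minimal sketch of the consolidation pass, assuming clusters are given as lists of episode IDs (the function and variable names are illustrative):

```python
def consolidate(embeddings, clusters):
    """Rebuild the physical embedding layout so each tag cluster's episode
    embeddings are contiguous. One linear pass over the index.

    embeddings: episode_id -> vector
    clusters:   list of episode-ID lists, one per tag cluster
    Returns (flat, layout): `flat` holds the consolidated blocks and
    `layout` maps episode_id -> offset into `flat`."""
    flat, layout = [], {}
    for members in clusters:
        for eid in members:               # cluster members land in adjacent
            layout[eid] = len(flat)       # slots, so a within-cluster scan
            flat.append(embeddings[eid])  # walks one contiguous region
    return flat, layout
```

In a real system `flat` would be a dense matrix (e.g. one NumPy array) rather than a list of lists, so that a cluster's rows share cache lines.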
5. Indexing and Retrieval Algorithms
5.1 New Episode Indexing
Upon receipt of a new episode:
- Insert it into the temporal index: O(log n) for the sorted-timeline lookup.
- Generate tags and parent-child relations via the LLM: cost dominated by the LLM call, independent of corpus size.
- Update the DAG for each extracted tag.
- Insert the embedding: amortized constant time, rising to linear when a consolidation pass is triggered.
Total indexing cost is therefore dominated by the LLM call plus logarithmic index maintenance.
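The write path can be sketched end to end; `embed` and `extract_tags` are caller-supplied stand-ins (in SwiftMem, tag extraction is an LLM call with an embedding-based fallback), and the flat-dict `memory` layout is an assumption for illustration.

```python
import bisect

def index_episode(memory, user, ts, episode_id, text, embed, extract_tags):
    """Index one new episode across all three tiers."""
    # 1. Temporal index: binary-search insert into the user's sorted timeline.
    bisect.insort(memory["timelines"].setdefault(user, []), (ts, episode_id))
    # 2. Tag generation: cost dominated by the LLM call.
    tags = extract_tags(text)
    # 3. DAG update: attach the episode to each tag's node.
    for tag in tags:
        node = memory["dag"].setdefault(tag, {"episodes": set(), "children": set()})
        node["episodes"].add(episode_id)
    # 4. Embedding store: amortized-constant append; co-consolidation
    #    reorders the physical layout in a separate periodic pass.
    memory["embeddings"][episode_id] = embed(text)
```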
5.2 Query Retrieval
For a query and a desired number of results:
- Extract an explicit time window if one is specified, restricting candidates to the matching interval in O(log n).
- Route the query through the DAG-tag index to obtain semantic candidates.
- Intersect the temporal and semantic candidate sets to form the final candidate episodes.
- Run similarity search over the candidates, linear in the (small) candidate-set size.
- Return the top-ranked results.
Overall retrieval complexity is the sum of these stages: logarithmic temporal filtering, tag routing over T tags, and similarity search over the reduced candidate set. This is sub-linear in corpus size due to aggressive candidate reduction.
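The final stage can be sketched as below, assuming (as one reading of the design) that the candidate set is the intersection of the temporal and semantic filters when a time window is present; names and signatures are illustrative.

```python
import math

def final_rank(query_emb, temporal_candidates, semantic_candidates,
               embeddings, k_results):
    """Intersect both filters, then run similarity search only on the
    (small) surviving candidate set and return the top results."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0
    candidates = temporal_candidates & semantic_candidates
    ranked = sorted(candidates,
                    key=lambda e: cos(query_emb, embeddings[e]), reverse=True)
    return ranked[:k_results]
```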
6. Empirical Evaluation
SwiftMem's efficacy is demonstrated on the LoCoMo (10 dialogues, 24K tokens/dialogue, 1,540 queries) and LongMemEval_S (500 dialogues, 105K tokens/dialogue) benchmarks. Key metrics include LLM-Judge Score (GPT-4.1-mini), F1, BLEU-1, and search latency.
Performance Comparisons (LoCoMo, GPT-4.1-mini)
| Method | LLM-Score | Search latency (ms) | Total (ms) |
|---|---|---|---|
| FullContext | 0.723 | – | 5,806 |
| LangMem | 0.513 | 19,829 | 22,082 |
| Mem0 | 0.613 | 784 | 3,539 |
| RAG-4096 | 0.302 | 544 | 2,884 |
| Zep | 0.585 | 522 | 3,255 |
| Nemori | 0.721 | 835 | 3,448 |
| SwiftMem | 0.652 | 11 | 1,289 |
SwiftMem achieves a roughly 47x search speedup over Zep (522 ms → 11 ms) and roughly 76x over Nemori (835 ms → 11 ms), with about 2.2x and 4.5x lower total latency compared to RAG-4096 and FullContext, respectively.
Retrieval Quality
| Method | LLM-Score | F1 | BLEU-1 |
|---|---|---|---|
| Nemori | 0.792 | 0.519 | 0.445 |
| SwiftMem | 0.704 | 0.429 | 0.467 |
SwiftMem shows a modest reduction in semantic alignment (LLM-Score, F1) relative to Nemori, but a higher BLEU-1, indicating better lexical precision.
7. Trade-offs, Limitations, and Prospective Extensions
Trade-offs in SwiftMem's design include increased index maintenance overhead: LLM-driven tag generation and DAG updates add per-write cost, which amortizes as writes accrue. The space requirement of the multi-dimensional indices (tag embeddings, pointers, timelines) is also higher than that of brute-force baselines.
Limitations arise from dependence on LLM-generated tag quality: insufficiently granular or inaccurate tags diminish semantic routing efficacy. Fixed k and d parameters may not be optimal across all query distributions. Co-consolidation relies on effective scheduling; poorly timed passes can yield suboptimal clustering.
Potential extensions include:
- Adaptive k and d per query via uncertainty or importance estimation.
- Approximate nearest-neighbor structures over tags to further reduce the dependence on tag count T.
- Hierarchical temporal trees for finer-grained time-based queries.
- Online DAG pruning/merging to control tag explosion at scale.
- Unified indices supporting heterogeneous memory types (procedural, resource).
SwiftMem's three-tier indexing (temporal, semantic, embedding) with periodic co-consolidation produces provable sub-linear retrieval and large practical speedups, with competitive accuracy on long-context evaluation settings (Tian et al., 13 Jan 2026).