
SwiftMem: Query-Aware Memory System

Updated 20 January 2026
  • SwiftMem is a query-aware agentic memory system that delivers real-time, scalable retrieval for LLM agents using a multi-tiered indexing approach.
  • It leverages temporal indexing and a semantic DAG-tag structure to drastically reduce retrieval complexity and improve cache locality through embedding co-consolidation.
  • Performance evaluations on benchmarks like LoCoMo and LongMemEval_S show SwiftMem achieves orders-of-magnitude speedup while maintaining competitive retrieval accuracy.

SwiftMem is a query-aware agentic memory system designed for LLM agents to enable real-time, scalable retrieval of relevant past context and episodic information. It addresses the core bottleneck of existing agentic memory frameworks: exhaustive filtering or full similarity search across all stored memory, which incurs O(N_{\text{mem}}) latency as the memory store grows. By introducing a multi-tiered index architecture that exploits both temporal and semantic locality, SwiftMem achieves provably sub-linear retrieval complexity while maintaining competitive retrieval accuracy on established benchmarks. The system also incorporates co-consolidation of embeddings based on tag-driven clustering to enhance hardware cache locality during similarity search. Performance evaluations demonstrate orders-of-magnitude speedup over previous state-of-the-art systems, with minimal sacrifice in retrieval quality (Tian et al., 13 Jan 2026).

1. System Architecture and Motivation

The primary constraint in prior agentic memory systems is linear search complexity across the memory corpus, leading to prohibitive search times (800 ms to multiple seconds as history grows) that are unsuited to real-time LLM agent interaction. Empirical observations show that queries exhibit strong temporal and semantic locality: most queries refer to recent or topically clustered episodes.

SwiftMem is engineered to leverage these observations with three principal design goals:

  • Sub-linear retrieval complexity via index structures over both temporal and semantic axes.
  • High retrieval quality, measured by LLM-judged semantic relevance and lexical overlap.
  • Robust support for dynamism and growth via periodic memory reorganization.

The core indexing pipeline is organized into three tiers:

  1. Temporal Index: Restricts queries to relevant time intervals in O(\log N_{\text{mem}}).
  2. Semantic DAG-Tag Index: Routes queries through a hierarchical tag structure, exploiting semantic locality in O(k \cdot (\log|V| + D_{\max})).
  3. Embedding Index with Co-consolidation: Performs similarity search only within the semantically and temporally filtered candidates, improving cache locality.

2. Temporal Indexing

The temporal index is defined as \mathcal{T} = \{\mathcal{L}_u, \mathcal{M}\}, where \mathcal{L}_u is a sorted list of (t_i, e_i) tuples for each user u, and \mathcal{M} maps episode identifiers to user and timestamp metadata.

Temporal range queries are answered in O(\log N_{\text{mem}}) via binary search:

  1. Locate lower and upper bounds for the interval [t_{\text{low}}, t_{\text{high}}].
  2. Return the set of episodes within the interval.

Insertion and maintenance are similarly efficient, supporting both single and multi-interval queries. This temporal layer ensures that time-sensitive queries are resolved without inspecting the entire memory, and it provides the first and often most significant reduction in candidate set size.
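As a minimal sketch (field names are assumptions, not the authors' implementation), the temporal tier can be expressed as per-user sorted timelines with binary-searched range bounds:

```python
import bisect

# Sketch of the temporal index T = {L_u, M}: a per-user sorted timeline
# L_u of (timestamp, episode_id) tuples plus an episode metadata map M.
class TemporalIndex:
    def __init__(self):
        self.timelines = {}  # L_u: user -> sorted list of (t, episode_id)
        self.meta = {}       # M: episode_id -> (user, t)

    def insert(self, user, t, episode_id):
        # O(log N) bound search; the list insert keeps the timeline sorted.
        bisect.insort(self.timelines.setdefault(user, []), (t, episode_id))
        self.meta[episode_id] = (user, t)

    def range_query(self, user, t_low, t_high):
        # Binary-search both interval bounds, then slice: O(log N + |result|).
        # Integer episode ids are assumed so the sentinel tuples compare cleanly.
        timeline = self.timelines.get(user, [])
        lo = bisect.bisect_left(timeline, (t_low, -1))
        hi = bisect.bisect_right(timeline, (t_high, float("inf")))
        return [e for _, e in timeline[lo:hi]]
```

The sentinel second elements make the bounds inclusive at both ends without scanning equal-timestamp runs.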

3. Semantic DAG-Tag Index

3.1 LLM-based Tag Generation

For each memory episode, an LLM is prompted to extract 3–8 normalized tags (multi-word, lowercase, with underscore separation) and parent-child relations, forming a directed acyclic graph (DAG) over the tags. If the LLM pipeline fails, fallback embedding-based keyword extraction is employed.
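The stated tag convention (lowercase, multi-word with underscore separation) can be illustrated with a small normalizer; the function name and regex are assumptions for illustration:

```python
import re

# Hypothetical normalizer matching the convention described above:
# lowercase the raw tag, then join its word tokens with underscores.
def normalize_tag(raw: str) -> str:
    words = re.findall(r"[a-z0-9]+", raw.lower())
    return "_".join(words)
```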

3.2 DAG-Tag Data Structure

The DAG G = (V, E) assigns each tag node v \in V the attribute tuple (t, \mathcal{E}, \mathcal{P}, \mathcal{C}, \mathbf{e}):

  • t: tag text
  • \mathcal{E}: associated episodes
  • \mathcal{P}, \mathcal{C}: parent and child tags
  • \mathbf{e} \in \mathbb{R}^d: embedding vector

A specificity monotonicity theorem holds: along any path in the DAG, the specificity \mathcal{S}(v) strictly increases with depth.
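The node attributes listed above map naturally onto a small record type; field names in this sketch are assumed, not the authors' API:

```python
from dataclasses import dataclass, field

# Illustrative tag node carrying the attribute tuple (t, E, P, C, e).
@dataclass
class TagNode:
    text: str                                      # t: normalized tag text
    episodes: set = field(default_factory=set)     # E: associated episode ids
    parents: set = field(default_factory=set)      # P: more general tags
    children: set = field(default_factory=set)     # C: more specific tags
    embedding: list = field(default_factory=list)  # e in R^d
```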

3.3 Query-Tag Routing

Query processing follows these steps:

  1. Embed the query q to obtain \mathbf{e}_q.
  2. Compute the cosine similarity s(q, t) between \mathbf{e}_q and all tag embeddings.
  3. Select the top-k tags by s(q, t).
  4. Expand each top tag through the DAG up to depth D_{\max} to retrieve related tags and their associated episodes.

The complexity is O(k(\log|V| + D_{\max})), with |V| tags and expansion to depth D_{\max}.
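The four routing steps can be sketched over a toy tag graph; the data layout (a dict of tags with embeddings, children, and episode sets) is an assumption for illustration:

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def route_query(query_emb, tags, k=2, d_max=1):
    # Steps 2-3: rank all tags by similarity to the query, keep the top-k.
    ranked = sorted(tags, key=lambda t: cosine(query_emb, tags[t]["emb"]),
                    reverse=True)
    # Step 4: expand each selected tag through its children up to depth d_max,
    # collecting the episodes attached to every visited tag.
    episodes = set()
    for tag in ranked[:k]:
        stack = [(tag, 0)]
        while stack:
            cur, depth = stack.pop()
            episodes |= tags[cur]["episodes"]
            if depth < d_max:
                stack.extend((c, depth + 1) for c in tags[cur]["children"])
    return episodes
```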

3.4 DAG Management

Tag insertion updates the DAG for every new episode in O(\#\text{tags} \cdot \log|V|) time. Query retrieval aggregates the episodes associated with each relevant tag in O(k(\log|V| + D_{\max}) + |E_{\text{cand}}|).

4. Embedding-Tag Co-consolidation

4.1 Semantic Tag Clustering

To improve hardware cache locality in the final similarity search, SwiftMem periodically clusters the tag DAG based on connection patterns (DAG connectivity, episode co-occurrence, and connected components), forming clusters C_i = (I_i, V_i, t_i, s_i):

  • I_i: cluster ID
  • V_i \subseteq V: cluster members
  • t_i: centroid tag
  • s_i: cohesion score in [0, 1], e.g., s_i = \frac{\#\text{actual edges}}{|V_i|(|V_i|-1)/2}
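The cohesion score s_i is the fraction of possible intra-cluster edges that are actually present; a direct computation, with illustrative inputs and an assumed convention for singleton clusters:

```python
# Cohesion of a tag cluster: actual intra-cluster edges divided by the
# |V_i|(|V_i|-1)/2 possible undirected edges among its members.
def cohesion(members, edges):
    n = len(members)
    if n < 2:
        return 1.0  # assumed convention: a singleton cluster is fully cohesive
    inside = sum(1 for (a, b) in edges if a in members and b in members)
    return inside / (n * (n - 1) / 2)
```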

4.2 Co-consolidation Procedure

A physical layout map L records memory offsets for embedding blocks so that tag-clustered embeddings are stored contiguously. Consolidation is triggered when measured fragmentation or low cohesion indicates suboptimal cache use. The process is linear in index size: O(|V| + \text{total episodes}) per pass.
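One way to realize the contiguous layout is to rewrite embeddings into a flat buffer in cluster order, recording each episode's new offset in the layout map; a minimal sketch assuming simple list-based storage, not the authors' allocator:

```python
# Co-consolidation pass: lay out embeddings cluster by cluster so that
# episodes in the same tag cluster are physically adjacent, and record
# each episode's offset in the layout map L. Linear in the number of episodes.
def consolidate(embeddings, clusters):
    # embeddings: episode_id -> vector; clusters: list of episode-id lists.
    buffer, layout = [], {}
    for cluster in clusters:
        for ep in cluster:
            layout[ep] = len(buffer)  # offset of this episode's vector
            buffer.append(embeddings[ep])
    return buffer, layout
```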

5. Indexing and Retrieval Algorithms

5.1 New Episode Indexing

Upon receipt of a new episode e = (u, m, t, \vec{x}):

  1. Insert into the temporal index: O(\log N_{\text{mem}}).
  2. Generate tags and relations via the LLM: O(1) LLM calls.
  3. Update the DAG for each tag: O(\#\text{tags} \cdot \log|V|).
  4. Insert the embedding: amortized O(1) or O(\log N_{\text{mem}}).

The total indexing cost is O(\log N_{\text{mem}} + \#\text{tags} \cdot \log|V|).
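The four indexing steps can be sketched end to end with plain dicts standing in for the three tiers; all names are illustrative, and the LLM tagging call is represented by a caller-supplied function:

```python
# End-to-end indexing sketch: `temporal`, `dag`, and `embeddings` stand in
# for the three index tiers, and `extract_tags` for the LLM tagging step.
def index_episode(episode, temporal, dag, embeddings, extract_tags):
    user, t, episode_id, vector = episode
    temporal.setdefault(user, []).append((t, episode_id))  # step 1 (unsorted here for brevity)
    for tag in extract_tags(episode):                      # step 2: LLM tag generation
        dag.setdefault(tag, set()).add(episode_id)         # step 3: DAG update per tag
    embeddings[episode_id] = vector                        # step 4: embedding insert
```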

5.2 Query Retrieval

For a query q with k_q desired results:

  1. Extract an explicit time window if one is specified, restricting to E_{\text{temp}} in O(\log N_{\text{mem}} + |E_{\text{temp}}|).
  2. Route via the DAG-tag index to obtain semantic candidates E_{\text{semi}}.
  3. Form the candidate set E_{\text{cand}} = E_{\text{temp}} \cap E_{\text{semi}}.
  4. Run the similarity search over E_{\text{cand}}: O(|E_{\text{cand}}| \cdot \log N_{\text{mem}}).
  5. Return the top-k_q episodes.

Overall retrieval complexity is

O(\log N_{\text{mem}} + k(\log|V| + D_{\max}) + |E_{\text{cand}}| \cdot \log N_{\text{mem}}),

which is sub-linear in corpus size due to aggressive candidate reduction.
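The retrieval pipeline reduces to an intersection of the two candidate sets followed by a ranked similarity pass over the survivors; a minimal sketch with assumed data structures and a caller-supplied similarity function:

```python
# Retrieval sketch: intersect temporal and semantic candidates, then rank
# only the surviving episodes by similarity to the query embedding.
def retrieve(query_emb, k_q, temporal_hits, semantic_hits, embeddings, sim):
    candidates = temporal_hits & semantic_hits  # E_cand = E_temp ∩ E_semi
    ranked = sorted(candidates,
                    key=lambda ep: sim(query_emb, embeddings[ep]),
                    reverse=True)
    return ranked[:k_q]
```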

6. Empirical Evaluation

SwiftMem's efficacy is demonstrated on the LoCoMo (10 dialogues, ~24K tokens/dialogue, 1,540 queries) and LongMemEval_S (500 dialogues, ~105K tokens/dialogue) benchmarks. Key metrics include LLM-Judge Score (GPT-4.1-mini), F1, BLEU-1, and search latency.

Performance Comparisons (LoCoMo, GPT-4.1-mini)

Method        LLM-Score   Search latency (ms)   Total latency (ms)
FullContext   0.723       n/a                   5,806
LangMem       0.513       19,829                22,082
Mem0          0.613       784                   3,539
RAG-4096      0.302       544                   2,884
Zep           0.585       522                   3,255
Nemori        0.721       835                   3,448
SwiftMem      0.652       11                    1,289

SwiftMem achieves a 47× search speedup over Zep (522 ms → 11 ms) and a 76× speedup over Nemori (835 ms → 11 ms), with 2.2×–4.5× lower total latency than RAG-4096 and FullContext, respectively.

Retrieval Quality

Method     LLM-Score   F1      BLEU-1
Nemori     0.792       0.519   0.445
SwiftMem   0.704       0.429   0.467

SwiftMem displays a minor reduction in semantic alignment relative to Nemori, but increases lexical precision (BLEU-1).

7. Trade-offs, Limitations, and Prospective Extensions

Trade-offs in SwiftMem's design include increased index maintenance overhead: LLM-driven tag generation and DAG updates add per-write cost that amortizes as writes accrue. The space requirement for multi-dimensional indices (tag embeddings, pointers, timelines) is also higher than for brute-force baselines.

Limitations arise from dependency on LLM-generated tag quality: tags that are insufficiently granular or accurate diminish semantic routing efficacy. Fixed k and D_{\max} parameters may not be optimal across all query distributions. Co-consolidation relies on effective scheduling; poor timing can yield suboptimal clustering.

Potential extensions include:

  • Adaptive k and D_{\max} per query via uncertainty or importance estimation.
  • Approximate nearest-neighbor structures over tags to further reduce the dependence on |V|.
  • Hierarchical temporal trees for finer-grained time-based queries.
  • Online DAG pruning/merging to control tag explosion at scale.
  • Unified indices supporting heterogeneous memory types (procedural, resource).

SwiftMem's three-tier indexing (temporal, semantic, embedding) with periodic co-consolidation yields provably sub-linear retrieval and large practical speedups, with competitive accuracy in long-context evaluation settings (Tian et al., 13 Jan 2026).

