ShardMemo: Cost-Aware Tiered Memory for LLMs

Updated 5 February 2026
  • ShardMemo is a tiered memory system with rigid budgets for working memory, persistent evidence, and procedural skills, ensuring predictable resource use.
  • It employs masked MoE routing with cost-aware gating and adaptive probe selection, significantly enhancing retrieval quality and efficiency.
  • Empirical results show ShardMemo outperforms previous systems on benchmarks like LoCoMo and HotpotQA, reducing latency and vector scans.

ShardMemo is a tiered, cost-aware external memory system for agentic LLM platforms, designed to deliver scalable, predictable, and budgeted memory retrieval in scenarios involving large, persistent evidence stores and procedural skill libraries. Its architecture addresses core bottlenecks encountered with centralized memory indexes and static partition strategies, particularly as memory volume and concurrent multi-agent execution increase. ShardMemo introduces strict scope constraints, cost-controlled routing, and sharded approximate nearest neighbor (ANN) architectures organized across three functional tiers, with explicit mechanisms for both agent-session state and reusable skills. Empirically, ShardMemo demonstrates significant retrieval quality and efficiency gains over previous agentic LLM memory systems such as GAM, notably on benchmarks like LoCoMo, HotpotQA, and ToolBench (Zhao et al., 29 Jan 2026).

1. Architectural Overview and Formal Specification

ShardMemo is constructed as a three-tiered memory service. Each request $q_t$ is processed through tiers with independent budgets and scope filters:

  • Tier A: Maintains bounded agent- or session-specific "working memory" under a strict token cap $M$, typically for short-lived notes or intermediate plans. For a request $q_t$, working memory is selected via

$$A_t = \mathrm{ReadA}(q_t, \psi^A_t),\quad |A_t| \leq M,$$

where $\psi^A_t$ is a boolean scope predicate.

  • Tier B: Handles persistent evidence in $S$ ANN-indexed shards, probing at most $B_{\mathrm{probe}}$ eligible shards in parallel for each query and merging their results into a unified Top-$K$ evidence set. Scope predicates $\psi^B_t$ mask ineligible shards before any scoring occurs, ensuring that ANN search and scoring are strictly limited to eligible data, as defined by the per-shard metadata $m_j$.
  • Tier C: Stores versioned, schema-validated procedural "skills" (e.g., tool-call templates plus validation tests), organized as a skill library. Skills are retrieved under a strict step budget via similarity search; if retrieval fails or no skill applies, the request falls back to evidence retrieval through Tier B.

Each request carries its own scope predicates (such as $\psi^A_t$ and $\psi^B_t$) and budgets (such as $M$, $B_{\mathrm{probe}}$, and $K$), which tightly control both access and resource consumption (Zhao et al., 29 Jan 2026).
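The Tier A read can be illustrated with a minimal sketch. It assumes whitespace token counting and a newest-first selection policy, and it ignores the query content for simplicity; the `Note` type, the `read_a` function, and the session-based predicate are hypothetical names chosen for illustration, not part of ShardMemo's interface.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Note:
    text: str
    meta: dict  # illustrative metadata, e.g. {"session": "s1"}


def count_tokens(text: str) -> int:
    # Assumption: whitespace tokens stand in for a real tokenizer.
    return len(text.split())


def read_a(notes: List[Note], scope: Callable[[dict], bool], cap: int) -> List[Note]:
    """Return in-scope notes, newest first, with total token count <= cap."""
    selected: List[Note] = []
    used = 0
    for note in reversed(notes):      # assumption: later entries are newer
        if not scope(note.meta):      # scope predicate is checked before anything else
            continue
        cost = count_tokens(note.text)
        if used + cost > cap:         # stop at the first note that would exceed the cap
            break
        selected.append(note)
        used += cost
    return selected
```

Calling `read_a(notes, lambda m: m["session"] == "s1", cap=5)` returns only notes from session `s1` whose combined length stays within five tokens, which mirrors the constraint $|A_t| \leq M$.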

2. Masked Mixture-of-Experts (MoE) Routing and Scope Enforcement

ShardMemo enforces a "scope-before-routing" guarantee by masking out ineligible Tier B shards from both routing and ANN search. For the eligible shards, the system computes masked MoE gating scores from three inputs: a query representation that concatenates the embedded query with additional structured features, a learned summary for each shard, and an estimate of per-shard cost (such as I/O or scan count). A trade-off parameter adjusts cost aversion.

The gating scores are then normalized into per-shard handling probabilities over the eligible set only, with ineligible shards strictly excluded from consideration.
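A minimal sketch of masked, cost-aware gating follows. It assumes the score is a dot-product affinity between the query representation and the shard summary minus a weighted cost, and that normalization is a softmax; both choices, along with the names `masked_routing_probs` and `cost_weight`, are illustrative assumptions rather than the paper's exact definition.

```python
import math
from typing import Dict, List


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def masked_routing_probs(
    query_vec: List[float],
    shard_summaries: Dict[str, List[float]],
    shard_costs: Dict[str, float],
    eligible: Dict[str, bool],
    cost_weight: float,
) -> Dict[str, float]:
    """Softmax over eligible shards only; ineligible shards are never scored."""
    scores = {
        sid: dot(query_vec, summary) - cost_weight * shard_costs[sid]
        for sid, summary in shard_summaries.items()
        if eligible[sid]  # the mask is applied before any scoring happens
    }
    if not scores:
        return {}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {sid: math.exp(s - top) for sid, s in scores.items()}
    total = sum(exps.values())
    return {sid: e / total for sid, e in exps.items()}
```

Because the mask filters the dictionary comprehension itself, an ineligible shard contributes nothing to the normalizer, and raising `cost_weight` shifts probability away from expensive shards.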

Routing employs either a fixed or an adaptive probe count:

  • Fixed: Probes a preset number of the highest-scoring eligible shards.
  • Adaptive: Ranks the eligible shards, then selects the smallest probe count whose cumulative probability reaches a dynamic threshold (clipped between a lower and an upper bound), subject to the global probe cap.

This adaptivity allows confident queries to probe fewer shards, while uncertainty triggers broader shard activation, always bounded by the probe cap $B_{\mathrm{probe}}$ (Zhao et al., 29 Jan 2026).
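The adaptive rule can be sketched as below, assuming the routing probabilities are already available and the dynamic threshold is supplied by the caller (how that threshold is derived is not modeled here). The function name `adaptive_probe_set` and its parameter names are hypothetical.

```python
from typing import Dict, List


def adaptive_probe_set(
    probs: Dict[str, float],
    threshold: float,
    min_threshold: float,
    max_threshold: float,
    probe_cap: int,
) -> List[str]:
    """Pick the fewest top-probability shards whose mass reaches the clipped threshold."""
    tau = min(max(threshold, min_threshold), max_threshold)  # clip the threshold
    ranked = sorted(probs, key=probs.get, reverse=True)      # most probable first
    chosen: List[str] = []
    mass = 0.0
    for sid in ranked[:probe_cap]:                           # never exceed the probe cap
        chosen.append(sid)
        mass += probs[sid]
        if mass >= tau:
            break
    return chosen
```

A peaked distribution such as `{"a": 0.9, "b": 0.06, "c": 0.04}` with a threshold of 0.8 yields a single probe, while a flatter distribution spreads across more shards until either the threshold or the cap is reached.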

3. Parallel Shard-Local ANN Retrieval and Global Merging

For the selected probe set, ShardMemo queries each shard-local ANN index in parallel, unions the resulting candidate sets, re-applies scope filters, and returns the global Top-$K$ results under the request budget. This distributed retrieval avoids the bottlenecks of centralized indexes and enables efficient, cost-controlled evidence selection for downstream agentic tasks.
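The probe, union, filter, and merge steps might look like the following sketch. It assumes exact dot-product search as a stand-in for a real shard-local ANN index and a thread pool for parallelism; the `Item` layout and the `search_shard` and `retrieve` functions are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Item = Tuple[str, List[float], dict]   # (item id, vector, metadata)
Hit = Tuple[float, str, dict]          # (score, item id, metadata)


def search_shard(items: List[Item], query: List[float], k: int) -> List[Hit]:
    # Assumption: brute-force scoring replaces the shard's ANN index.
    scored = [(sum(a * b for a, b in zip(query, vec)), iid, meta)
              for iid, vec, meta in items]
    return heapq.nlargest(k, scored, key=lambda hit: hit[0])


def retrieve(
    shards: Dict[str, List[Item]],
    probe_set: List[str],
    query: List[float],
    scope: Callable[[dict], bool],
    k: int,
) -> List[str]:
    """Search only the probed shards in parallel, then merge to a global top-k."""
    with ThreadPoolExecutor(max_workers=max(1, len(probe_set))) as pool:
        partials = list(pool.map(lambda sid: search_shard(shards[sid], query, k), probe_set))
    # Union the per-shard candidates and re-apply the scope filter per item.
    merged = [hit for part in partials for hit in part if scope(hit[2])]
    top = heapq.nlargest(k, merged, key=lambda hit: hit[0])
    return [iid for _, iid, _ in top]
```

Shards outside `probe_set` are never touched, so the number of vectors scanned per request is bounded by the probed shards rather than by the total store size.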

When training data is available, evidence-to-shard supervision is used to optimize the router, concentrating probability mass on "gold" shards via a multi-positive set-likelihood loss. This loss generalizes cross-entropy to multiple correct shards and directly improves the router's hit rates under both fixed and adaptive strategies (Zhao et al., 29 Jan 2026).
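One common form of a multi-positive set-likelihood loss is the negative log of the total probability assigned to the gold set. The sketch below assumes that form, which may differ in detail from the paper's; `set_likelihood_loss` and the small clamp constant are hypothetical.

```python
import math
from typing import Dict, Set


def set_likelihood_loss(probs: Dict[str, float], gold: Set[str]) -> float:
    """Negative log of the routing probability mass placed on gold shards."""
    mass = sum(p for sid, p in probs.items() if sid in gold)
    # Clamp to avoid log(0) when no probability lands on a gold shard.
    return -math.log(max(mass, 1e-12))
```

With a single gold shard this reduces to ordinary cross-entropy, $-\log p_{\text{gold}}$; with several gold shards the router is rewarded for placing mass on any of them.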

4. Procedural Skill Library and Fallback

Tier C of ShardMemo maintains a library of versioned, schema-checked skills, each comprising tool-call templates and associated deterministic validation tests. Retrieval occurs under a schema/tool scope filter and a tight skill budget. Skill execution is attempted with slot filling; if retrieval fails or no applicable skill is found, the request is transparently passed to Tier B for evidence retrieval using the same probe and result budgets. This design keeps both procedural and evidentiary retrieval available under dynamic, budgeted constraints (Zhao et al., 29 Jan 2026).
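The retrieve-then-fall-back flow can be sketched as follows. This assumes dot-product similarity, a minimum-score cutoff as the test for applicability, and a caller-supplied evidence function standing in for Tier B; the `Skill` record, `retrieve_with_fallback`, and `min_score` are hypothetical, and slot filling is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Skill:
    name: str                     # illustrative, e.g. "search_v2"
    tool: str                     # tool the skill's template calls
    vec: Sequence[float]          # embedding used for similarity search
    validate: Callable[[], bool]  # deterministic validation test


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def retrieve_with_fallback(
    query_vec: Sequence[float],
    skills: List[Skill],
    tool_scope: Callable[[Skill], bool],
    skill_budget: int,
    min_score: float,
    evidence_fallback: Callable[[Sequence[float]], list],
) -> Tuple[str, object]:
    """Try up to skill_budget in-scope skills; otherwise fall back to evidence."""
    in_scope = [s for s in skills if tool_scope(s)]  # scope filter comes first
    ranked = sorted(in_scope, key=lambda s: dot(query_vec, s.vec), reverse=True)
    for skill in ranked[:skill_budget]:
        # Assumption: a skill applies if it is similar enough and passes its test.
        if dot(query_vec, skill.vec) >= min_score and skill.validate():
            return ("skill", skill.name)
    return ("evidence", evidence_fallback(query_vec))
```

Returning a tagged tuple keeps the fallback explicit for the caller, who can tell whether the answer came from a procedural skill or from evidence retrieval.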

5. Empirical Results and Performance Analysis

ShardMemo achieves significant improvements over prior memory systems by combining strict eligibility masking, cost-aware MoE gating, and parallel sharded ANN. On the LoCoMo conversational-memory benchmark using GPT-OSS-120B, it outperforms the strongest baseline (GAM) with gains of +5.70 F1 (single-hop), +5.11 F1 (multi-hop), +6.82 F1 (temporal), and +6.03 F1 (open-domain). Under a fixed-probe regime, ShardMemo lifts ShardHit@3 from 0.67 (cosine-to-prototype) to 0.82, reduces average per-query vector scans from 521 to 414 (–20.5%), and lowers p95 latency from 95 ms to 76 ms (–19.8 ms).

On HotpotQA with large context windows (56K–448K tokens), ShardMemo delivers F1 scores of 63.41/61.88/57.95 across input lengths, consistently outperforming GAM by +1.31, +0.96, and +0.55 F1, respectively. On ToolBench, Tier C skill retrieval yields Precision@3 of 0.97 (+10.2% over embedding-similarity) and StepRed of 1.94 (+7.2%), with mean retrieval latency reduced by ~19%. Budget sweeps consistently place ShardMemo on a better accuracy–efficiency trade-off curve than both centralized and naively partitioned retrieval baselines (Zhao et al., 29 Jan 2026).

6. Scope Guarantees, Predictable Budgets, and ShardMemo’s Significance

By implementing mandatory eligibility masking ahead of all semantic scoring or ANN access, ShardMemo enforces strong multi-tenant, schema, and tool-level safety, never exposing out-of-scope data to routing or retrieval logic. Cost-aware gating and strict probe/result caps deliver predictable per-request resource consumption, essential for concurrent, high-volume agentic LLM deployments. The separation of working state, persistent evidence, and procedural skills into dedicated, budgeted tiers enables composability and system reliability as system complexity and parallelism scale. This formalization and empirical performance indicate ShardMemo’s role in advancing scalable, agentic external memory for LLMs beyond the limitations of central indexes and static partitions (Zhao et al., 29 Jan 2026).

References (1)
