BudgetMem: Modular Memory for LLM Agents
- BudgetMem is a modular memory-augmentation framework for LLM agents that provides query-aware trade-offs between performance and resource usage.
- It employs a modular architecture with tunable implementation, reasoning, and capacity tiers, orchestrated via a per-query neural router using reinforcement learning.
- Empirical benchmarks demonstrate smooth accuracy-cost Pareto frontiers, with significant performance improvements and resource optimizations under varying budget constraints.
BudgetMem is a modular memory-augmentation framework for LLM agents, designed to deliver explicit, query-aware trade-offs between performance and resource usage at runtime. Organized as a "budget ladder," BudgetMem lets each stage of memory extraction or construction operate under tunable compute or cost constraints, balancing inference quality against budgetary limits in on-demand query answering and long-horizon reasoning over extended contexts. The framework introduces learnable, per-query routing over three orthogonal "budget tiers": implementation complexity, reasoning style, and model capacity, yielding a unified system that adapts to operational requirements, application demands, and user-defined service-level agreements (Zhang et al., 5 Feb 2026, Alla et al., 7 Nov 2025, Qian et al., 12 Jan 2026).
1. Architectural Organization and Modular Pipeline
BudgetMem acts as an intermediary between a retriever and the final answer generator in an LLM agent. Given a long input history, it segments the text into fixed-size chunks. At query time, a retriever produces a candidate snippet set, which is then processed through a memory extraction pipeline comprising the following modules: Filtering, parallel Extraction, and Summarization, culminating in the Answer Generator (Zhang et al., 5 Feb 2026). Each module exposes three distinct budget tiers—LOW, MID, HIGH—all sharing the same input–output contract but realized through computationally divergent procedures.
A per-query neural router orchestrates passage through these modules. At each module invocation, the router observes a state composed of the query, the intermediate context, and a learned module identifier, and selects an action from {LOW, MID, HIGH}, which determines the budget tier used (Zhang et al., 5 Feb 2026).
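The pipeline pass described above can be sketched as follows. Class and function names here are illustrative stand-ins, not the paper's actual API, and the router is a trivial placeholder for the learned policy:

```python
from enum import Enum

class Tier(Enum):
    LOW = 0
    MID = 1
    HIGH = 2

class Module:
    """Toy stand-in for one pipeline stage (filter / extract / summarize)."""
    def __init__(self, name):
        self.name = name

    def __call__(self, context, tier):
        # All tiers share the same input-output contract; only the
        # underlying procedure (and its cost) would differ in practice.
        return context + [f"{self.name}:{tier.name}"]

def fixed_router(state):
    """Trivial placeholder for the learned policy: always picks MID."""
    return Tier.MID

def run_pipeline(query, candidates, modules, router):
    """One pipeline pass: each module runs at a router-chosen budget tier."""
    context = list(candidates)
    for module in modules:
        # The router sees (query, intermediate context, module identifier).
        tier = router((query, context, module.name))
        context = module(context, tier)
    return context
```

Because every tier honors the same input–output contract, the router can swap tiers per module without any change to the surrounding pipeline.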
2. Budget Tiers: Implementation, Reasoning, and Capacity
BudgetMem introduces three orthogonal axes for constructing the LOW–MID–HIGH budget ladder:
- Implementation Tiering: Varies algorithmic complexity per module, e.g., heuristics (spaCy-based NER) at LOW, compact models (a fine-tuned BERT relation extractor) at MID, and full LLM prompting at HIGH. For entity extraction, the per-query cost scales from zero (LOW) to approximately 2¢ (HIGH).
- Reasoning Tiering: Fixes model backbone but modifies inference behavior, ranging from direct outputs (LOW), to chain-of-thought (MID), to multi-step or reflection-based inference (HIGH). This axis modulates token overhead (e.g., 100, 350, 600 tokens for entity extraction across tiers).
- Capacity Tiering: Fixes inference style and implementation but scales model size, e.g., from qwen2.5B-instruct (LOW) up to qwen3-80B (HIGH). Costs scale with model size, and so does extractive performance.
Each axis targets a distinct cost-quality regime: implementation tiering for minimal compute, reasoning tiering for fine-grained quality within a tight cost band, and capacity tiering for maximizing quality under relaxed budgets. All tiering strategies maintain the pipeline’s modular input–output compatibility (Zhang et al., 5 Feb 2026).
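As a toy illustration of how tier choices translate into per-query cost, the reasoning-axis token overheads quoted above (direct / chain-of-thought / reflection for entity extraction) can be priced; the per-token price below is a made-up assumption:

```python
# Per-tier token overheads for the reasoning axis, figures from the text;
# the per-token price is an invented value for illustration only.
REASONING_TOKENS = {"LOW": 100, "MID": 350, "HIGH": 600}

def pipeline_cost(tiers, price_per_token=1e-5):
    """Dollar cost of one pipeline pass given the tier chosen per module."""
    return sum(REASONING_TOKENS[t] for t in tiers) * price_per_token
```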
3. Neural Routing Policy and Reinforcement Learning Formulation
The tier-selection problem is framed as an episodic reinforcement learning (RL) task. Each episode is one pipeline pass for a given query. At each invocation, the router observes the state and selects a tier. The reward combines normalized task performance P (e.g., Judge/F1) and normalized cost C, balanced by a scaling factor λ and a user-specified α controlling cost emphasis: R = P − α·λ·C.
The router policy is trained with Proximal Policy Optimization (PPO) to maximize expected reward under the standard clipped surrogate objective. This formulation enables query-adaptive routing even though module calls are non-differentiable, facilitating fast policy learning and operational flexibility (Zhang et al., 5 Feb 2026).
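A minimal sketch of this reward, assuming the simple linear form R = P − α·λ·C implied by the text (the default λ = 1.0 is an assumption):

```python
def reward(p_norm, c_norm, alpha, lam=1.0):
    """Episode reward: normalized performance minus alpha-weighted,
    lambda-scaled normalized cost. The linear form is assumed from the
    text's description; the paper may use a different shaping."""
    return p_norm - alpha * lam * c_norm
```

At α = 0 the reward reduces to pure task performance (the performance-first regime); raising α penalizes cost and pushes the router toward cheaper tiers.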
4. Empirical Analysis: Quality–Cost Pareto Frontiers and Trade-offs
Benchmarks on LoCoMo, LongMemEval, and HotpotQA confirm that, across all axes, BudgetMem surpasses strong baselines in high-budget regimes and produces smooth accuracy–cost Pareto frontiers as budgets tighten (as α increases). Quantitative results include:
- Performance-first (α = 0) regime: BudgetMem outperforms prior baselines, e.g., the Judge score on LongMemEval rises from ~48% to ~61% at approximately 0.8¢ extraction cost.
- Tightening budgets (increasing α): Quality drops gracefully while cost may fall by up to 80%. For long documents, F1 drops only 1% with a 72.4% reduction in memory footprint (Alla et al., 7 Nov 2025).
- Tiering axis efficacy: Implementation tiering yields rapid cost reductions at low–mid budgets, reasoning tiering affords fine control over inference cost, and capacity tiering covers the widest cost–quality span, particularly relevant for premium tiers or quality-sensitive deployments (Zhang et al., 5 Feb 2026).
Retrieval-stage sensitivity analysis shows that retrieving more than five candidate chunks yields diminishing returns and can even reduce overall performance, underscoring the need to align the retrieval budget with the downstream modules.
5. Selective Memory, Salience Scoring, and Gating
BudgetMem’s selective memory policies enable fixed-window LLMs to process arbitrarily long documents by learning what to retain. Each text chunk is scored for salience using interpretable features: entity density, average TF-IDF, discourse markers, position bias, and numeric content. A learned gating mechanism (write policy) enforces the storage budget by computing a salience score for each chunk from these features.
Only the top-B highest-scoring chunks are stored, where B is the chunk or token budget. At query time, BM25 sparse retrieval (optionally hybridized with dense retrieval and reranking) fetches the most relevant chunks, which are then packed into the LLM prompt (Alla et al., 7 Nov 2025).
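A minimal sketch of salience scoring and budgeted gating, assuming a linear combination over the interpretable features (the learned gate in the paper may be more elaborate; feature names and weights here are illustrative):

```python
def salience(features, weights):
    """Linear salience score over interpretable chunk features
    (entity density, TF-IDF, position bias, ...); linearity is assumed."""
    return sum(weights[name] * value for name, value in features.items())

def write_policy(chunk_features, budget, weights):
    """Gating/write policy: return the indices of the top-`budget` chunks
    by salience, preserving document order among the kept chunks."""
    ranked = sorted(range(len(chunk_features)),
                    key=lambda i: salience(chunk_features[i], weights),
                    reverse=True)
    return sorted(ranked[:budget])
```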
Ablation studies demonstrate that entity density and position bias are the most influential features for optimizing F1 under a fixed memory budget, and that learned selective gating outperforms naive heuristics, preserving up to 99% task F1 with less than 28% of original memory usage for long documents.
6. Integration with Executive Memory and Long-Horizon Reasoning
Extensions such as MemoBrain realize BudgetMem principles in multi-step, tool-augmented reasoning agents (Qian et al., 12 Jan 2026). Here, an executive memory module constructs a dependency graph of “thought” nodes, assigns salience, and maintains a compact “reasoning backbone” under fixed token budgets using greedy or beam-based knapsack selection algorithms. Memory management employs pruning (“flush”) for low-salience nodes and folding for completed sub-trajectories, ensuring the working context never exceeds the maximal allowed token window while retaining salient, logically-dependent reasoning milestones.
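The greedy variant of knapsack backbone selection can be sketched as follows; the node representation and the salience-per-token ratio heuristic are assumptions for illustration, not MemoBrain's exact algorithm:

```python
def greedy_backbone(nodes, token_budget):
    """Greedy knapsack over thought nodes: keep the nodes with the best
    salience-per-token ratio that still fit the token budget.
    `nodes` holds (node_id, salience, tokens) tuples."""
    kept, used = [], 0
    for node_id, sal, tokens in sorted(nodes, key=lambda n: n[1] / n[2],
                                       reverse=True):
        if used + tokens <= token_budget:
            kept.append(node_id)
            used += tokens
    return sorted(kept)
```

A beam-based variant would instead keep several partial selections and expand the most promising ones, trading extra compute for a better backbone.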
Integration with the agent’s workflow is asynchronous (“copilot-style”), interleaving memory updates with reasoning, thus minimizing latency overhead. Experiments on GAIA, WebWalkerQA, and BrowseComp-Plus demonstrate consistent Pass@1 improvements—e.g., DeepResearch-30B-A3B: 68.9 → MemoBrain: 74.5 (GAIA)—and allow 2–3× more tool calls before memory truncation.
7. Practical Deployment and Guidelines
BudgetMem offers explicit deployment guidelines (Zhang et al., 5 Feb 2026):
- For latency/cost-sensitive scenarios (e.g., edge inference), prioritize implementation tiering and set a high α to maximize use of rule-based modules.
- In balanced settings (e.g., chatbots under limited compute), reasoning tiering with α in $0.1$–$0.3$ delivers quality gains at modest cost increments.
- For quality-critical use cases (e.g., complex multi-hop QA), capacity tiering with low α enables large models in all modules, matching or surpassing offline memory baselines.
- Production systems can dynamically tune α at the query or user level to provide differentiated service.
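A hypothetical per-plan α lookup implementing the last guideline; the plan names and α values are invented for illustration:

```python
# Invented mapping from service plan to cost-emphasis alpha: edge users
# get aggressive cost savings, premium users get performance-first routing.
ALPHA_BY_PLAN = {"edge": 0.8, "standard": 0.2, "premium": 0.0}

def alpha_for(plan, default=0.3):
    """Resolve alpha per user/plan so the router can differentiate service."""
    return ALPHA_BY_PLAN.get(plan, default)
```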
BudgetMem thus unifies module-wise budget control under a learnable, query-aware neural policy, enabling LLM agents to adapt resource allocation according to application and operational constraints. By disentangling cost regimes along implementation, reasoning, and capacity axes within a modular skeleton, BudgetMem advances the state of cost-efficient, high-accuracy long-context agent reasoning (Zhang et al., 5 Feb 2026, Alla et al., 7 Nov 2025, Qian et al., 12 Jan 2026).