ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents

Published 11 Apr 2026 in cs.AI, cs.OS, and cs.SE | (2604.10352v1)

Abstract: Stateful tool-using LLM agents treat the context window as working memory, yet today's agent harnesses manage residency and durability as best-effort, causing recurring failures: lost state after compaction, bypassed flushes on reset, and destructive writeback. We present ClawVM, a virtual memory layer that manages state as typed pages with minimum-fidelity invariants, multi-resolution representations under a token budget, and validated writeback at every lifecycle boundary. Because the harness already assembles prompts, mediates tools, and observes lifecycle events, it is the natural enforcement point; placing the contract there makes residency and durability deterministic and auditable. Across synthetic workloads, 12 real-session traces, and adversarial stress tests, ClawVM eliminates all policy-controllable faults whenever the minimum-fidelity set fits within the token budget, confirmed by an offline oracle, and adds median <50 microseconds of policy-engine overhead per turn.

Summary

  • The paper presents ClawVM, a harness-level virtual memory abstraction designed to eliminate policy-controllable state management faults in LLM agents.
  • It employs multi-resolution page representations and deterministic prompt assembly to ensure zero observed structural faults across diverse workloads.
  • Experimental results show reduced paging instability at negligible per-turn overhead, supporting ClawVM as a reliable memory management framework.

Motivation and Problem Statement

Stateful LLM agents, particularly those orchestrated by frameworks such as OpenClaw and its derivatives, run long sessions spanning hundreds of tool invocations. These agents rely on the limited context window of an LLM as their working memory, while archiving additional state in external persistent stores. Correct operation hinges on key state (policies, constraints, plans, tool outputs, and conversation context) being resident in the LLM's prompt when needed and durably persisted otherwise. However, contemporary agent harnesses usually offer only best-effort heuristics for managing this memory, leaving systems vulnerable to recurring classes of failure: loss of state during compaction or reset, destructive writeback overwrites, and non-auditable paging decisions. Field reports and issue trackers consistently document these breakdowns, which manifest as repeated tool invocations, inadvertent protocol or rule loss, and silent progress erasure.

Modern retrieval, pruning, and memory plugins mitigate some faults, but none provides a deterministic, enforceable contract around memory residency, durability, or auditability. Current work borrowing OS paging metaphors fails to close this enforcement gap at production scale.

ClawVM Architecture and Memory Contract

ClawVM introduces a structured, harness-level virtual memory abstraction for agent state, formalizing prompt assembly as deterministic page management akin to OS-level virtual memory. Agent state is represented as typed "pages", each with a stable identifier, scope, provenance, and a minimum-fidelity invariant that stipulates the lowest-fidelity representation (full text, compressed, structured fields, or a pointer) to which the page may be degraded under token-budget constraints.
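The page abstraction described above can be sketched as a small data structure. This is a minimal illustration, not ClawVM's actual schema: the class names, the field names, and the example type and scope labels are all assumptions. Only the four representation levels and the notion of a minimum-fidelity invariant come from the summary.

```python
from dataclasses import dataclass
from enum import IntEnum


class Fidelity(IntEnum):
    """Representation levels, ordered from cheapest to richest."""
    POINTER = 0
    STRUCTURED = 1
    COMPRESSED = 2
    FULL = 3


@dataclass(frozen=True)
class Page:
    page_id: str            # stable identifier
    page_type: str          # hypothetical label, e.g. "policy" or "plan"
    scope: str              # hypothetical label, e.g. "session"
    provenance: str         # where the state came from
    min_fidelity: Fidelity  # lowest level this page may be degraded to

    def allows(self, level: Fidelity) -> bool:
        """True if serving the page at `level` respects its invariant."""
        return level >= self.min_fidelity


# A policy page that must never drop below its structured form.
policy = Page("p-001", "policy", "session", "system_prompt", Fidelity.STRUCTURED)
print(policy.allows(Fidelity.POINTER), policy.allows(Fidelity.COMPRESSED))
```

Ordering the levels as integers makes the invariant a single comparison, which keeps the check cheap enough to run on every prompt assembly.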

Multi-Resolution Residency and Paging

ClawVM supports four levels of page representation:

  • Full: Verbatim state excerpt.
  • Compressed: Token-reduced form (e.g., through lossy prompt compression with methods like LLMLingua-2 (Pan et al., 2024)).
  • Structured: Typed schema preserving essential fields for invariants.
  • Pointer: Metadata and a resolvable handle.

Representation variants are precomputed at ingestion/update time; prompt assembly involves only deterministic table lookups and token accounting, avoiding on-the-fly LLM calls under pressure. Minimum-fidelity invariants ensure essential semantics are preserved even at maximum compression.
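A rough sketch of the precompute-then-lookup idea follows. The whitespace tokenizer, the table layout, the function names, and the `store://` handle are stand-ins chosen for illustration. A real harness would presumably use the model's tokenizer and its own storage scheme.

```python
LEVELS = ("pointer", "structured", "compressed", "full")  # cheapest to richest


def count_tokens(text: str) -> int:
    """Stand-in tokenizer: whitespace split (an assumption, not a real tokenizer)."""
    return len(text.split())


def ingest(table: dict, page_id: str, variants: dict) -> None:
    """Store every representation with its token cost at ingestion time."""
    table[page_id] = {
        level: (variants[level], count_tokens(variants[level])) for level in LEVELS
    }


def lookup(table: dict, page_id: str, level: str) -> tuple:
    """Prompt-assembly path: a plain table lookup, no model call."""
    return table[page_id][level]


table = {}
ingest(table, "plan-7", {
    "pointer": "plan-7 -> store://plans/7",  # hypothetical handle
    "structured": "goal=migrate-db step=3 of=5 blocker=none",
    "compressed": "Migrating db; schema done, backfill running, cutover pending.",
    "full": ("We are migrating the database. The schema change is done, "
             "the backfill is running, and the cutover is still pending."),
})

text, cost = lookup(table, "plan-7", "structured")
print(cost, text)
```

Because costs are stored next to the text, token accounting during assembly reduces to summing integers that were computed before any budget pressure arose.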

Lifecycle-Complete and Validated Writeback

At every critical lifecycle boundary (compaction, pruning, reset), ClawVM orchestrates deterministic, non-destructive persistence. The system mandates staged writeback as a transaction, incorporating structured staging, deterministic validation (schema, scope, provenance, merge semantics), and scoped commit. Destructive or out-of-policy updates are rejected with explicit reason codes and logged for auditing.
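The stage, validate, commit flow might look roughly like the sketch below. The reason codes, the required-field schema, the allowed scopes, and the choice to accept or reject each staged update individually are assumptions made for illustration. The summary does not specify ClawVM's actual codes or whether a batch commits atomically.

```python
from dataclasses import dataclass, field

REQUIRED_FIELDS = {"summary"}            # hypothetical schema
ALLOWED_SCOPES = {"session", "project"}  # hypothetical scopes


@dataclass
class Update:
    page_id: str
    scope: str
    provenance: str
    fields: dict
    mode: str = "merge"  # "merge" keeps existing keys; "replace" may drop them


@dataclass
class Store:
    pages: dict = field(default_factory=dict)      # page_id -> dict of fields
    audit_log: list = field(default_factory=list)  # (page_id, outcome) tuples


def validate(store: Store, upd: Update):
    """Return a reason code for a rejected update, or None if it may commit."""
    if not REQUIRED_FIELDS.issubset(upd.fields):
        return "SCHEMA_MISSING_FIELD"
    if upd.scope not in ALLOWED_SCOPES:
        return "SCOPE_NOT_ALLOWED"
    if not upd.provenance:
        return "PROVENANCE_MISSING"
    existing = store.pages.get(upd.page_id, {})
    if upd.mode != "merge" and set(existing) - set(upd.fields):
        return "DESTRUCTIVE_OVERWRITE"  # a replace would silently drop keys
    return None


def writeback(store: Store, staged: list) -> None:
    """Validate each staged update, commit the valid ones, log every decision."""
    for upd in staged:
        reason = validate(store, upd)
        if reason is None:
            store.pages.setdefault(upd.page_id, {}).update(upd.fields)
            store.audit_log.append((upd.page_id, "COMMITTED"))
        else:
            store.audit_log.append((upd.page_id, reason))


store = Store(pages={"plan": {"summary": "v1", "next_step": "deploy"}})
writeback(store, [
    Update("plan", "session", "agent", {"summary": "v2"}, mode="replace"),
    Update("note", "session", "agent", {"summary": "ok"}),
])
print(store.audit_log)
```

The first update is rejected because replacing the page would erase `next_step`. The second commits. Both outcomes land in the audit log, so a later replay can see why state did or did not change.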

Observable Fault Model

Departing from silent failure modes, ClawVM instruments all relevant memory-management decisions and faults. Faults are defined for both residency (e.g., missing refetches, duplicate or repeated tool-calls, plan/protocol loss) and durability (e.g., uncommitted flushes, silent recall errors). Each is made explicitly observable and logged; an offline replay oracle is provided to distinguish policy from physical or semantic failures.
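One way to picture the policy-versus-physical distinction is a replay pass over logged fault events. The event fields and the classification rule below are assumptions: a fault is labeled physical when the minimum-fidelity set could not have fit the budget at that turn, and policy otherwise. Semantic errors are left out because the summary places them outside ClawVM's verification scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultEvent:
    turn: int
    kind: str            # hypothetical label, e.g. "duplicate_tool_call"
    category: str        # "residency" or "durability"
    min_set_tokens: int  # tokens needed by the minimum-fidelity set at this turn
    budget: int          # token budget in effect at this turn


def classify(event: FaultEvent) -> str:
    """Label a fault 'physical' when no policy could have fit the minimum set."""
    return "physical" if event.min_set_tokens > event.budget else "policy"


def replay(log: list) -> dict:
    """Count logged faults by (category, cause) for an offline report."""
    counts = {}
    for event in log:
        key = (event.category, classify(event))
        counts[key] = counts.get(key, 0) + 1
    return counts


log = [
    FaultEvent(3, "duplicate_tool_call", "residency", 900, 4000),
    FaultEvent(9, "uncommitted_flush", "durability", 1200, 4000),
    FaultEvent(14, "plan_loss", "residency", 5000, 4000),
]
print(replay(log))
```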

Deterministic Prompt Assembly

A two-phase selection policy is enforced:

  1. Phase 1: All hard-pinned pages and their minimum-fidelity representations are installed, surfacing invariant/budget non-compliance as observable faults if resource limits are breached.
  2. Phase 2: Marginal upgrades (pointer→structured→compressed→full) are greedily chosen by utility per token, incorporating pin status, recency, scope, and recompute cost.

This decouples structural safety from quality optimization, with the former guaranteed and the latter tunable via utility scoring or oracle integration.
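The two phases can be sketched as a short greedy routine. This is an illustrative reading of the description, not the paper's algorithm. The single scalar `utility` stands in for whatever combination of pin status, recency, scope, and recompute cost the real scorer uses. The fault code is hypothetical, and ties are broken by page identifier to keep the result deterministic.

```python
from dataclasses import dataclass

LEVELS = ["pointer", "structured", "compressed", "full"]  # cheapest to richest


@dataclass
class Candidate:
    page_id: str
    costs: dict      # level -> precomputed token cost
    min_level: str   # minimum-fidelity invariant
    pinned: bool
    utility: float   # hypothetical scalar score per upgrade step


def assemble(pages: list, budget: int):
    chosen, faults, used = {}, [], 0

    # Phase 1: install every hard-pinned page at its minimum fidelity.
    for p in pages:
        if not p.pinned:
            continue
        cost = p.costs[p.min_level]
        if used + cost > budget:
            faults.append((p.page_id, "MIN_FIDELITY_OVER_BUDGET"))
            continue
        chosen[p.page_id] = p.min_level
        used += cost

    # Phase 2: repeatedly take the affordable upgrade with the best utility per token.
    while True:
        options = []
        for p in pages:
            cur = chosen.get(p.page_id)
            if cur == "full":
                continue
            nxt = p.min_level if cur is None else LEVELS[LEVELS.index(cur) + 1]
            extra = p.costs[nxt] - (p.costs[cur] if cur else 0)
            if used + extra <= budget:
                options.append((-p.utility / max(extra, 1), p.page_id, nxt, extra))
        if not options:
            break
        _, page_id, nxt, extra = min(options)  # best ratio, then lowest page_id
        chosen[page_id] = nxt
        used += extra

    return chosen, used, faults


pages = [
    Candidate("policy", {"pointer": 5, "structured": 20, "compressed": 40, "full": 100},
              "structured", True, 10.0),
    Candidate("plan", {"pointer": 5, "structured": 15, "compressed": 30, "full": 60},
              "pointer", True, 5.0),
    Candidate("log", {"pointer": 5, "structured": 25, "compressed": 50, "full": 200},
              "pointer", False, 1.0),
]
print(assemble(pages, budget=80))
```

Phase 1 never consults utility, so changing the scoring function can alter which upgrades are taken but not whether pinned pages meet their invariants. That separation is the property the text attributes to the design.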

Implementation and Experimental Results

ClawVM is implemented in six Python modules, designed to wrap generic agent harnesses that expose lifecycle events and state management hooks. It is retrieval-backend-agnostic and pluggable with external memory sources.

Experiments employ both synthetic and real trace-derived workloads (coding, operations, writing, task automation, etc.), adversarial stressors, and baseline policies including practitioner-optimized compaction+retrieval configurations.

Key results include:

  • Complete elimination of policy-controllable faults across all synthetic, trace-based, and adversarial scenarios whenever minimum-fidelity requirements fit the token budget, matching an offline oracle.
  • Reduction in explicit faults from 67.8 (retrieval-only) and 1.5 (best-practice compaction+retrieval) to zero across 24 configurations.
  • Paging instability (thrash) reduction by up to 77.4% vs. retrieval, and 11.4% vs. compaction baselines.
  • Zero observed faults on 12 real-session traces and 30 diverse synthetic task workloads at tight and loose budgets, versus up to 23% failure for non-ClawVM baselines under budget pressure.
  • Negligible per-turn overhead (median policy-engine time below 50 μs per turn), far smaller than model or tool call durations.
  • Structural safety is robust to heuristic choice: LRU and utility-based scoring achieve identical zero-fault safety, confirming safety derives from enforcement rather than heuristic tuning.

Ablation studies establish that pointer resolution, auto-pinning, and lifecycle-complete writeback are all indispensable for full fault elimination; other features are non-critical to safety but impact prompt-quality optimization.

Adversarial testing demonstrates that failures only arise from intrinsic physical insufficiency (e.g., insufficient token budget for all hard-pinned pages) or semantic errors (outside ClawVM's verification scope).

Implications and Future Directions

ClawVM closes the core architectural gap between best-effort context heuristics and deterministic memory management for stateful, tool-using LLM agents. It enables harness-level, contract-based management of agent state, supporting replayable auditing, robust lifecycle transitions, and explicit surfacing of policy versus physical or semantic errors. By decoupling safety guarantees from heuristic tuning, ClawVM provides a foundation for higher-level optimization—enabling integration of model-driven heuristics, hybrid recall quality improvements, and OS-inspired abstractions.

Practical implications are particularly strong for persistent agents in productivity, automation, and coding domains, where accumulated context and reliable persistence are essential, and failures manifest as user-facing regressions.

Integration with existing systems (MemGPT (Packer et al., 2023), MemOS (Li et al., 4 Jul 2025), Memory OS [2025.emnlp-main.1318], A-MEM [FiM0M8gcct]) is straightforward, as ClawVM's enforcement layer is harness-level and agnostic to underlying retrieval or storage backends. Further, ClawVM's explicit fault instrumentation and replay oracle have value as independent benchmarking and regression diagnostics for evolving agent memory policies.

Theoretical implications suggest that robust agentic memory does not require sophisticated online optimization or adaptive heuristics so long as structural contracts and enforcement primitives are in place. This aligns with classical working set models [Denning, 1968] and reinforces the importance of OS-inspired design over ad hoc RL or deep learning-based memory controllers in the control plane.

Conclusion

ClawVM provides a harness-driven, enforceable virtual memory abstraction for stateful LLM agent systems, eliminating the spectrum of policy-controllable memory management failures. Structural safety is ensured by construction via explicit typed pages, minimum-fidelity invariants, multi-resolution residency, and validated writeback. Empirically, ClawVM achieves zero structural faults and negligible overhead across a wide range of workloads and policy configurations. Its design decouples reliability from heuristic tuning and supports extensible quality optimization, establishing a new default for memory management in persistent LLM agent infrastructures.

Future research can extend ClawVM to multi-agent orchestrators, integrate full semantic validation layers, and conduct live, user-interactive evaluations of prompt-quality under dynamically shifting workloads.

Reference:

"ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents" (2604.10352)
