- The paper presents ClawVM, a harness-level virtual memory abstraction designed to eliminate policy-controllable state management faults in LLM agents.
- It employs multi-resolution page representations and deterministic prompt assembly to ensure zero observed structural faults across diverse workloads.
- Experimental results demonstrate significant reductions in paging instability and overhead, validating ClawVM as a reliable memory management framework.
Motivation and Problem Statement
Stateful LLM agents, particularly those orchestrated by frameworks such as OpenClaw and its derivatives, run across hundreds of tool invocations and long-lived sessions. These agents use the limited context window of an LLM as their working memory and archive additional state in external persistent stores. Correctness depends on key state (policies, constraints, plans, tool outputs, and conversation context) being resident in the LLM's prompt when it is needed and durably persisted otherwise. However, contemporary agent harnesses usually offer only best-effort heuristics for managing this memory, leaving systems vulnerable to recurring classes of failure: loss of state during compaction or reset, destructive writeback overwrites, and non-auditable paging decisions. Field reports and issue trackers consistently document these breakdowns, which manifest as repeated tool invocations, inadvertent protocol or rule loss, and silent progress erasure.
Modern retrieval, pruning, and memory plugins mitigate some faults, but none provides a deterministic, enforceable contract around memory residency, durability, or auditability. Current work borrowing OS paging metaphors fails to close this enforcement gap at production scale.
ClawVM Architecture and Memory Contract
ClawVM introduces a structured, harness-level virtual memory abstraction for agent state, formalizing prompt assembly as deterministic page management akin to OS-level virtual memory. Agent state is represented as typed "pages", each with a stable identifier, scope, provenance, and a minimum-fidelity invariant. The invariant names the lowest representation (full text, compressed, structured fields, or a pointer) to which the page may be degraded under token-budget constraints.
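The page abstraction described above can be pictured with a minimal sketch. The class names, field names, and example values below are illustrative assumptions, not ClawVM's actual data model; only the four attributes and the ordering of representation levels come from the text.

```python
from dataclasses import dataclass
from enum import IntEnum


class Fidelity(IntEnum):
    """Representation levels, ordered from least to most detailed."""
    POINTER = 0
    STRUCTURED = 1
    COMPRESSED = 2
    FULL = 3


@dataclass(frozen=True)
class Page:
    page_id: str            # stable identifier
    page_type: str          # hypothetical type label, e.g. "plan" or "policy"
    scope: str              # hypothetical scope label, e.g. "task" or "session"
    provenance: str         # where the state came from
    min_fidelity: Fidelity  # lowest level the page may be degraded to

    def allows(self, level: Fidelity) -> bool:
        """True if showing the page at `level` respects its invariant."""
        return level >= self.min_fidelity


# Hypothetical example: a plan that must stay at least structured.
plan = Page("plan-001", "plan", "task", "agent:planner", Fidelity.STRUCTURED)
print(plan.allows(Fidelity.FULL), plan.allows(Fidelity.POINTER))
```

Making the page immutable and the levels an ordered enum keeps the invariant check a single comparison, which fits the deterministic style the paper describes.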
Multi-Resolution Residency and Paging
ClawVM supports four levels of page representation:
- Full: Verbatim state excerpt.
- Compressed: Token-reduced form (e.g., produced by a lossy compression method such as LLMLingua-2 (Jin et al., 2024)).
- Structured: Typed schema preserving essential fields for invariants.
- Pointer: Metadata and a resolvable handle.
Representation variants are precomputed at ingestion/update time; prompt assembly involves only deterministic table lookups and token accounting, avoiding on-the-fly LLM calls under pressure. Minimum-fidelity invariants ensure essential semantics are preserved even at maximum compression.
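As a rough illustration of precomputed variants, the sketch below stores every representation with its token cost at ingestion time, so that selection later is a table lookup plus arithmetic. The whitespace tokenizer, the function names, and the sample page are assumptions made for the example; a real harness would use the model's tokenizer.

```python
LEVELS = ("pointer", "structured", "compressed", "full")


def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer (assumption): count whitespace-separated words.
    return len(text.split())


def ingest(variants: dict) -> dict:
    """Precompute (text, token_cost) for each representation at ingestion time."""
    return {level: (text, count_tokens(text)) for level, text in variants.items()}


# Hypothetical page: four precomputed forms of one tool output.
table = {
    "tool-42": ingest({
        "full": "pytest run: 3 failed, 120 passed; failures in test_auth.py lines 10, 44, 87",
        "compressed": "pytest: 3 failed 120 passed, test_auth.py",
        "structured": "failed=3 passed=120 file=test_auth.py",
        "pointer": "ref:tool-42",
    })
}


def best_fit(table: dict, page_id: str, budget: int, min_level: str):
    """Return the richest variant that fits `budget` and is at least `min_level`."""
    floor = LEVELS.index(min_level)
    for level in reversed(LEVELS[floor:]):
        text, cost = table[page_id][level]
        if cost <= budget:
            return level, text, cost
    return None  # not even the minimum-fidelity variant fits
```

No model call happens inside `best_fit`; all expensive work is done in `ingest`, which mirrors the separation the paragraph describes.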
Lifecycle-Complete and Validated Writeback
At every critical lifecycle boundary (compaction, pruning, reset), ClawVM orchestrates deterministic, non-destructive persistence. The system mandates staged writeback as a transaction, incorporating structured staging, deterministic validation (schema, scope, provenance, merge semantics), and scoped commit. Destructive or out-of-policy updates are rejected with explicit reason codes and logged for auditing.
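A staged, validated writeback of this kind might look like the sketch below. The reason codes, the required-field schema, the allowed scopes, and the all-or-nothing commit are assumptions chosen for illustration; the source specifies only that validation covers schema, scope, provenance, and merge semantics, and that rejections carry reason codes and are logged.

```python
from dataclasses import dataclass


@dataclass
class Update:
    page_id: str
    scope: str
    provenance: str
    fields: dict
    mode: str  # "merge" or "overwrite" (hypothetical merge-semantics labels)


REQUIRED_FIELDS = {"status"}          # hypothetical schema
ALLOWED_SCOPES = {"session", "task"}  # hypothetical scope policy


def validate(update: Update, store: dict):
    """Return None if the update is acceptable, otherwise a reason code."""
    if not REQUIRED_FIELDS <= update.fields.keys():
        return "SCHEMA_MISSING_FIELD"
    if update.scope not in ALLOWED_SCOPES:
        return "SCOPE_NOT_ALLOWED"
    if not update.provenance:
        return "PROVENANCE_MISSING"
    if update.mode == "overwrite" and update.page_id in store:
        return "DESTRUCTIVE_OVERWRITE"
    return None


def writeback(staged: list, store: dict, audit_log: list) -> bool:
    """Validate every staged update first, then commit all of them or none."""
    rejected = [(u, r) for u in staged if (r := validate(u, store)) is not None]
    if rejected:
        for u, reason in rejected:
            audit_log.append({"page_id": u.page_id, "outcome": "rejected", "reason": reason})
        return False
    for u in staged:
        # Merge keeps existing fields that the update does not mention.
        store[u.page_id] = {**store.get(u.page_id, {}), **u.fields}
        audit_log.append({"page_id": u.page_id, "outcome": "committed", "reason": None})
    return True
```

Validating the whole batch before touching the store is what makes the commit non-destructive in this sketch: a rejected update leaves the store exactly as it was.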
Observable Fault Model
Departing from silent failure modes, ClawVM instruments all relevant memory-management decisions and faults. Faults are defined for both residency (e.g., missing refetches, duplicate or repeated tool-calls, plan/protocol loss) and durability (e.g., uncommitted flushes, silent recall errors). Each is made explicitly observable and logged; an offline replay oracle is provided to distinguish policy from physical or semantic failures.
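One way to picture an observable fault record and offline attribution is sketched below. The record fields, the fault kind strings, and the attribution rule are assumptions: the rule treats a fault as physical when even the minimum-fidelity forms of the hard-pinned pages exceed the budget, and as policy-controllable otherwise. Semantic errors are not modeled here.

```python
from dataclasses import dataclass
from enum import Enum


class FaultClass(Enum):
    RESIDENCY = "residency"
    DURABILITY = "durability"


@dataclass(frozen=True)
class FaultEvent:
    turn: int
    fault_class: FaultClass
    kind: str      # hypothetical label, e.g. "duplicate_tool_call"
    page_id: str


def attribute(min_costs_of_pinned: list, budget: int) -> str:
    """Offline attribution (assumed rule): 'physical' if the pinned minimum
    representations cannot fit in the budget at all, else 'policy'."""
    return "physical" if sum(min_costs_of_pinned) > budget else "policy"


fault_log = [
    FaultEvent(12, FaultClass.RESIDENCY, "duplicate_tool_call", "tool-42"),
    FaultEvent(30, FaultClass.DURABILITY, "uncommitted_flush", "plan-001"),
]
residency_faults = [f for f in fault_log if f.fault_class is FaultClass.RESIDENCY]
```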
Deterministic Prompt Assembly
A two-phase selection policy is enforced:
- Phase 1: All hard-pinned pages and their minimum-fidelity representations are installed, surfacing invariant/budget non-compliance as observable faults if resource limits are breached.
- Phase 2: Marginal upgrades (pointer→structured→compressed→full) are greedily chosen by utility per token, incorporating pin status, recency, scope, and recompute cost.
This decouples structural safety from quality optimization, with the former guaranteed and the latter tunable via utility scoring or oracle integration.
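The two phases can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: each page carries a single scalar `utility` (standing in for the pin status, recency, scope, and recompute-cost terms), token costs are precomputed per level, a non-resident page enters at its minimum level, and ties are broken by input order.

```python
from dataclasses import dataclass

LEVELS = ("pointer", "structured", "compressed", "full")


@dataclass
class Candidate:
    page_id: str
    costs: dict       # level -> precomputed token cost
    min_level: str    # minimum-fidelity invariant
    hard_pinned: bool
    utility: float    # hypothetical scalar utility score


def assemble(pages: list, budget: int):
    chosen, faults, used = {}, [], 0

    # Phase 1: install every hard-pinned page at its minimum-fidelity level.
    for p in pages:
        if not p.hard_pinned:
            continue
        cost = p.costs[p.min_level]
        if used + cost > budget:
            faults.append(("BUDGET_INSUFFICIENT", p.page_id))  # surfaced, not silent
            continue
        chosen[p.page_id] = p.min_level
        used += cost

    # Phase 2: repeatedly take the single best upgrade by utility per extra token.
    while True:
        best = None
        for p in pages:
            current = chosen.get(p.page_id)
            if current is None:
                nxt, extra = p.min_level, p.costs[p.min_level]
            elif current == LEVELS[-1]:
                continue  # already at full fidelity
            else:
                nxt = LEVELS[LEVELS.index(current) + 1]
                extra = p.costs[nxt] - p.costs[current]
            if used + extra > budget:
                continue
            score = p.utility / max(extra, 1)
            if best is None or score > best[0]:
                best = (score, p, nxt, extra)
        if best is None:
            break
        _, p, nxt, extra = best
        chosen[p.page_id] = nxt
        used += extra

    return chosen, used, faults
```

Because Phase 1 runs before any utility score is consulted, changing the scoring function can alter which upgrades are chosen but not whether the pinned minimum representations are installed.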
Implementation and Experimental Results
ClawVM is implemented in six Python modules, designed to wrap generic agent harnesses that expose lifecycle events and state management hooks. It is retrieval-backend-agnostic and pluggable with external memory sources.
Experiments employ both synthetic and real trace-derived workloads (coding, operations, writing, task automation, etc.), adversarial stressors, and baseline policies including practitioner-optimized compaction+retrieval configurations.
Key results include:
- Complete elimination of policy-controllable faults across all synthetic, trace-based, and adversarial scenarios whenever minimum-fidelity requirements fit the token budget, matching an offline oracle.
- Reduction in explicit faults from 67.8 (retrieval-only) and 1.5 (best-practice compaction+retrieval) to zero across 24 configurations.
- Paging instability (thrash) reduction by up to 77.4% vs. retrieval, and 11.4% vs. compaction baselines.
- Zero observed faults on 12 real-session traces and 30 diverse synthetic task workloads at tight and loose budgets, versus up to 23% failure for non-ClawVM baselines under budget pressure.
- Negligible overhead (<50μs median policy-engine time per turn), far smaller than model or tool call durations.
- Structural safety is robust to heuristic choice: LRU and utility-based scoring achieve identical zero-fault safety, confirming safety derives from enforcement rather than heuristic tuning.
Ablation studies establish that pointer resolution, auto-pinning, and lifecycle-complete writeback are all indispensable for full fault elimination; other features are non-critical to safety but impact prompt-quality optimization.
Adversarial testing demonstrates that failures only arise from intrinsic physical insufficiency (e.g., insufficient token budget for all hard-pinned pages) or semantic errors (outside ClawVM's verification scope).
Implications and Future Directions
ClawVM closes the core architectural gap between best-effort context heuristics and deterministic memory management for stateful, tool-using LLM agents. It enables harness-level, contract-based management of agent state, supporting replayable auditing, robust lifecycle transitions, and explicit surfacing of policy versus physical or semantic errors. By decoupling safety guarantees from heuristic tuning, ClawVM provides a foundation for higher-level optimization—enabling integration of model-driven heuristics, hybrid recall quality improvements, and OS-inspired abstractions.
Practical implications are particularly strong for persistent agents in productivity, automation, and coding domains, where accumulated context and reliable persistence are essential, and failures manifest as user-facing regressions.
Integration with existing systems (MemGPT (Packer et al., 2023), MemOS (Li et al., 4 Jul 2025), Memory OS [2025.emnlp-main.1318], A-MEM [FiM0M8gcct]) is straightforward, as ClawVM's enforcement layer is harness-level and agnostic to underlying retrieval or storage backends. Further, ClawVM's explicit fault instrumentation and replay oracle have value as independent benchmarking and regression diagnostics for evolving agent memory policies.
Theoretical implications suggest that robust agentic memory does not require sophisticated online optimization or adaptive heuristics so long as structural contracts and enforcement primitives are in place. This aligns with classical working set models [Denning, 1968] and reinforces the importance of OS-inspired design over ad hoc RL or deep learning-based memory controllers in the control plane.
Conclusion
ClawVM provides a harness-driven, enforceable virtual memory abstraction for stateful LLM agent systems, eliminating the spectrum of policy-controllable memory management failures. Structural safety is ensured by construction via explicit typed pages, minimum-fidelity invariants, multi-resolution residency, and validated writeback. Empirically, ClawVM achieves zero structural faults and negligible overhead across a wide range of workloads and policy configurations. Its design decouples reliability from heuristic tuning and supports extensible quality optimization, establishing a new default for memory management in persistent LLM agent infrastructures.
Future research can extend ClawVM to multi-agent orchestrators, integrate full semantic validation layers, and conduct live, user-interactive evaluations of prompt-quality under dynamically shifting workloads.
Reference:
"ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents" (2604.10352)