- The paper presents a novel source-level rewriting approach that expands agent adaptation from text artifacts to harness logic, enabling Turing-complete modifications.
- The system architecture integrates modular components, an external coding-agent interface, and a host-daemon to manage container lifecycles and ensure state continuity.
- Empirical results show harness-level improvement after one evolution iteration: the mean grader score rises from 0.25 to 0.61, and one task reaches 0.9049, which the paper presents as support for the deterministic evolution paradigm.
Source-Level Self-Evolution in Autonomous Agent Systems: An Analysis of MOSS
Motivation and Scope Expansion
MOSS introduces source-level self-rewriting as the central mechanism for autonomous adaptation in agent systems. Earlier self-evolving agent architectures limited their editable scope to text artifacts (skills, prompts, memory schemas, workflow graphs), leaving the codebase and harness logic immutable after deployment. The paper describes this restriction as physical: text artifacts cannot reach failures whose root cause lies in routing, session lifecycle, state invariants, or dispatch logic embedded in the harness. The paper asserts that lifting the editable surface to source code has four consequences. It gives agents Turing-complete capacity for adaptation; it is strictly more general than text-mutable evolution; it produces deterministic behavioral upgrades rather than prompt modifications that take effect only if the model complies; and it avoids erosion through context drift. Together these address a class of structural failures that prior frameworks could not reach.
System Architecture
MOSS operationalizes source-level adaptation with a modular system comprising a substrate agent container, a CLI-driven control surface, a pluggable external coding-agent interface, a host-daemon for orchestration, and ephemeral trial workers for candidate verification. The substrate, demonstrated using OpenClaw, hosts a user-facing agent with persistent state and exposes evolution capabilities through a moss evo CLI embedded in its shell interface. The external coding-agent is decoupled via a four-method runner interface, with integrated support for multiple providers (Claude Code, OpenAI Codex, DeepSeek-TUI, OpenCode) to avoid LLM vendor lock-in.
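The summary does not name the four methods of the runner interface, so the following is only a guess at its general shape. The method names (start, send, collect, close), the CodingAgentRunner protocol, and the EchoRunner stand-in are all hypothetical and are not taken from MOSS.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class CodingAgentRunner(Protocol):
    """Hypothetical four-method contract; the names are assumptions, not MOSS's API."""

    def start(self, workdir: str) -> None: ...
    def send(self, prompt: str) -> None: ...
    def collect(self) -> str: ...
    def close(self) -> None: ...


class EchoRunner:
    """Stand-in provider that records prompts instead of calling an LLM."""

    def __init__(self) -> None:
        self.workdir = None
        self.prompts = []
        self.closed = False

    def start(self, workdir: str) -> None:
        self.workdir = workdir

    def send(self, prompt: str) -> None:
        self.prompts.append(prompt)

    def collect(self) -> str:
        return "\n".join(self.prompts)

    def close(self) -> None:
        self.closed = True


def run_stage(runner: CodingAgentRunner, workdir: str, prompt: str) -> str:
    """Drive one pipeline stage through any runner that satisfies the protocol."""
    runner.start(workdir)
    try:
        runner.send(prompt)
        return runner.collect()
    finally:
        # Always release the provider session, even if the stage raises.
        runner.close()
```

Because run_stage depends only on the protocol, swapping one provider for another would mean supplying a different class with the same four methods, which is the decoupling the architecture description points at.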
The host-daemon, running outside the substrate container, manages container lifecycles, batch evidence processing, code modification orchestration, and swap logic, ensuring atomic state preservation across agent upgrades. User state volumes are persistently mounted to guarantee seamless state continuity after container swaps.
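A minimal sketch of the swap logic, assuming the consent and health-probe gating described under Directed Evolution Workflow. The function name swap_container, its parameters, and the probe callback are hypothetical; a real daemon would call a container runtime here.

```python
from typing import Callable


def swap_container(current_image: str,
                   candidate_image: str,
                   state_volume: str,
                   user_consents: bool,
                   health_probe: Callable[[str, str], bool]) -> str:
    """Return the image that ends up serving the user.

    The state volume is passed through unchanged in every branch, which is
    how this sketch models state continuity across a swap.
    """
    if not user_consents:
        # No explicit consent: keep the running container as it is.
        return current_image
    # The candidate is probed with the same user state volume mounted.
    if health_probe(candidate_image, state_volume):
        return candidate_image
    # A failed probe rolls back to the previous image.
    return current_image
```

The point of the design is that the image changes while the volume does not, so a rollback restores behavior without touching user state.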
Directed Evolution Workflow
MOSS implements a directed, deterministic evolution paradigm informed by concrete production-failure evidence, replacing the exploratory, benchmark-driven style found in prior minimal-scaffold agents. Failure evidence is batch-curated by auto-scanning session logs and user conversational flags, and evolution is triggered when sufficient batch samples accumulate.
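The batch trigger can be pictured as a simple threshold on accumulated evidence. This is a sketch under stated assumptions: the EvidenceBatch class, the source labels "log_scan" and "user_flag", and the default threshold of 3 are illustrative choices, since the summary does not give the actual batch size or data model.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceBatch:
    """Accumulates failure evidence; the threshold is an assumed parameter."""

    threshold: int = 3
    items: list = field(default_factory=list)

    def add(self, source: str, detail: str) -> bool:
        """Record one evidence item and report whether evolution should trigger."""
        if source not in ("log_scan", "user_flag"):
            raise ValueError(f"unknown evidence source: {source}")
        self.items.append((source, detail))
        return self.ready()

    def ready(self) -> bool:
        """True once enough samples have accumulated to start an evolution run."""
        return len(self.items) >= self.threshold
```

Both evidence channels feed the same batch, so a run can be triggered by log scans alone, user flags alone, or a mix of the two.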
The core of the evolution process is a bounded iteration loop instantiated as a seven-stage pipeline—Locate, Plan, Plan-Review, Implement, Code-Review, Task-Evaluate, and Verdict—each explicitly separated for quality gating and context isolation. Modification is achieved via stage-specific invocations of the external coding-agent, with planning and code reviews employing multi-round retry loops to enforce correctness. Runtime verification is performed in production-equivalent containers using ephemeral trial workers, scoring keypoints qualitatively and gating convergence on batch-specific improvement. Container swaps occur only upon explicit user consent, with health-probe-gated rollback guaranteeing operational resilience against regressions.
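The control flow of the seven-stage loop can be sketched as follows. The stage names come from the text; everything else is an assumption, including the handler signature, the retry limit, and the iteration bound. As a simplification, a review retry is modeled by re-invoking the review handler, which is assumed to revise and re-check internally.

```python
from typing import Callable, Dict

STAGES = ["Locate", "Plan", "Plan-Review", "Implement",
          "Code-Review", "Task-Evaluate", "Verdict"]
REVIEW_STAGES = {"Plan-Review", "Code-Review"}

# A handler receives the shared context and returns True when its stage passes.
StageFn = Callable[[dict], bool]


def run_iteration(handlers: Dict[str, StageFn], ctx: dict,
                  max_review_rounds: int = 3) -> bool:
    """Run the seven stages in order; only review stages may retry."""
    for stage in STAGES:
        rounds = max_review_rounds if stage in REVIEW_STAGES else 1
        passed = False
        for _ in range(rounds):
            ctx.setdefault("trace", []).append(stage)
            if handlers[stage](ctx):
                passed = True
                break
        if not passed:
            # A failed stage ends this iteration early.
            return False
    return True


def evolve(handlers: Dict[str, StageFn], ctx: dict,
           max_iterations: int = 3) -> bool:
    """Bounded outer loop: stop at the first iteration that passes every stage."""
    for i in range(max_iterations):
        ctx["iteration"] = i + 1
        if run_iteration(handlers, ctx):
            return True
    return False
```

Keeping each stage behind its own handler mirrors the separation the paper describes for quality gating and context isolation: a stage sees only what the shared context carries forward.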
Empirical Results
On a controlled batch of claweval benchmark tasks—comprising SLA compliance audits and restock-chain diagnostics in both Chinese and English—MOSS executes a complete evolution cycle. The original OpenClaw agent exhibits harness-level defects, including partial result reporting and misattributed outputs, rooted in tool-result mediation and dispatch synthesis logic.
The iteration-1 outcome shows substantive harness-level remediation: the mean grader score rises from 0.25 to 0.61, and one task reaches 0.9049, well above the 0.75 pass threshold. Transcript analysis confirms that gaps in multi-tool execution coverage were corrected and semantic annotation improved, supporting the upgrade at the behavioral level. Crucially, all modifications landed at the harness layer, which text-only evolution cannot reach.
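The reported figures can be put in proportion with a short calculation. The numbers are those quoted above; comparing the batch mean against the 0.75 threshold is an assumption made here for illustration, since the summary states the threshold only in connection with the individual task.

```python
mean_before = 0.25
mean_after = 0.61
pass_threshold = 0.75
best_task_score = 0.9049

# Absolute and relative change in the mean grader score.
absolute_gain = mean_after - mean_before
relative_gain = absolute_gain / mean_before

# Which figures clear the pass threshold.
best_task_passes = best_task_score >= pass_threshold
mean_clears_threshold = mean_after >= pass_threshold

print(f"absolute gain: {absolute_gain:.2f}")
print(f"relative gain: {relative_gain:.0%}")
print(f"best task passes: {best_task_passes}")
print(f"mean clears threshold: {mean_clears_threshold}")
```

The mean improves by 0.36 in absolute terms, a 144% relative increase, yet 0.61 remains below 0.75. The result is therefore a large improvement after a single iteration, not evidence that every task in the batch passes.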
Comparative Analysis and Positioning
MOSS is positioned as the first system in its category to unify source-level adaptation with production-grade deployment. Minimal-scaffold agents (SICA, Darwin Gödel Machine, HyperAgents) established self-modification feasibility but relied on static benchmarks and lacked integration with persistent, live agent substrates. Application-level systems (Hermes Agent, SkillClaw, GenericAgent, EvoAgentX) were able to evolve skills or prompts but categorically excluded codebase and harness edits. MOSS subsumes both paradigms, delivering deterministic evolution anchored to production failure, with runtime verification and operational safety guarantees, thus empirically resolving the physical scope limitation.
Practical and Theoretical Implications
Practically, MOSS expands the agentic evolution frontier to robust, persistent deployments, empowering agent systems to rectify deep structural errors autonomously and maintain behavioral integrity and state continuity. The paradigm signifies a shift from artifact-level learning to system-level adaptation, which is critical for addressing maturity-induced complexity in agent harnesses.
Theoretically, the approach instantiates universal search spaces for agent modification, invites future research into safe automated code generation, and brings the field closer to practical self-improving systems capable of continued deployment in rich user environments. With deterministic verification, user consent gating, and health-probe rollback, MOSS represents the confluence of agentic autonomy with production reliability.
Emerging directions include refining evidence curation, enhancing planning and code review logic, scaling evolution depth, and exploring more generalized integration modalities for heterogeneous agent substrates.
Conclusion
MOSS argues that source-level adaptation is strictly more general than the text-mutable evolution previously employed in autonomous agents, granting access to harness-level modification, deterministic effect, and erosion resistance. The system closes the loop from curated production-failure evidence through deterministic modification to verified deployment, exemplified by empirical grader score lifts without human intervention. By expanding the editable substrate from skill and prompt configurations to harness logic, MOSS targets fundamental adaptation bottlenecks and proposes a new operational baseline for self-evolving agent systems (2605.22794).