- The paper introduces a code-first paradigm where MatClaw autonomously writes Python to execute complex workflows in materials research.
- The paper leverages a four-layer memory architecture and retrieval-augmented generation to boost API accuracy to 97–99% and ensure robust performance.
- The paper demonstrates end-to-end applications in ML force field distillation, Curie temperature prediction, and domain wall propagation with adaptive execution.
MatClaw: An Autonomous Code-First LLM Agent for End-to-End Materials Exploration
Introduction and Motivation
MatClaw presents a paradigm shift in autonomous scientific agents for computational materials science by explicitly rejecting pipeline-bounded, tool-call–dependent architectures in favor of a code-first approach. Unlike prior agents that constrain workflows to fixed, manually engineered toolkits or specialized code pipelines, MatClaw writes and executes Python natively, orchestrating any installed domain library—enabling arbitrary composition, conditional branching, and dynamic recovery in heterogeneous high-performance computing (HPC) environments. This repositions the LLM agent not as a string manipulator but as an integrated scientific co-worker capable of complex, flexible, long-horizon workflow execution, without requiring the manually engineered toolset to grow as the task space expands.
Architecture and System Design
The code-first paradigm in MatClaw leverages the maturity and embedded domain expertise of established Python libraries (e.g., pymatgen, atomate2, jobflow, DeePMD-kit) as the action space for the agent. The LLM agent directly generates structured Python code within a four-field output schema (phase, plan, code, summary), which encourages coherent pipeline tracking, explicit planning, code correctness, and efficient context summarization.
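The four-field output contract above can be sketched as a small parser; this is an illustrative sketch assuming a JSON reply format, not the paper's actual schema or class names:

```python
import json
from dataclasses import dataclass

@dataclass
class AgentStep:
    """One structured agent turn (hypothetical JSON contract)."""
    phase: str     # which workflow stage the agent believes it is in
    plan: str      # short natural-language plan for this step
    code: str      # Python to execute on the cluster
    summary: str   # compact recap, kept for later context pruning

def parse_step(raw: str) -> AgentStep:
    """Parse the model's raw JSON reply, keeping only the four required fields."""
    fields = json.loads(raw)
    return AgentStep(**{k: fields[k] for k in ("phase", "plan", "code", "summary")})
```

Validating each turn against a fixed schema is what makes the per-step summaries reusable by the pruning and memory layers described below.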
MatClaw’s execution on remote HPC clusters is facilitated by core architectural elements:
- Four-layer Memory Architecture:
- In-context working memory (LLM context) maintains active state.
- Episodic conversation history is managed via persistent append-only logs, with rapid retrieval via precomputed per-step summaries.
- Semantic experience log stores cross-session lessons (e.g., operational constraints, error patterns) as dynamically reloadable text—immediately modifiable by either agent or human.
- Database external grounding provides direct access to numerical information from completed jobs (energies, structures), essential for long-horizon recall after context truncation.
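A minimal sketch of how the four layers could be wired together; the class, file names, and in-memory database stand-in are illustrative assumptions, not the paper's implementation:

```python
import json
from pathlib import Path

class AgentMemory:
    """Toy four-layer memory: working context, episodic log, lessons, results DB."""

    def __init__(self, root: Path):
        self.working = []                    # layer 1: in-context working memory
        self.log = root / "episodic.jsonl"   # layer 2: append-only conversation log
        self.lessons = root / "lessons.txt"  # layer 3: editable experience log
        self.db = {}                         # layer 4: stand-in for a job-results DB

    def record(self, step: dict, summary: str) -> None:
        """Add a step to working memory and persist it with its summary."""
        self.working.append(step)
        with self.log.open("a") as f:
            f.write(json.dumps({"step": step, "summary": summary}) + "\n")

    def recall_summaries(self) -> list[str]:
        """Cheap retrieval path: precomputed per-step summaries, not full steps."""
        with self.log.open() as f:
            return [json.loads(line)["summary"] for line in f]
```

The key design point is that layer 2 is append-only and summary-indexed, so the agent can recover any pruned step without replaying full transcripts.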
- Zone-based Context Pruning:
Context is conservatively capped below model-advertised windows due to empirically observed "context rot." Aggressive message pruning produces multiple zones: recent steps are protected, mid-history steps are truncated or masked, and the deepest history is replaced by a marker. Full recovery is possible via the persistent log and lightweight pre-generated summaries.
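The zone scheme can be sketched as a pure function over the message list; zone sizes and marker text here are illustrative assumptions:

```python
def prune_context(messages: list[str], summaries: list[str],
                  protect: int = 4, mid: int = 4) -> list[str]:
    """Zone-based pruning sketch: [marker] + summarized mid-history + recent steps.

    `summaries[i]` is the precomputed per-step summary for `messages[i]`.
    """
    if len(messages) <= protect + mid:
        return list(messages)            # short history: nothing to prune
    cut = len(messages) - protect - mid  # everything before `cut` is elided
    recent = messages[-protect:]                            # protected zone
    middle = [f"[summary] {s}" for s in summaries[cut:cut + mid]]  # masked zone
    return ["[elided: see persistent log]"] + middle + recent
```

Because the deepest zone collapses to a single marker, the pruned context stays bounded regardless of workflow length, while the episodic log keeps every original step recoverable.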
- Code-aware RAG:
Structure-aware chunking (especially tree-sitter–based methods) and lexical retrieval (BM25 with reciprocal rank fusion) inject relevant domain code and documentation directly into the context at generation time. This demonstrably improves per-step API accuracy from as low as 70–85% (depending on library popularity) to 97–99%.
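Reciprocal rank fusion itself is a simple, well-known scoring rule; a minimal sketch of the standard formula (score = Σ 1/(k + rank), with the conventional k = 60) for merging ranked chunk-ID lists from several retrievers:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked document-ID lists with the standard RRF formula.

    Each input list is ordered best-first; the output is the fused ordering.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF rewards documents that rank well under *any* retriever without requiring score calibration, which is why it pairs naturally with heterogeneous signals like BM25 and structure-aware embeddings.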
These design choices render MatClaw resilient to catastrophic context loss, flexible for multi-code compositions, and robust under extended workflow runtimes.
End-to-End Demonstrations in Materials Research
Three end-to-end demonstrations on ferroelectric CuInP2S6 (CIPS) were undertaken to validate the system’s empirical performance and probe the limits of agentic autonomy. Tasks encompassed ML force field distillation via active learning, Curie temperature prediction from MD, and heuristic (parameter-space) search for domain wall propagation regimes.
Task 1: ML force field distillation.
- The baseline agent failed due to insufficient configuration-space coverage (1 ps trajectories miss the relevant dynamical regime), despite correct code and workflow logic.
- When guided to extract methodology from literature (DP-GEN scheme) and given a minimal sampling timescale constraint, the agent automatically internalized sophisticated strategies (multi-model sigma bands, filtering for data diversity/validity) and achieved a physically robust student model after two active learning iterations.
Task 2: Curie temperature (Tc) discovery.
- Unconstrained, the agent completed the workflow, yet failed to detect non-equilibrium artifacts (e.g., non-monotonic order parameters due to inadequate MD convergence).
- Augmenting the prompt with a simple convergence validation step led the agent to self-diagnose order parameter misalignments, adapt analysis strategy, and converge on Tc with a 3.5× lower uncertainty in half the steps.
Task 3: Heuristic (E, T) search for domain wall propagation.
- Relying on a well-defined quantitative detection metric, the agent autonomously conducted a physics-driven search through the two-dimensional (E, T) parameter space, adaptively selected new simulations, and identified an optimal condition (wall velocity ∼640 m/s) in 7 search iterations versus the hundreds required by a grid sweep.
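The paper's agent selects new simulations by physical reasoning rather than a fixed algorithm, but the efficiency argument can be illustrated with a simple greedy neighbor search over an (E, T) grid; everything here (metric, step sizes, budget) is a hypothetical stand-in:

```python
def hill_climb(metric, start, step=(1.0, 50.0), budget=40):
    """Greedy neighbor search over (E, T): evaluate the 8 neighbors of the
    current best point and move whenever one improves the metric.

    `metric(E, T)` is a stand-in for the quantitative wall-propagation
    detection metric; higher is better.
    """
    best, best_val = start, metric(*start)
    evals = 1
    while evals < budget:
        neighbors = [(best[0] + dx, best[1] + dy)
                     for dx in (-step[0], 0.0, step[0])
                     for dy in (-step[1], 0.0, step[1])
                     if (dx, dy) != (0.0, 0.0)]
        improved = False
        for pt in neighbors:
            if evals >= budget:
                break
            val = metric(*pt)
            evals += 1
            if val > best_val:
                best, best_val, improved = pt, val, True
        if not improved:
            break  # local optimum within the step resolution
    return best, best_val
```

Even this crude strategy needs tens of evaluations where a dense grid needs hundreds; a reasoning agent that also exploits physical trends can do better still, which is the behavior the task demonstrates.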
Failure Analysis:
Common agentic weaknesses emerged only when domain expertise was tacit, underdocumented, or experiential—for instance, optimal MD sampling times or convergence criteria. These gaps were not failures of code or logic, but of embedded practical knowledge outside of LLM training datasets.
Effective Interventions:
- Literature self-learning — agent reads and extracts methodology into persistent memory.
- High-level expert constraints — minimal, human-supplied requirements (simulation length, convergence validation) rectify domain expertise gaps.
Multiple-choice API and documentation QA benchmarks (pymatgen code, VASP wiki, jobflow-remote) were employed to characterize RAG’s effect on per-question accuracy:
- Without RAG: API accuracy correlated with library popularity (pymatgen: 90%, VASP wiki: 86%, jobflow-remote: 76%). Compound error rates render multi-step workflow execution unreliable absent retrieval.
- With code-aware RAG: Uniform accuracy in the 97–99% range across all libraries and LLM provider generations, closing the gap even for highly niche toolchains.
- Cross-model trends: RAG produces 8–13 percentage-point gains regardless of model generation; newer LLMs show higher baselines, yet the gain from retrieval remains robust.
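The compounding argument behind these numbers is simple probability: if per-step errors are independent, an n-step workflow succeeds only if every step does. A worked example using the reported per-step accuracies and an assumed (illustrative) 20-step workflow:

```python
def workflow_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step of an n-step workflow succeeds,
    assuming independent per-step API errors."""
    return per_step_accuracy ** steps

# Illustrative 20-step workflow with the reported per-step accuracies:
low = workflow_success(0.76, 20)   # jobflow-remote without RAG: ~0.4% end-to-end
high = workflow_success(0.98, 20)  # typical with code-aware RAG: ~67% end-to-end
```

A 76%-accurate agent is effectively useless over long horizons, while 97–99% per-step accuracy makes multi-step execution practical—this is why closing the last 10–20 points matters far more than the raw numbers suggest.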
Structure-aware chunking (tree-sitter or equivalent) in retrieval consistently outperforms naïve fixed-width or purely AST-based splits by 1–3%, with BM25 performing best on API-keyword lookups.
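The idea behind structure-aware chunking is to split source at syntactic boundaries so each retrieved chunk is a complete unit. Tree-sitter is an external dependency, so this sketch uses Python's built-in `ast` module as a stand-in (for Python source only; the paper's pipeline is language-agnostic via tree-sitter):

```python
import ast

def chunk_by_definition(source: str) -> list[str]:
    """Split Python source at top-level statement boundaries, so each chunk
    holds a whole function or class rather than an arbitrary fixed-width slice."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        # lineno/end_lineno are 1-indexed and inclusive
        chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks
```

Compared with fixed-width splits, no chunk ever starts mid-function, so a BM25 hit on an API name retrieves the full definition and its signature together.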
Implications and Future Directions
Immediate Practical Impact:
MatClaw demonstrates that code-first, RAG-augmented LLM agents can autonomously execute multi-day, complex computational workflows in materials science with high reliability—provided the domain expert tailors workflow-level constraints and supplies (or directs to extract) critical tacit knowledge. This partnership dramatically accelerates tasks like parameter-space exploration and active learning, which scale poorly with human bandwidth but are routine for computational agents.
Theoretical Implications:
The results pinpoint the separation between codified, documentable domain knowledge (which LLMs ingest and operationalize at near-human or better accuracy with retrieval) and tacit, folklore-level expertise. The latter remains the key bottleneck for full autonomy. As the code-first paradigm matures, the human–agent interaction mode will increasingly be one of "guided autonomy"—the domain scientist imparts high-level methodology, while the agent performs exhaustive, adaptive, and error-resilient execution.
Pathways for Future Work:
- Expansion to new domains: The architecture is library- and backend-agnostic; extension to adjacent scientific fields is immediate.
- Reducing reliance on prompt engineering: Agents may self-improve by autonomously synthesizing best practices from literature and collaborative logbooks, closing remaining autonomy gaps.
- Adaptive tacit knowledge acquisition: Integration with broader scientific corpus mining and data-driven protocol synthesis could further overcome bottlenecks of underdocumented expertise.
- Evolving LLM capabilities: As inherent LLM accuracy on domain APIs continues to improve, the proportion of agent failures attributable to pure model knowledge deficits will decrease, increasing the scope of fully autonomous execution.
Conclusion
MatClaw exemplifies a robust, code-first LLM agent for computational materials research, leveraging retrieval-augmented generation, persistent architectural memory, and minimal but targeted human intervention to deliver reliable, long-horizon workflow execution. The demarcation between guided and fully autonomous scientific discovery is narrowing, and the architecture provides a blueprint for scalable, flexible agentic research frameworks applicable across computational science. The results both support immediate acceleration of systematic studies and inform the future trajectory of agentic autonomy grounded in code proficiency and assimilated scientific methodology.
Reference: "MatClaw: An Autonomous Code-First LLM Agent for End-to-End Materials Exploration" (2604.02688)