- The paper presents WildClawBench, a benchmark that evaluates long-horizon autonomous agents using diverse, reproducible, multimodal tasks in native Docker environments.
- It employs a four-stage curation pipeline and a hybrid evaluation approach combining rule-based checks with semantic verification across six task categories.
- Evaluation results reveal significant performance gaps across agent harnesses, highlighting issues in cross-modal reasoning, time management, and safety compliance.
WildClawBench: Comprehensive Evaluation of Long-Horizon Native-Runtime Agents
Problem Statement and Motivation
The surge in autonomous agents powered by large language models and vision-language models, deployed via CLI-based harnesses, has created a need for robust benchmarks that address realistic, long-horizon workflows. Existing benchmarks primarily rely on synthetic sandboxes, short-horizon tasks, mock APIs, and final-answer checks, and so fail to capture the complexity of native agent deployment. These deficiencies prevent accurate assessment of agentic reasoning, tool orchestration, safety alignment, and trajectory-level auditability. WildClawBench was introduced to bridge this gap: it assembles 60 human-authored, bilingual, multimodal tasks that run inside reproducible Docker containers with access to genuine tools and are evaluated via a hybrid verification protocol (2605.10912).
Benchmark Design and Methodology
WildClawBench spans six categories: Productivity Flow, Code Intelligence, Social Interaction, Search & Retrieval, Creative Synthesis, and Safety Alignment. Tasks are constructed to enforce multi-step, cross-tool planning and real-world artifact verification. Each task specification includes an agent-facing prompt, the expected behavior, a time budget (averaging 881 s; per-task wall-clock durations are approximately 8 minutes), input assets, and an individually tailored rubric. The agent operates within an isolated Docker container, interacting natively with shells, browsers, file systems, email clients, and optional skill extensions.
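As an illustration of the task specification described above, the following is a minimal sketch of how one task record might be represented. The class name `TaskSpec`, its field names, and the example values are hypothetical and are not taken from the benchmark's actual schema; only the six category names and the listed components come from the text.

```python
from dataclasses import dataclass, field

# The six categories named in the text.
CATEGORIES = (
    "Productivity Flow",
    "Code Intelligence",
    "Social Interaction",
    "Search & Retrieval",
    "Creative Synthesis",
    "Safety Alignment",
)


@dataclass
class TaskSpec:
    """Hypothetical container for one task; field names are illustrative."""

    task_id: str
    category: str
    prompt: str             # agent-facing prompt
    expected_behavior: str  # what a correct run should accomplish
    time_budget_s: int      # wall-clock budget in seconds
    input_assets: list[str] = field(default_factory=list)
    rubric: list[str] = field(default_factory=list)  # task-specific criteria

    def __post_init__(self) -> None:
        # Reject records that cannot belong to the benchmark as described.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if self.time_budget_s <= 0:
            raise ValueError("time budget must be positive")


# Made-up example record; 881 s is the average budget quoted in the text.
example = TaskSpec(
    task_id="demo-001",
    category="Code Intelligence",
    prompt="Fix the failing unit test in the repository.",
    expected_behavior="All tests pass and no unrelated files are modified.",
    time_budget_s=881,
    input_assets=["repo.tar.gz"],
    rubric=["tests pass", "workspace is clean"],
)
```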
A four-stage curation pipeline ensures ecological validity, discriminability, and auditability:
- Task authoring: Drafting tasks requiring long-horizon workflows and verifiable environment effects.
- Reference answer construction: Human experts produce reference solutions and precise grading rubrics.
- Filtering: Tasks showing insufficient discriminability across pilot models (<0.2 score gap) or ambiguous evaluation criteria are discarded.
- Refinement: Iterative adjustment of prompts, rubrics, graders, and asset complexity.
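The filtering stage can be sketched as a simple threshold test. This sketch assumes that the "score gap" is the difference between the best and worst pilot-model score on a 0 to 1 scale; the summary does not define the gap precisely, so that definition, the function names, and the sample scores are all assumptions. The check for ambiguous evaluation criteria is a human judgment and is not modeled here.

```python
MIN_SCORE_GAP = 0.2  # threshold quoted in the text


def score_gap(pilot_scores: dict[str, float]) -> float:
    """Spread between the best and worst pilot model (assumed definition)."""
    if len(pilot_scores) < 2:
        raise ValueError("need scores from at least two pilot models")
    values = pilot_scores.values()
    return max(values) - min(values)


def filter_tasks(
    candidates: dict[str, dict[str, float]],
    min_gap: float = MIN_SCORE_GAP,
) -> list[str]:
    """Keep only tasks whose pilot scores differ by at least min_gap."""
    return [
        task_id
        for task_id, scores in candidates.items()
        if score_gap(scores) >= min_gap
    ]


# Made-up pilot results for two hypothetical tasks and models.
candidates = {
    "task-a": {"model-x": 0.35, "model-y": 0.80},  # gap 0.45: kept
    "task-b": {"model-x": 0.60, "model-y": 0.70},  # gap 0.10: discarded
}
kept = filter_tasks(candidates)
print(kept)  # ['task-a']
```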
The dataset comprises 36 English and 24 Chinese tasks, of which 26 are multimodal and 34 are text-only, reflecting realistic operational diversity.
Evaluation Protocol
WildClawBench employs a hybrid evaluation strategy:
- Rule-based checks: Deterministic validation of file existence, format, numerical correctness, and workspace cleanliness.
- Environment-state audits: Verification of execution side effects for tool-based actions (email, calendar, chat), including explicit refusal of malicious instructions.
- LLM/VLM-as-judge: Semantic verification for complex artifacts (narratives, images, video clips, or detection of prompt injections), implemented with GPT 5.4 as a proxy judge. To prevent asset leakage, grading-only assets are mounted only after the agent has finished executing.
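To make the first channel concrete, here is a minimal sketch of a deterministic rule-based check covering file existence, format, and numerical correctness. It is not the benchmark's grader: the artifact name `report.json`, the `total` field, the tolerance, and the function name `check_report` are hypothetical choices made for illustration.

```python
import json
import math
import tempfile
from pathlib import Path


def check_report(
    workspace: Path, expected_total: float, rel_tol: float = 1e-6
) -> dict[str, bool]:
    """Run three deterministic checks on a hypothetical report.json artifact."""
    report = workspace / "report.json"  # hypothetical artifact name
    results = {
        "exists": report.is_file(),
        "valid_json": False,
        "total_correct": False,
    }
    if not results["exists"]:
        return results
    try:
        data = json.loads(report.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return results  # file exists but fails the format check
    results["valid_json"] = True
    total = data.get("total") if isinstance(data, dict) else None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(total, (int, float)) and not isinstance(total, bool):
        results["total_correct"] = math.isclose(
            total, expected_total, rel_tol=rel_tol
        )
    return results


# Usage: simulate an agent workspace containing a correct artifact.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    (workspace / "report.json").write_text('{"total": 42.0}', encoding="utf-8")
    outcome = check_report(workspace, expected_total=42.0)
    print(outcome)
```

Checks of this kind are cheap and reproducible, which is why they pair naturally with the judge-based channel that handles artifacts no fixed rule can score.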
Agents are evaluated under four harnesses: OpenClaw, Claude Code, Codex, and Hermes Agent, with locked tool schemas and system prompts to isolate model variance.
Quantitative Results and Analysis
WildClawBench exposes substantial headroom in agentic systems. The top model, Claude Opus 4.7, achieves only 62.2% overall task completion on OpenClaw; every other model falls below 60%, and the lowest score is 19.3%. Multimodal workflows consistently lag behind pure-text ones (Claude Opus 4.7: 58.5% vs. 65.0%; GPT 5.4: 40.2% vs. 58.0%), indicating persistent limitations in cross-modal tool use and visual grounding.
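The size of the modality gap follows directly from the figures quoted above. The short sketch below only restates that arithmetic; the variable names are illustrative and no numbers beyond those in the text are used.

```python
# (multimodal %, pure-text %) as quoted in the text.
scores = {
    "Claude Opus 4.7": (58.5, 65.0),
    "GPT 5.4": (40.2, 58.0),
}

# Gap in percentage points, rounded to one decimal to hide float noise.
gaps = {
    model: round(text_only - multimodal, 1)
    for model, (multimodal, text_only) in scores.items()
}

for model, gap in gaps.items():
    print(f"{model}: {gap} points")
# Claude Opus 4.7: 6.5 points
# GPT 5.4: 17.8 points
```

On these two models the multimodal penalty differs by more than a factor of two, so the gap is model-dependent rather than a fixed cost of the multimodal tasks.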
Harness selection significantly affects performance: the same underlying model can vary by up to 18 points across harnesses (e.g., MiMo V2. Pro). Latency-bound harnesses (Claude Code) frequently cause timeouts and score regressions, underscoring how sensitive results are to control-loop design, tool schemas, and context management.
Domain-specific breakdowns reveal divergent strengths. Claude Opus 4.7 dominates Productivity Flow, Code Intelligence, and Safety Alignment tasks; GPT 5.5 performs best on Search & Retrieval; DeepSeek V4. Pro leads Social Interaction. This suggests axes of agentic specialization that aggregate scores hide.
Augmenting agents with domain-specific skills boosts Code Intelligence and Creative Synthesis across all models, but improvements elsewhere depend on baseline capability and skill relevance. Scaling the time budget produces a sharp drop when budgets are tightened and diminishing returns when they are extended, suggesting that current agents do not use additional compute efficiently for environment interaction.
Failure-mode analysis identifies the dominant causes: wrong or partial artifacts (often produced but failing rubric compliance), timeouts or hung processes, safety violations, and missing artifacts. Process-level breakdowns point to time-budget exhaustion, debugging loops, toolchain disruption, and semantic misses, emphasizing the importance of robust execution monitoring.
Bilingual analysis indicates higher performance on English prompts, with gaps varying by model (up to 7.4 points for MiniMax M2.7). Variance across repeated runs is low, validating the stability of the evaluation setup.
Practical and Theoretical Implications
WildClawBench sets a new standard for ecological validity, trajectory-level auditability, and reproducible agentic evaluation. The per-category results show that agent performance on long-horizon tasks is not purely a function of model intelligence, but depends critically on harness design, environmental orchestration, and skill integration.
From a practical standpoint, WildClawBench's reproducible containers and hybrid verification uncover concrete failure modes, tool-use profiles, and safety limitations of deployed agents. This enables precise diagnosis and targeted improvement before production deployment. The inclusion of adversarial safety challenges (e.g., prompt injection, credential leaks, destructive shell commands) provides a rigorous stress test for safety alignment.
Theoretically, WildClawBench demonstrates that agentic benchmarking requires multi-channel, environment-grounded evaluation that extends well beyond answer correctness. The artifacts, trajectory, and execution environment collectively constitute the evaluated system; future agent development must focus on optimizing orchestration efficiency, cross-modal reasoning, and proactive safety.
Future Directions
The benchmark leaves two notable coverage gaps: multi-turn interaction protocols, and expansion into GUI-heavy workflows and specialist domains (biology, finance, law). Scaling task diversity, expanding language coverage, and benchmarking incremental agent updates via continuous signals remain open avenues. Further, integrating probabilistic uncertainty estimation, more advanced adjudication for subjective content, and automated root-cause failure analysis could deepen diagnostic power.
There is also scope for developing harness-agnostic yet environment-faithful agentic benchmarks, coupled with finer-grained trajectory auditing, adversarial robustness evaluation, and post-deployment calibration tracking.
Conclusion
WildClawBench delivers a rigorously constructed, native-runtime benchmark for comprehensive long-horizon agent evaluation, revealing substantial unsolved challenges in autonomous orchestration, multimodal reasoning, harness integration, and safety alignment. Its results demonstrate the need for holistic, environment-grounded evaluation methods. The benchmark is positioned to support reproducible progress measurement, precise failure mode exposure, and informed development of next-generation agent frameworks.