ClawBench: Can AI Agents Complete Everyday Online Tasks?
This presentation examines ClawBench, a groundbreaking benchmark that evaluates AI agents on real-world web tasks involving live websites and consequential actions. Unlike controlled testbeds where frontier models achieve 65-75% success, the best agent reaches only 33% on ClawBench's 153 tasks spanning shopping, booking, finance, and job applications. The talk explores the benchmark's innovative HTTP interception safety mechanism, its five-layer behavioral recording system, and the agentic evaluator that compares agent trajectories against human demonstrations. The results reveal a critical gap: existing language model agents cannot reliably handle the dynamic complexity, authentication flows, and write-heavy interactions of everyday online workflows.
Script
The best AI agents today score 65 to 75 percent on standard web benchmarks, suggesting they're nearly ready to automate your online tasks. But when researchers tested them on real websites with real consequences, the success rate collapsed to 33 percent. ClawBench reveals why.
Previous benchmarks test agents in sanitized sandboxes with static pages and read-only tasks. ClawBench operates on production websites where agents must authenticate, navigate evolving interfaces, and execute actions with real-world consequences. The authors built tasks that mirror what humans actually do online, not what's easy to evaluate.
Testing consequential actions on live platforms creates an obvious problem: you can't let the agent actually submit job applications or make purchases.
The researchers engineered a Chrome extension that captures and blocks only the irreversible endpoint: the final request before money moves or data is submitted. Everything leading up to that point runs unmodified on the real website. This precision lets agents operate in authentic environments while guaranteeing safety, and the five logging layers create a complete behavioral record for diagnosis.
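As a rough illustration of the idea, and not ClawBench's actual implementation, a blocking webRequest listener in a Chrome extension can capture a request body and cancel it before it leaves the browser. The endpoint pattern below is a hypothetical placeholder.

```typescript
// Minimal sketch, assuming a Manifest V2-style blocking webRequest listener.
// The URL pattern and logging sink are hypothetical, not ClawBench's code.
chrome.webRequest.onBeforeRequest.addListener(
  (details) => {
    // Capture the outgoing payload for the behavioral record...
    console.log("intercepted payload", details.requestBody);
    // ...then cancel it so the irreversible action never reaches the server.
    return { cancel: true };
  },
  { urls: ["*://*.example-jobs.com/api/applications/submit*"] },
  ["blocking", "requestBody"]
);
```

Every other request passes through untouched, which is what preserves the live-site fidelity the benchmark depends on.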
Evaluation doesn't just check final outcomes. The agentic evaluator compares the agent's entire trajectory against a human reference across session video, screenshots, HTTP payloads, reasoning traces, and browser actions. A Claude Code sub-agent applies a structured rubric to align actions step by step, checking field schemas and payload correctness. This produces not just a pass or fail label, but explicit justifications showing exactly where and why the agent diverged from correct behavior.
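To make one piece of that concrete, here is a hedged sketch of what a single payload-correctness check could look like, comparing the agent's intercepted payload against the human reference field by field. The types and function are illustrative assumptions; in ClawBench the rubric is applied by a Claude Code sub-agent over the full trajectory.

```typescript
// Illustrative sketch only: a field-level payload check against the human reference.
// ClawBench's actual rubric is applied agentically across all five recording layers.
interface RubricFinding {
  field: string;
  passed: boolean;
  justification: string;
}

function checkPayloadFields(
  agentPayload: Record<string, unknown>,
  referencePayload: Record<string, unknown>
): RubricFinding[] {
  return Object.keys(referencePayload).map((field) => {
    const expected = JSON.stringify(referencePayload[field]);
    const missing = !(field in agentPayload);
    const actual = missing ? "nothing" : JSON.stringify(agentPayload[field]);
    const passed = !missing && actual === expected;
    return {
      field,
      passed,
      justification: passed
        ? `field "${field}" matches the human reference`
        : `field "${field}" diverges: expected ${expected}, got ${actual}`,
    };
  });
}
```

A deterministic check like this covers only one layer; the benchmark's evaluator also aligns browser actions, screenshots, session video, and reasoning traces, which is why the rubric is applied by an agent rather than a fixed comparator.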
The strongest agent, Claude Sonnet 4.6, achieves 33 percent on ClawBench compared to over 65 percent on established benchmarks. Other frontier models fare worse: GPT-5.4 barely reaches 7 percent, and some nearly fail outright. The gap exposes categorical deficiencies. Agents struggle with dynamic layouts, authentication sequences, and the cognitive demands of assembling correct payloads in the right order. Saturation on controlled benchmarks does not transfer to the messy reality of live web platforms.
ClawBench proves that today's AI agents are not ready to reliably handle your everyday online workflows. The benchmark's live-site fidelity and surgical safety design establish a new standard for evaluating agentic systems in the wild. Visit EmergentMind.com to explore the research and create your own video presentations.