WALT: Web Agents that Learn Tools

This presentation explores WALT, an innovative framework that transforms brittle web agents into robust tool-users by reverse-engineering website functionality into deterministic, schema-validated tools. By shifting the focus from low-level UI clicks to high-level operations like searching and posting, the authors demonstrate how agents can achieve state-of-the-art success rates with significantly higher efficiency.
Script
Why do even the smartest AI agents struggle when a website simply moves a button or changes its layout? Today we explore WALT, a framework that solves this by teaching agents to see websites as sets of functional tools rather than just collections of pixels and code.
Traditional agents fixate on the low-level mechanics of clicking and typing, which makes them highly sensitive to minor design shifts. Because each task depends on long chains of UI reasoning, a single missing selector can cause the entire task to fail.
The authors propose shifting the paradigm from manual UI navigation to high-level tool execution.
Instead of guessing where to click, WALT treats high-level functions like searching or filtering as callable tools. These tools use validated schemas and action scripts that abstract away the messy details of the underlying web interface.
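To make the idea concrete, here is a minimal sketch of what a schema-validated tool call could look like. The tool name, parameter names, and validation helper are illustrative assumptions, not WALT's actual API; the point is that malformed calls are rejected before any browser action runs.

```python
# Illustrative sketch (names are assumptions, not WALT's real interface):
# a tool pairs a typed parameter schema with an action script, so the
# agent invokes "search" rather than reasoning about individual clicks.

SEARCH_TOOL_SCHEMA = {
    "name": "search",
    "params": {
        "query": {"type": str, "required": True},
        "max_results": {"type": int, "required": False},
    },
}

def validate_call(schema, args):
    """Reject malformed tool calls before any web action is executed."""
    for name, spec in schema["params"].items():
        if spec["required"] and name not in args:
            raise ValueError(f"missing required parameter: {name}")
        if name in args and not isinstance(args[name], spec["type"]):
            raise TypeError(f"parameter {name} must be {spec['type'].__name__}")
    return True
```

A valid call such as `validate_call(SEARCH_TOOL_SCHEMA, {"query": "laptops"})` passes, while a call missing `query` fails fast instead of producing a broken UI trajectory.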
Moving beyond static scripts, WALT follows a robust pipeline that first discovers potential site functions and then validates them through an offline two-agent loop. This process ensures that only tools that meet strict correctness and efficiency standards are ever exposed to the agent at runtime.
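The validate-before-expose idea can be sketched in a few lines. The structure below is a paraphrase of the described pipeline, not the authors' code: a candidate tool is run repeatedly offline, a checker scores each outcome, and the tool is registered only if it meets a success threshold.

```python
# Hedged sketch of an offline validation loop (my paraphrase of the
# described pipeline, not the authors' implementation): only candidate
# tools that pass repeated checked trials are exposed to the agent.

def validate_tool(run_tool, check_result, trials=5, min_success=1.0):
    """Run a candidate tool several times; approve it only if the
    fraction of trials passing the checker meets the threshold."""
    successes = sum(1 for _ in range(trials) if check_result(run_tool()))
    return successes / trials >= min_success
```

Setting `min_success=1.0` mirrors the strict standard described in the script: a tool that fails even one offline trial is never shown to the runtime agent.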
To further improve reliability, the framework uses a key mechanism called URL promotion, where multi-step UI interactions are replaced by direct URL parameter operations. They also implement selector stabilization to protect against minor site updates, using agentic steps only as a last resort.
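URL promotion can be illustrated with a short sketch: a search that would otherwise take several UI steps (open the search box, type, submit) collapses into one deterministic URL. The base URL and parameter names below are assumptions for the example, not taken from the paper.

```python
from urllib.parse import urlencode

# Hypothetical illustration of URL promotion: replace a multi-step UI
# interaction with a single parameterized URL. The domain and query
# parameter names are assumed for this sketch.

SEARCH_URL_TEMPLATE = "https://example-shop.test/search?{params}"

def promoted_search_url(query, category=None):
    """Build the result-page URL directly instead of driving the UI."""
    params = {"q": query}
    if category is not None:
        params["category"] = category
    return SEARCH_URL_TEMPLATE.format(params=urlencode(params))
```

Because the URL encodes the operation deterministically, this path is immune to moved buttons or renamed CSS classes; agentic UI steps remain only as the fallback the script mentions.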
Let's examine how these architectural choices translate into real-world performance gains.
The results are impressive, with the authors reporting state-of-the-art performance on major benchmarks including WebArena and VisualWebArena. Not only did the agents succeed more often, but they did so 1.4 times faster by skipping unnecessary UI reasoning.
Despite these gains, the researchers acknowledge limitations such as the high offline cost of tool discovery and the ongoing challenge of anti-automation measures. Future work focuses on online tool patching to handle site redesigns as they happen.
WALT demonstrates that by treating web functions as deterministic tools, we can create agents that are both more reliable and more efficient than those navigating by pixels alone. To dive deeper into the data, visit EmergentMind.com.