A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
Abstract: Pre-trained LLMs have recently achieved better generalization and sample efficiency in autonomous web automation. However, performance on real-world websites still suffers from (1) open domainness, (2) limited context length, and (3) a lack of inductive bias on HTML. We introduce WebAgent, an LLM-driven agent that learns from self-experience to complete tasks on real websites following natural language instructions. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via Python programs generated from those sub-instructions and snippets. We design WebAgent with Flan-U-PaLM, for grounded code generation, and HTML-T5, a new pre-trained LLM for long HTML documents that uses local and global attention mechanisms and a mixture of long-span denoising objectives, for planning and summarization. We empirically demonstrate that our modular recipe improves the success rate on real websites by over 50%, and that HTML-T5 is the best model for solving various HTML understanding tasks, achieving an 18.7% higher success rate than prior methods on the MiniWoB web automation benchmark and state-of-the-art performance on Mind2Web, an offline task planning evaluation.
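The modular loop described in the abstract — decompose the instruction into a sub-instruction, condense the long HTML page into a relevant snippet, then synthesize an executable program — can be sketched as follows. This is a minimal illustration, not the paper's implementation: `plan_and_summarize` and `synthesize_program` stand in for calls to HTML-T5 and Flan-U-PaLM respectively, and all names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    sub_instruction: str  # canonical sub-instruction predicted by the planner
    snippet: str          # task-relevant HTML snippet extracted by the summarizer


def plan_and_summarize(instruction: str, raw_html: str, history: list[str]) -> Step:
    """Stand-in for HTML-T5: jointly decomposes the instruction into the next
    sub-instruction and condenses the raw page into a short snippet."""
    sub = f"<next sub-instruction for: {instruction!r}, step {len(history)}>"
    snippet = raw_html[:500]  # placeholder for the model-selected snippet
    return Step(sub, snippet)


def synthesize_program(step: Step) -> str:
    """Stand-in for Flan-U-PaLM: emits an executable Python program (e.g. a
    Selenium script) grounded in the snippet and sub-instruction."""
    return f"# program grounded in snippet ({len(step.snippet)} chars)\n# goal: {step.sub_instruction}"


def run_episode(instruction: str, pages: list[str]) -> list[str]:
    """Iterate plan -> summarize -> synthesize over a sequence of pages."""
    history: list[str] = []
    programs: list[str] = []
    for html in pages:
        step = plan_and_summarize(instruction, html, history)
        programs.append(synthesize_program(step))
        history.append(step.sub_instruction)
    return programs
```

The key design point the sketch reflects is the division of labor: the long-context model never writes code, and the code model never sees the full raw HTML, only the condensed snippet.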
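The "mixture of long-span denoising objectives" used to pre-train HTML-T5 follows the T5-style span-corruption family: contiguous token spans are replaced by sentinel tokens in the input, and the model reconstructs them; mixing several span-length settings (including long spans) forms the mixture. A deterministic toy version, with illustrative `span_len` and `stride` values that are not the paper's actual hyperparameters, looks like this:

```python
def long_span_corrupt(tokens: list[str], span_len: int = 4, stride: int = 10):
    """Toy span-corruption: replace every `stride`-th window of `span_len`
    tokens with a sentinel in the encoder input, and emit the masked spans
    (prefixed by their sentinels) as the decoder target."""
    inputs: list[str] = []
    targets: list[str] = []
    sentinel = 0
    i = 0
    while i < len(tokens):
        if i % stride == 0 and i + span_len <= len(tokens):
            # Mask this span: sentinel goes to the input, span goes to the target.
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            targets.extend(tokens[i:i + span_len])
            sentinel += 1
            i += span_len
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets
```

In actual pre-training the span positions are sampled randomly and the mixture combines short- and long-mean span lengths, which is what biases the model toward the long, locally structured inputs typical of HTML documents.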