A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
Abstract: Pre-trained LLMs have recently achieved better generalization and sample efficiency in autonomous web automation. However, performance on real-world websites still suffers from (1) open domainness, (2) limited context length, and (3) a lack of inductive bias on HTML. We introduce WebAgent, an LLM-driven agent that learns from self-experience to complete tasks on real websites following natural language instructions. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via Python programs generated from those sub-instructions and snippets. We design WebAgent with Flan-U-PaLM, for grounded code generation, and HTML-T5, a new pre-trained LLM for long HTML documents that uses local and global attention mechanisms and a mixture of long-span denoising objectives, for planning and summarization. We empirically demonstrate that our modular recipe improves the success rate on real websites by over 50%, and that HTML-T5 is the best model for solving various HTML understanding tasks, achieving an 18.7% higher success rate than prior methods on the MiniWoB web automation benchmark and state-of-the-art performance on Mind2Web, an offline task planning evaluation.
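The modular loop described in the abstract — decompose the instruction into a sub-instruction, condense the long HTML page into a relevant snippet, then synthesize an executable program — can be sketched as follows. This is a minimal illustration, not the paper's implementation: `plan_and_summarize` and `synthesize_program` stand in for calls to HTML-T5 and Flan-U-PaLM respectively, and all names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    sub_instruction: str  # canonical sub-instruction predicted by the planner
    snippet: str          # task-relevant HTML snippet extracted by the summarizer


def plan_and_summarize(instruction: str, raw_html: str, history: list[str]) -> Step:
    """Stand-in for HTML-T5: jointly decomposes the instruction into the next
    sub-instruction and condenses the raw page into a short snippet."""
    sub = f"<next sub-instruction for: {instruction!r}, step {len(history)}>"
    snippet = raw_html[:500]  # placeholder for the model-selected snippet
    return Step(sub, snippet)


def synthesize_program(step: Step) -> str:
    """Stand-in for Flan-U-PaLM: emits an executable Python program (e.g. a
    Selenium script) grounded in the snippet and sub-instruction."""
    return f"# program grounded in snippet ({len(step.snippet)} chars)\n# goal: {step.sub_instruction}"


def run_episode(instruction: str, pages: list[str]) -> list[str]:
    """Iterate plan -> summarize -> synthesize over a sequence of pages."""
    history: list[str] = []
    programs: list[str] = []
    for html in pages:
        step = plan_and_summarize(instruction, html, history)
        programs.append(synthesize_program(step))
        history.append(step.sub_instruction)
    return programs
```

The key design point the sketch reflects is the division of labor: the long-context model never writes code, and the code model never sees the full raw HTML, only the condensed snippet.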
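The "mixture of long-span denoising objectives" used to pre-train HTML-T5 follows the T5-style span-corruption family: contiguous token spans are replaced by sentinel tokens in the input, and the model reconstructs them; mixing several span-length settings (including long spans) forms the mixture. A deterministic toy version, with illustrative `span_len` and `stride` values that are not the paper's actual hyperparameters, looks like this:

```python
def long_span_corrupt(tokens: list[str], span_len: int = 4, stride: int = 10):
    """Toy span-corruption: replace every `stride`-th window of `span_len`
    tokens with a sentinel in the encoder input, and emit the masked spans
    (prefixed by their sentinels) as the decoder target."""
    inputs: list[str] = []
    targets: list[str] = []
    sentinel = 0
    i = 0
    while i < len(tokens):
        if i % stride == 0 and i + span_len <= len(tokens):
            # Mask this span: sentinel goes to the input, span goes to the target.
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            targets.extend(tokens[i:i + span_len])
            sentinel += 1
            i += span_len
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets
```

In actual pre-training the span positions are sampled randomly and the mixture combines short- and long-mean span lengths, which is what biases the model toward the long, locally structured inputs typical of HTML documents.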