Beyond Browsing: API-Based Web Agents

Published 21 Oct 2024 in cs.CL and cs.MA | (2410.16464v3)

Abstract: Web browsers are a portal to the internet, where much of human activity is undertaken. Thus, there has been significant research work in AI agents that interact with the internet through web browsing. However, there is also another interface designed specifically for machine interaction with online content: application programming interfaces (APIs). In this paper we ask -- what if we were to take tasks traditionally tackled by Browsing Agents, and give AI agents access to APIs? To do so, we propose two varieties of agents: (1) an API-calling agent that attempts to perform online tasks through APIs only, similar to traditional coding agents, and (2) a Hybrid Agent that can interact with online data through both web browsing and APIs. In experiments on WebArena, a widely-used and realistic benchmark for web navigation tasks, we find that API-Based Agents outperform web Browsing Agents. Hybrid Agents out-perform both others nearly uniformly across tasks, resulting in a more than 24.0% absolute improvement over web browsing alone, achieving a success rate of 38.9%, the SOTA performance among task-agnostic agents. These results strongly suggest that when APIs are available, they present an attractive alternative to relying on web browsing alone.


Summary

The paper introduces API-Based and Hybrid Agents, which perform web tasks through API calls instead of, or in addition to, traditional web browsing.
The paper demonstrates that the Hybrid Agent, which switches adaptively between API calls and browsing, improves the success rate by more than 24.0% absolute over web browsing alone, reaching 38.9%.
The paper evaluates these agents on the WebArena benchmark, showing stronger task performance on platforms with extensive, well-documented APIs such as GitLab.


Introduction

The paper "Beyond Browsing: API-Based Web Agents" investigates whether AI agents perform web-based tasks better when they can call APIs rather than rely only on a web browsing interface. It develops and evaluates two agents: an API-calling agent and a Hybrid Agent. Because APIs are designed specifically for machine interaction, they offer agents a more direct route to online content than navigating a graphical interface.

Agent Types and Implementation

The research introduces two main varieties of agents: the API-Based Agent and the Hybrid Agent.

API-Based Agent: Focused solely on interacting with web applications through APIs, this agent generates and executes code to perform tasks. Using the Python requests library, the agent makes HTTP requests to communicate with web services, bypassing the need for GUI manipulation.
Hybrid Agent: Combines web browsing and API interactions, allowing the agent to choose the more effective approach for each task. This versatility ensures that the agent can switch between API calls and traditional browsing as the task requires.

The Hybrid Agent chooses each action based on the available API documentation and the current state of the web interface. Because it has access to the API documentation, it can look up and call the necessary endpoints when required.
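To make the API-Based Agent's approach concrete, the following is a minimal sketch of the kind of request such an agent might construct for GitLab's "list repository commits" endpoint. It is an illustration under stated assumptions, not the paper's code: the agent described above uses the Python requests library, whereas this sketch uses the standard-library urllib so that it runs without extra dependencies. The function name `build_commits_request`, the host, and the token are hypothetical, and the request is only built, never sent.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_commits_request(base_url: str, project_id: str, token: str,
                          per_page: int = 20) -> Request:
    """Build (but do not send) a GET request for a project's commits."""
    # GitLab accepts a numeric ID or a URL-encoded "namespace/project" path.
    encoded_id = quote(str(project_id), safe="")
    query = urlencode({"per_page": per_page})
    url = (f"{base_url.rstrip('/')}/api/v4/projects/"
           f"{encoded_id}/repository/commits?{query}")
    # PRIVATE-TOKEN is GitLab's personal access token header.
    return Request(url, headers={"PRIVATE-TOKEN": token}, method="GET")


if __name__ == "__main__":
    # Hypothetical host and token, for illustration only.
    req = build_commits_request("http://gitlab.example", "group/project", "TOKEN")
    print(req.get_method(), req.full_url)
```

A single structured request like this stands in for what would otherwise be several page loads and clicks in a browser, which is the contrast the paper draws between the two interfaces.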

Experimental Evaluation

Experiments were conducted using the WebArena benchmark, which simulates real-world web-navigation tasks. Results indicate that API-Based Agents outperform traditional Browsing Agents, especially on websites with comprehensive API support. The Hybrid Agent improved the success rate by more than 24.0% absolute over web browsing alone, reaching 38.9%, the best reported result among task-agnostic agents.

Figure 1: A comparison of three types of agents. The Browsing Agent performs tasks through web browsing only, while the Hybrid Agent and API-Based Agent execute code and API calls respectively.

Further analysis categorizes API availability into three levels ("good", "medium", and "poor") based on the number of endpoints and the quality of the documentation. For websites like GitLab, with extensive API support, the API-Based Agent shows a marked improvement in task performance.

Technical Insights

Both agents handle API documentation in one of two ways, depending on the size of the API set. For small API sets, the documentation is embedded directly in the agent's prompt. Larger API sets require a two-stage retrieval process, in which detailed API documentation is fetched as needed during task execution.
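The two documentation strategies can be sketched roughly as follows. This is a simplified illustration, not the paper's implementation: the documentation store `API_DOCS`, the character budget, and every function name here are hypothetical. If the full documentation fits the budget it goes into the prompt directly. Otherwise the prompt receives only a short index of endpoints, and the agent requests details for a specific endpoint as a second stage.

```python
from typing import Dict

# Toy documentation store; the entries are illustrative, not the real docs.
API_DOCS: Dict[str, Dict[str, str]] = {
    "GET /projects/{id}/repository/commits": {
        "summary": "List the commits of a project.",
        "details": "Path parameter: id. Returns a JSON list of commits.",
    },
    "GET /projects/{id}/issues": {
        "summary": "List the issues of a project.",
        "details": "Path parameter: id. Returns a JSON list of issues.",
    },
}


def full_documentation(docs: Dict[str, Dict[str, str]]) -> str:
    """Render every endpoint with its summary and details."""
    return "\n".join(
        f"{endpoint}\n  {entry['summary']}\n  {entry['details']}"
        for endpoint, entry in sorted(docs.items())
    )


def endpoint_index(docs: Dict[str, Dict[str, str]]) -> str:
    """Stage 1: a one-line summary per endpoint, without details."""
    return "\n".join(
        f"{endpoint}: {entry['summary']}"
        for endpoint, entry in sorted(docs.items())
    )


def documentation_for_prompt(docs: Dict[str, Dict[str, str]],
                             max_chars: int) -> str:
    """Embed the full docs if they fit the budget, else only the index."""
    full = full_documentation(docs)
    return full if len(full) <= max_chars else endpoint_index(docs)


def get_api_documentation(docs: Dict[str, Dict[str, str]],
                          endpoint: str) -> str:
    """Stage 2: fetch the detailed entry for one endpoint on demand."""
    entry = docs.get(endpoint)
    if entry is None:
        return f"No documentation found for {endpoint!r}."
    return f"{endpoint}\n{entry['details']}"
```

In this sketch, an agent working from the index would call `get_api_documentation` only for the endpoints it intends to use, which keeps the prompt short when the API set is large.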

Figure 2: An example of API documentation showing how to get commits of a project.

Crucial to the Hybrid Agent's design is the decision it makes at each step: whether to browse the web, make API calls, or ask for clarification in natural language.
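The per-step choice can be pictured as a dispatch loop over action types. The sketch below is an assumption-laden illustration rather than the paper's agent: in a real system the `policy` would be a language model, while here it is any callable that maps the history so far to the next action. The names `ActionKind`, `Action`, and `run_episode` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Tuple


class ActionKind(Enum):
    BROWSE = "browse"      # interact with the web page
    API_CALL = "api_call"  # execute code that calls an API
    ASK = "ask"            # ask for clarification in natural language
    FINISH = "finish"      # stop and report the final answer


@dataclass
class Action:
    kind: ActionKind
    payload: str


History = List[Tuple[Action, str]]


def run_episode(policy: Callable[[History], Action],
                handlers: Dict[ActionKind, Callable[[str], str]],
                max_steps: int = 10) -> History:
    """Ask the policy for one action per step and route it to a handler."""
    history: History = []
    for _ in range(max_steps):
        action = policy(history)
        if action.kind is ActionKind.FINISH:
            # The payload of a FINISH action is treated as the final answer.
            history.append((action, action.payload))
            break
        observation = handlers[action.kind](action.payload)
        history.append((action, observation))
    return history
```

Because each observation is appended to the history before the next decision, the policy can change interface mid-task, for example falling back to browsing after an API call returns an error.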

Impact and Future Directions

The implications of transitioning to API-Based and Hybrid Agents are significant for developing more efficient and capable AI web agents. This research underscores the potential of APIs to streamline web interactions beyond traditional GUI-based approaches.

Future work may focus on automating the discovery and use of undocumented APIs, possibly leveraging LLM capabilities to complete partial API definitions or to translate web interactions into formal API calls.

Conclusion

The study convincingly argues that API interactions, designed specifically for machine use, are not merely an alternative but often a complementary or superior strategy to traditional web browsing for AI agents. The Hybrid Agent, with its adaptive strategy, demonstrates particularly robust performance across varying API support levels, setting a new standard for task-agnostic web agents on complex web-navigation tasks.
