Are Autonomous Web Agents Good Testers?

Published 2 Apr 2025 in cs.SE | (2504.01495v1)

Abstract: Despite advances in automated testing, manual testing remains prevalent due to the high maintenance demands associated with test script fragility: scripts often break with minor changes in application structure. Recent developments in LLMs offer a potential alternative by powering Autonomous Web Agents (AWAs) that can autonomously interact with applications. These agents may serve as Autonomous Test Agents (ATAs), potentially reducing the need for maintenance-heavy automated scripts by utilising natural language instructions similar to those used by human testers. This paper investigates the feasibility of adapting AWAs for natural language test case execution and how to evaluate them. We contribute (1) a benchmark of three offline web applications and a suite of 113 manual test cases, split between passing and failing cases, to evaluate and compare ATAs' performance, (2) SeeAct-ATA and PinATA, two open-source ATA implementations capable of executing test steps, verifying assertions and giving verdicts, and (3) comparative experiments using our benchmark that quantify our ATAs' effectiveness. Finally, we conduct a qualitative evaluation to identify the limitations of PinATA, our best-performing implementation. Our findings reveal that our simple implementation, SeeAct-ATA, does not perform well compared to our more advanced PinATA implementation when executing test cases (50% performance improvement). However, while PinATA obtains around 60% correct verdicts and up to a promising 94% specificity, we identify several limitations that need to be addressed to develop more resilient and reliable ATAs, paving the way for robust, low-maintenance test automation. CCS Concepts: • Software and its engineering → Software testing and debugging.


Summary

The paper investigates whether LLM-based autonomous agents can reduce the maintenance burden of traditional web test scripts.
It compares two ATA implementations: PinATA improves on SeeAct-ATA by 50% and reaches around 60% correct verdicts and up to 94% specificity.
Experimental results highlight the need for enhanced grounding and resilience to handle dynamic web interfaces effectively.


The paper explores the feasibility of using Autonomous Web Agents (AWAs), powered by LLMs, as Autonomous Test Agents (ATAs) for executing natural language test cases in web applications. This research addresses the transition from automated testing scripts, which tend to be fragile and require high maintenance, to autonomous agents that can reduce such burdens.

Introduction

Automated testing of web applications has traditionally relied on scripts that simulate user interactions. However, these scripts often break with minor changes in the application's structure, leading to significant maintenance effort [hammoudi_why_2016]. LLMs may help resolve these issues by enabling agents to interact with web applications through natural language instructions, much as manual testers do.
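The fragility described above can be illustrated with a small sketch. It is not taken from the paper: the two page snippets, the button label, and the helper names `find_by_structure` and `find_by_label` are all hypothetical, and the standard-library XML parser stands in for a real browser DOM. The sketch only shows why a locator tied to page structure breaks after a harmless layout change, while a lookup based on what the user sees does not.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages: version 2 only wraps the button in an extra <div>.
PAGE_V1 = "<html><body><form><button id='b1'>Submit</button></form></body></html>"
PAGE_V2 = (
    "<html><body><form><div class='row'>"
    "<button id='b1'>Submit</button>"
    "</div></form></body></html>"
)


def find_by_structure(page: str):
    """Locate the button with a hard-coded structural path, as a brittle script might."""
    root = ET.fromstring(page)
    return root.find("./body/form/button")


def find_by_label(page: str, label: str):
    """Locate the button by its visible text, closer to how a tester reads a step."""
    root = ET.fromstring(page)
    for element in root.iter("button"):
        if (element.text or "").strip() == label:
            return element
    return None


if __name__ == "__main__":
    print(find_by_structure(PAGE_V1) is not None)        # structural path works on v1
    print(find_by_structure(PAGE_V2) is not None)        # same path misses the button on v2
    print(find_by_label(PAGE_V2, "Submit") is not None)  # label lookup still finds it
```

A natural language test step such as "click the Submit button" corresponds to the second lookup: it describes the element the way a user perceives it, so it does not depend on where the element sits in the page tree.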

Figure 1: An E2E test black case.

Benchmark and ATA Implementations

The research presents a benchmark of 113 manual test cases, split between passing and failing cases, for three offline web applications, together with two open-source ATA implementations: SeeAct-ATA and PinATA. SeeAct-ATA is the simpler baseline, whereas PinATA incorporates more advanced grounding and reasoning techniques. PinATA performed about 50% better than SeeAct-ATA, obtaining around 60% correct verdicts and up to 94% specificity.

PinATA utilizes LLMs to interpret screenshots and interact with applications, using a structured iterative process to plan and execute web tasks autonomously [zheng_gpt-4vision_2024]. This involves performing actions, acquiring observational data, evaluating results, and deciding on subsequent steps.
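The act, observe, evaluate, decide cycle described above can be sketched as a plain loop. This is a minimal illustration under stated assumptions, not PinATA's actual implementation: the names `run_test_case` and `StepResult` are hypothetical, and the three callables are stand-ins for what would be LLM calls and browser interactions in a real agent.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    action: str       # description of what the agent did
    observation: str  # what the agent saw afterwards (e.g. a screenshot summary)
    ok: bool          # whether the observation matched the step's expectation


def run_test_case(
    steps: List[str],
    act: Callable[[str], str],
    observe: Callable[[], str],
    evaluate: Callable[[str, str], bool],
) -> Tuple[str, List[StepResult]]:
    """Run natural language steps in order and return a verdict plus a trace.

    The loop stops at the first step whose evaluation fails, which yields a
    "fail" verdict; if every step is evaluated as successful, it is "pass".
    """
    history: List[StepResult] = []
    for step in steps:
        action = act(step)                # plan and perform the action
        observation = observe()           # acquire observational data
        ok = evaluate(step, observation)  # evaluate the result of the step
        history.append(StepResult(action, observation, ok))
        if not ok:                        # decide whether to continue
            return "fail", history
    return "pass", history


if __name__ == "__main__":
    # Stub environment (assumption): a fake app whose only state is the current page.
    state = {"page": "login"}
    expected = {"Log in as admin": "dashboard", "Open settings": "settings"}

    def act(step: str) -> str:
        if step == "Log in as admin":
            state["page"] = "dashboard"
        return f"performed: {step}"

    verdict, trace = run_test_case(
        ["Log in as admin", "Open settings"],
        act,
        lambda: state["page"],
        lambda step, obs: obs == expected[step],
    )
    print(verdict, len(trace))
```

Keeping the three callables separate makes it easy to swap in a different model or grounding strategy for one stage without touching the control flow, which is the kind of variation the paper's comparison between implementations explores.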

Experimental Evaluation

The comparative experiments measure how well each ATA executes the benchmark's test cases. PinATA outperforms SeeAct-ATA by about 50%, yet it reaches a correct verdict in only around 60% of cases. The qualitative analysis points to misinterpretations and a need for greater resilience before such agents can be relied on in real-world scenarios.

Figure 2: Autonomous Web Agent main process.

The experiments use accuracy, specificity, and sensitivity to quantify how closely the ATAs' verdicts match those expected from human testers. PinATA achieves higher sensitivity and specificity than SeeAct-ATA.
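For reference, the three metrics can be computed from expected and predicted verdicts as in the sketch below. The function name `verdict_metrics` is hypothetical, and the choice of which verdict counts as the positive class is an assumption left as a parameter, since this summary does not state the paper's convention. The sample verdicts are made up for illustration and are not results from the paper.

```python
from typing import Dict, Sequence


def verdict_metrics(
    expected: Sequence[str],
    predicted: Sequence[str],
    positive: str = "fail",  # assumption: a "fail" verdict is the positive class
) -> Dict[str, float]:
    """Compute accuracy, sensitivity and specificity for pass/fail verdicts."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")

    tp = tn = fp = fn = 0
    for exp, pred in zip(expected, predicted):
        if exp == positive and pred == positive:
            tp += 1  # positive case correctly identified
        elif exp != positive and pred != positive:
            tn += 1  # negative case correctly identified
        elif exp != positive and pred == positive:
            fp += 1  # negative case wrongly flagged as positive
        else:
            fn += 1  # positive case missed

    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


if __name__ == "__main__":
    # Illustrative verdicts only.
    expected = ["fail", "fail", "pass", "pass"]
    predicted = ["fail", "pass", "pass", "pass"]
    print(verdict_metrics(expected, predicted))
```

With the positive class set to "fail", sensitivity is the share of truly failing test cases the agent flags, and specificity is the share of truly passing test cases it leaves unflagged. Passing `positive="pass"` swaps the two roles.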

Limitations and Future Research

Despite promising results, several limitations remain, including the agent's difficulty with dynamic web interfaces and its reasoning about complex web interactions. Future ATAs will need greater robustness, more precise grounding, and better handling of varied interface designs.

Figure 3: Executions of the 113 test cases with PinATA using three LLMs.

Conclusion

The paper concludes that while AWAs show potential as ATAs, reliable autonomous test agents are still some way off. The transition from fragile automated scripts to robust AI-driven agents hinges on improvements in LLM capabilities and on more advanced agent architectures. The benchmark and the two implementations can guide future research toward making these agents practical for web testing.

References

The exploration aligns with recent developments in LLMs and web automation agents, contributing valuable insights into the potential for autonomous web testing [wang_survey_2024] [zhang_webpilot_2024]. More research is needed to tackle existing challenges and leverage autonomous agents towards low-maintenance testing solutions.
