You Don't Know Until You Click: Automated GUI Testing for Production-Ready Software Evaluation

Published 17 Aug 2025 in cs.SE and cs.AI | (2508.14104v1)

Abstract: LLMs and code agents in software development are rapidly evolving from generating isolated code snippets to producing full-fledged software applications with graphical interfaces, interactive logic, and dynamic behaviors. However, current benchmarks fall short in evaluating such production-ready software, as they often rely on static checks or binary pass/fail scripts, failing to capture the interactive behaviors and runtime dynamics that define real-world usability - qualities that only emerge when an application is actively used. This is the blind spot of current evaluation: you don't know if an app works until you click through it, interact with it, and observe how it responds. To bridge this gap, we introduce RealDevWorld, a novel evaluation framework for automated end-to-end assessment of LLMs' ability to generate production-ready repositories from scratch. It features two key components: (1) RealDevBench, a diverse collection of 194 open-ended software engineering tasks across multiple domains, incorporating multimodal elements to reflect real-world complexity; and (2) AppEvalPilot, a new agent-as-a-judge evaluation system that simulates realistic, GUI-based user interactions to automatically and holistically assess software functional correctness, visual fidelity, and runtime behavior. The framework delivers fine-grained, task-specific diagnostic feedback, supporting nuanced evaluation beyond simple success/failure judgments. Empirical results show that RealDevWorld delivers effective, automatic, and human-aligned evaluations, achieving an accuracy of 0.92 and a correlation of 0.85 with expert human assessments, while significantly reducing the reliance on manual review. This enables scalable, human-aligned assessment of production-level software generated by LLMs. Our code is available on GitHub.

Summary

  • The paper introduces RealDevWorld, which combines RealDevBench and AppEvalPilot to assess production-ready software through automated GUI testing.
  • It employs multimodal elements and agent-based testing to simulate realistic user interactions, achieving 0.92 accuracy and 0.85 expert correlation.
  • Empirical results show that the framework outperforms traditional benchmarks by providing scalable, nuanced, and human-aligned software quality evaluations.

Automated GUI Testing for Production-Ready Software Evaluation

Introduction

The paper addresses the evaluation of production-level software generated by LLMs through automated GUI testing. As LLMs progress toward generating sophisticated software with dynamic user interfaces, current benchmarks fall short in capturing the runtime dynamics and interactive behaviors critical for production-ready applications.

RealDevWorld: A Novel Evaluation Framework

The authors introduce RealDevWorld, an end-to-end framework designed to evaluate the ability of LLMs to create complete, production-ready software repositories. RealDevWorld consists of two main components: RealDevBench and AppEvalPilot.

RealDevBench

RealDevBench is a comprehensive suite of 194 software engineering tasks sampled from real-world development scenarios across four domains: data, display, analysis, and games. It is designed to test an LLM's ability to handle complex software requirements by integrating multimodal elements such as images and audio.

Figure 1: Representative cases from across four domains - Data, Display, Analysis, and Game - with consistent triplet structure (requirements, features, materials), reflecting real-world software engineering challenges.
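To make the triplet structure concrete, the sketch below shows one plausible in-memory representation of a benchmark task. The class and field names are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class RealDevBenchTask:
    """One benchmark task with the (requirements, features, materials)
    triplet from Figure 1. Field names are illustrative, not the
    dataset's actual schema."""
    domain: str                # "Data", "Display", "Analysis", or "Game"
    requirements: str          # natural-language description of the target app
    features: list[str]        # discrete capabilities the app must expose
    materials: list[str] = field(default_factory=list)  # e.g. image/audio assets

# A hypothetical Game-domain task:
task = RealDevBenchTask(
    domain="Game",
    requirements="Build a browser-based memory card game.",
    features=["shuffle cards on start", "track and display the move count"],
    materials=["assets/card_faces.png"],
)
```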

AppEvalPilot

AppEvalPilot is a novel agent-based evaluation system that autonomously performs GUI-based interactions to assess the functional correctness of generated software. It emulates realistic user behaviors, captures detailed execution traces, and provides nuanced feedback beyond binary success/failure metrics.

Figure 2: Overall design of AppEvalPilot showing the automated testing workflow: test case generation from user requirements, multimodal test execution through interface interaction (scrolling, typing, clicking), and binary evaluation of outcomes for objective software assessment.
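As a rough illustration of this workflow, the following sketch shows an agent-as-a-judge loop that observes the GUI, proposes an interaction (click, type, scroll), executes it, and finally renders a binary verdict over the recorded trace. The callback-style design and all names here are assumptions for illustration; AppEvalPilot's actual interfaces may differ.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    kind: str          # "click" | "type" | "scroll" | "stop"
    target: str = ""   # e.g. a CSS selector or screen coordinate
    text: str = ""     # text to type, if kind == "type"

# Hypothetical interfaces: in a real system these would wrap a browser
# driver and a vision-language model.
Observe = Callable[[], bytes]                            # returns a screenshot
Policy = Callable[[str, bytes, list], Action]            # proposes the next action
Execute = Callable[[Action], None]                       # performs the action
Judge = Callable[[str, list], bool]                      # final pass/fail verdict

def run_test_case(test_case: str, observe: Observe, policy: Policy,
                  execute: Execute, judge: Judge, max_steps: int = 20) -> bool:
    """Drive the GUI until the agent decides to stop or the step budget runs out."""
    trace: list[tuple[bytes, Action]] = []
    for _ in range(max_steps):
        screenshot = observe()
        action = policy(test_case, screenshot, trace)
        if action.kind == "stop":
            break
        execute(action)
        trace.append((screenshot, action))
    return judge(test_case, trace)  # binary outcome, as in Figure 2
```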

Evaluation Methodology

The paper presents a new methodology for evaluating LLM-generated software using dynamic interaction testing. This approach focuses on user-centric and runtime-dependent behaviors that conventional static metrics fail to capture. The framework effectively measures software correctness and quality in a realistic usage context through adaptive test case generation and execution.

Figure 3: Evaluation pipeline of AppEvalPilot. The agent performs test sequences on two different web implementations, systematically assesses functionality through direct interaction, documents observable differences in form behavior, and generates quantitative scores based on test cases.
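One natural way to turn the binary per-test outcomes into the quantitative scores the pipeline produces is a simple pass rate over the generated test cases. The sketch below assumes unweighted test cases, which the paper does not confirm.

```python
def functional_score(results: dict[str, bool]) -> float:
    """Aggregate binary test outcomes into a 0-1 quality score.

    `results` maps a test-case description to its pass/fail verdict.
    An unweighted pass rate is assumed here; the paper may weight
    test cases differently.
    """
    if not results:
        return 0.0
    return sum(results.values()) / len(results)

# Hypothetical outcomes for one generated app
results = {
    "form rejects empty email": True,
    "submit button saves record": False,
    "page renders without console errors": True,
}
print(f"score = {functional_score(results):.2f}")  # score = 0.67
```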

Empirical Results and Findings

Experiments show that the RealDevWorld framework provides human-aligned software evaluations with an accuracy of 0.92 and a correlation of 0.85 with expert assessments. This demonstrates the framework’s ability to deliver scalable and nuanced evaluations, significantly reducing reliance on manual review while aligning closely with human judgments.
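For intuition, agreement metrics like these can be computed from paired agent and human judgments roughly as follows. The scores below are hypothetical, and since the paper does not specify the correlation statistic, Pearson is an assumption.

```python
from statistics import mean, stdev

def accuracy(agent: list[bool], human: list[bool]) -> float:
    """Fraction of cases where the agent's verdict matches the human's."""
    return sum(a == h for a, h in zip(agent, human)) / len(agent)

def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two score lists (an assumed choice;
    the paper does not name its correlation statistic)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

# Hypothetical per-case verdicts and per-project scores
print(accuracy([True, False, True, True], [True, True, True, True]))  # 0.75
agent_scores = [0.9, 0.4, 0.7, 1.0, 0.2]
human_scores = [1.0, 0.5, 0.6, 0.9, 0.3]
print(f"correlation = {pearson(agent_scores, human_scores):.2f}")  # 0.95
```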

Comparative Evaluation

RealDevWorld outperforms existing benchmarks by providing a more comprehensive evaluation platform for repository-level code generation tasks. The framework adapts to a wide variety of input data and interaction scenarios, making it applicable across diverse application types.

Figure 4: Comparative analysis of evaluation methods versus human quality. (Left) AppEvalPilot's autonomous evaluation, (Middle) Static LLM code scoring, (Right) Visual aesthetic scoring. Each point represents one project, with radial distance indicating quality scores (0-1 scale).

Conclusion

This framework facilitates a deeper understanding of LLM capabilities in generating production-ready software, offering a robust solution for automated, scalable, and nuanced software evaluation. By addressing the limitations of existing static evaluation methods, RealDevWorld paves the way for future advancements in AI-driven software engineering assessment.
