MobileWorldBench: Semantic Mobile GUI Benchmark
- MobileWorldBench is a benchmark suite that systematically evaluates semantic world modeling for mobile GUI agents using natural-language state predictions.
- It comprises 1.4M annotated transitions from real mobile interactions, offering rigorous datasets and metrics for next-state generation and yes/no QA tasks.
- Integration with model-based planning pipelines shows significant improvements in task success rates and decision-making over traditional pixel-based models.
MobileWorldBench is a benchmark and dataset suite that systematically evaluates semantic world modeling for mobile GUI agents, focusing on the capacity of vision–language models (VLMs) to predict and reason over future mobile app states in abstract, decision-relevant natural language rather than low-level pixel outputs. Conceived to address the limitations of pixel-space world models in complex GUI environments, MobileWorldBench and its associated MobileWorld dataset provide comprehensive empirical infrastructure for advancing model-based planning in mobile interaction scenarios, scaling to 1.4 million annotated transitions and integrating with concrete agent planning frameworks (Li et al., 16 Dec 2025).
1. Benchmark Definition and Objectives
The fundamental goal of MobileWorldBench is to enable rigorous, large-scale evaluation of a VLM's ability to model state transitions in mobile GUIs at the semantic level. The benchmark is built on two primary tasks:
- Next-State-Generation: Given a current screen $s_t$ and an action $a_t$, the model generates a natural-language description $c_t$ of the resulting state change.
- Next-State-QA: Given $(s_t, a_t)$, the model answers a set of yes/no questions about the predicted next state $s_{t+1}$.
This formulation is motivated by the impracticality of pixel-level world models in GUI settings, where the diversity of layouts, textual content, and icons makes faithful pixel-wise prediction intractable and unnecessary for decision-making. Instead, semantic world models abstract the dynamics into high-level, goal-relevant descriptions—sufficient for downstream planning frameworks.
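The two task formats above can be sketched as simple input/target records. The field names and contents below are hypothetical illustrations, not the benchmark's actual schema:

```python
# Minimal sketch of the two MobileWorldBench task formats.
# All field names and values here are hypothetical illustrations,
# not the benchmark's actual schema.

def next_state_generation_example():
    """A generation sample: (screen, action) -> free-form change description."""
    return {
        "screenshot": "screen_0421.png",           # current screen s_t
        "action": "tap the 'Add to cart' button",  # natural-language action a_t
        "target": "A snackbar appears confirming the item was added; "
                  "the cart badge count increases from 1 to 2.",
    }

def next_state_qa_example():
    """A QA sample: (screen, action, question) -> yes/no answer."""
    return {
        "screenshot": "screen_0421.png",
        "action": "tap the 'Add to cart' button",
        "question": "Does the cart badge count change?",
        "answer": "yes",
    }
```

Both tasks share the same $(s_t, a_t)$ input; only the supervision target differs, which is what allows a single model to serve both heads.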
2. Dataset Composition and Annotation Pipeline
MobileWorld, the dataset underpinning MobileWorldBench, consists of 1.4 million $(s_t, a_t, s_{t+1})$ transition triplets, curated from human interaction trajectories recorded in large-scale mobile environments such as Android in the Wild (AiTW) and AndroidControl. Each transition is annotated with:
- Three candidate natural-language state-change descriptions, with best-of-three human selection.
- Approximately eight candidate yes/no QA pairs per transition, filtered for relevance and correctness, yielding 540,000 high-quality pairs.
The annotation pipeline is highly structured:
- Trajectory Collection: Automated recording of user actions and resulting GUIs.
- VLM Annotation: High-level actions are derived from low-level gestures and visual overlays, and a VLM is prompted to propose candidate state-change descriptions and questions.
- Automated Filtering: VLM self-checking (the annotator answers its own questions given the observed next screenshot $s_{t+1}$), with irrelevant or inconsistent pairs pruned.
- Human Verification: Manual curation to ensure correctness, focus, and non-ambiguity, culminating in a finalized pool of 1,787 human-verified QA items for benchmark evaluation.
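The automated-filtering step can be illustrated with a toy consistency check, where a candidate QA pair is kept only if the annotator model reproduces its own proposed answer when shown the observed next screen. Here `vlm_answer` is a hypothetical stand-in for the real VLM call:

```python
# Toy sketch of the VLM self-check filter: a candidate QA pair survives
# only if the annotator VLM, shown the observed next screen, reproduces
# the proposed answer. `vlm_answer` is a hypothetical stand-in.

def filter_qa_pairs(candidates, next_screen, vlm_answer):
    """Keep (question, answer) pairs the VLM answers consistently."""
    kept = []
    for question, proposed in candidates:
        if vlm_answer(next_screen, question) == proposed:
            kept.append((question, proposed))
    return kept

# Usage with a trivial stub that "answers" from a lookup table:
oracle = {"Does a snackbar appear?": "yes", "Does the app close?": "no"}
stub = lambda screen, q: oracle.get(q, "no")
candidates = [("Does a snackbar appear?", "yes"),
              ("Does the app close?", "yes")]   # second pair is inconsistent
print(filter_qa_pairs(candidates, "screen_0422.png", stub))
# -> [('Does a snackbar appear?', 'yes')]
```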
A typical sample pairs a screenshot and action with a short state-change description and several associated yes/no questions.
3. Semantic World Model Formalism
MobileWorldBench frames semantic world modeling as a latent-variable generative process,

$$p(s_{t+1} \mid s_t, a_t) = \sum_{z} p(z \mid s_t, a_t)\, p(s_{t+1} \mid s_t, z),$$

where $z$ represents the latent semantic change. The VLM is trained to model $p_\theta(c_t \mid s_t, a_t)$, where $c_t$ is a natural-language description of the change, and supports two operational modes:
- Text Generation: $c_t \sim p_\theta(\cdot \mid s_t, a_t)$, a free-form description of the state change.
- Yes/No QA: for each question $q_i$, $\hat{y}_i = \arg\max_{y \in \{\text{yes},\,\text{no}\}} p_\theta(y \mid s_t, a_t, q_i)$.
The training objective is a joint cross-entropy loss over the generation and QA heads, $\mathcal{L} = \mathcal{L}_{\text{gen}} + \lambda\, \mathcal{L}_{\text{QA}}$.
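The joint objective over generation and QA heads can be sketched numerically. The token and answer probabilities below are toy values, and the weight `lambda_qa` is an assumed hyperparameter, not a value from the paper:

```python
import math

# Toy sketch of the joint training objective: cross-entropy over the
# generated description tokens plus a weighted cross-entropy over the
# yes/no answers. Probabilities are illustrative; lambda_qa is an
# assumed hyperparameter, not a value reported in the paper.

def generation_loss(token_probs):
    """Mean negative log-likelihood of the target description tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def qa_loss(answer_probs):
    """Mean negative log-likelihood of the correct yes/no answers."""
    return -sum(math.log(p) for p in answer_probs) / len(answer_probs)

def joint_loss(token_probs, answer_probs, lambda_qa=1.0):
    return generation_loss(token_probs) + lambda_qa * qa_loss(answer_probs)

loss = joint_loss(token_probs=[0.9, 0.8, 0.95], answer_probs=[0.7, 0.6])
```

Averaging within each head before summing keeps the two terms on comparable scales regardless of description length or question count.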
4. Vision–LLM Architecture and Training
The core model is Qwen3-VL-8B-Instruct, a multimodal transformer with separate vision and language modules:
- Vision Encoder: Patch-based ViT encoder followed by projection and cross-attention integration into the text stack.
- LLM: Decoder-only transformer optimized for instruction-following.
- Input Representation: Encoded screenshot $s_t$, natural-language action $a_t$, and optional question $q_i$, fused via interleaved cross-modal attention.
- Optimization Regimen: Fine-tuned for 2 epochs on MobileWorld using 8 × NVIDIA A6000 GPUs, AdamW optimizer, batch size 128, with separate learning rates for the LLM and the vision encoder.
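The two-learning-rate regimen corresponds to optimizer parameter groups. A minimal sketch, with placeholder rates since the paper's exact values are not reproduced here:

```python
# Sketch of the fine-tuning setup: AdamW with separate parameter groups
# for the LLM and the vision encoder. The learning-rate values below are
# placeholders, not the paper's actual settings.

def make_param_groups(llm_params, vision_params,
                      llm_lr=1e-5, vision_lr=1e-6):  # hypothetical rates
    return [
        {"params": llm_params, "lr": llm_lr},
        {"params": vision_params, "lr": vision_lr},
    ]

# With a real framework these groups would be passed to the optimizer,
# e.g. torch.optim.AdamW(make_param_groups(...), weight_decay=0.01).
groups = make_param_groups(["llm.w1", "llm.w2"], ["vit.w1"])
```

Using a smaller rate for the vision encoder is a common choice when fine-tuning multimodal models, to avoid disrupting pretrained visual features.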
5. Integration with Model-Based Planning Pipelines
MobileWorldBench is explicitly designed for integration into model-based agent planning. The canonical inference loop is as follows:
- At step $t$, given goal $g$ and current state $s_t$, an action proposal model generates candidates $a_t^{(1)}, \dots, a_t^{(K)}$.
- For each candidate $a_t^{(k)}$, the VLM-based semantic world model predicts the resulting state change.
- A free-form description $c_t^{(k)}$ is decoded.
- Each candidate is scored via a value model $V(g, s_t, c_t^{(k)})$.
- The agent selects $a_t^{*} = \arg\max_k V(g, s_t, c_t^{(k)})$.
This one-step lookahead approximately maximizes the expected-reward objective

$$a_t^{*} = \arg\max_{a}\; \mathbb{E}_{c \sim p_\theta(\cdot \mid s_t, a)}\big[\hat{r}(g, s_t, c)\big],$$

where $\hat{r}$ is the value-model-predicted reward.
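The lookahead loop above can be sketched as follows; `world_model` and `value_model` are hypothetical stand-ins for the semantic world model and the learned value model:

```python
# Toy sketch of the one-step lookahead loop. `world_model` and
# `value_model` are hypothetical stand-ins for the semantic world model
# and the learned value model described above.

def select_action(goal, state, candidates, world_model, value_model):
    """Score each candidate action by the value of its predicted outcome."""
    best_action, best_value = None, float("-inf")
    for action in candidates:
        description = world_model(state, action)   # predicted state change c
        value = value_model(goal, state, description)
        if value > best_value:
            best_action, best_value = action, value
    return best_action

# Usage with trivial stubs:
wm = lambda s, a: f"after '{a}', the {s} screen updates"
vm = lambda g, s, c: 1.0 if g in c else 0.0
print(select_action("checkout", "cart", ["tap checkout", "tap back"], wm, vm))
# -> tap checkout
```

Because scoring happens over short text descriptions rather than rendered screens, the loop stays cheap even with many candidate actions.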
6. Evaluation Metrics and Empirical Findings
Evaluation is performed on both text generation and QA axes:
- Next-State-Generation: VLM-judge metrics for accuracy (0–5), relevance (0–5), and completeness (0–5), summed into an overall score in $[0, 15]$.
- Next-State-QA: Fraction of correctly answered yes/no questions (binary accuracy).
Benchmark splits:
- Generation: 250 held-out transitions.
- QA: 500 held-out transitions, 1,787 QA pairs.
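The two evaluation axes reduce to simple aggregations; a minimal sketch with toy judge scores:

```python
# Sketch of the two evaluation metrics. Judge scores (accuracy,
# relevance, completeness, each 0-5) are toy values; the overall
# generation score is their sum, on a 0-15 scale.

def generation_overall(judge_scores):
    """Sum the three 0-5 judge axes into a 0-15 overall score."""
    accuracy, relevance, completeness = judge_scores
    return accuracy + relevance + completeness

def qa_accuracy(predictions, references):
    """Fraction of yes/no questions answered correctly."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

print(generation_overall((4, 5, 3)))                           # -> 12
print(qa_accuracy(["yes", "no", "yes"], ["yes", "no", "no"]))  # 2 of 3 correct
```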
Quantitative improvements from MobileWorld-finetuned VLM (vs. base Qwen3-VL-8B-Instruct):
| Metric | Baseline | Finetuned | Gain |
|---|---|---|---|
| Generation Overall (/15) | 11.84 | 12.39 | +4.7% (relative) |
| QA Accuracy (%) | 67.32 | 71.40 | +4.08 pts |
| Downstream Task SR (%) | 46.9 | 54.3 | +7.4 pts |
Ablation experiments indicate that training on labels from the VLM-annotator pipeline outperforms using off-the-shelf VLMs directly, supporting the validity of the data-construction approach.
7. Implications and Research Significance
- Tractability: Semantic world modeling bypasses the intractable combinatorics of pixel forecasting in GUIs, allowing direct optimization for decision-relevant abstractions.
- Benchmark Scale and Quality: MobileWorldBench is the first large-scale, human-verified benchmark for semantic next-state prediction in mobile UIs, with 1.4 M annotated samples.
- Empirical Utility: VLMs finetuned on MobileWorld show significant gains in state-change generation, question answering, and end-to-end AndroidWorld task success, illustrating the real-world benefits of semantic world modeling.
- Planning Integration: Model-based planning with a learned value model over semantic summaries yields a >7 percentage point increase in task completion, directly affecting agent robustness in realistic settings.
- Future Directions: This semantic abstraction approach supports further development of high-level reasoning in embodied and GUI agents, with expected impact on robust, sample-efficient task completion and transferability across diverse UIs.
MobileWorldBench is openly available, including dataset, benchmark splits, and model checkpoints, facilitating broad adoption and reproducibility in research on semantic world models for mobile agents (Li et al., 16 Dec 2025).