gWorld: Data Generation Framework

Updated 9 February 2026
  • gWorld is a family of open-weight vision–language models that generate HTML + CSS code to predict mobile GUI states with high semantic and visual fidelity.
  • It employs a two-stage architecture with a frozen vision encoder and a fine-tuned language decoder, leveraging code generation to ensure precise layout rendering.
  • The framework uses a robust data generation pipeline, repurposing GUI transition datasets and synthesizing reasoning traces to achieve state-of-the-art performance.

gWorld (8B, 32B) is a family of open-weight vision–language World Models (WMs) designed specifically for mobile Graphical User Interface (GUI) state prediction. Unlike previous approaches that generate next GUI states as raw pixels or as text, gWorld employs a code-generation paradigm in which the model predicts executable web code (HTML + CSS). This shift enables pixel-perfect text rendering and high-fidelity layouts while retaining the semantic precision of vision-language models (VLMs). Built on Qwen3-VL 8B and Qwen3-VL 32B backbones and a comprehensive code-based data generation pipeline, gWorld establishes a new accuracy–model-size Pareto frontier on key benchmarks, outperforming much larger baseline models (Koh et al., 2 Feb 2026).

1. Architectural Foundation and Training

gWorld 8B and gWorld 32B are derived via supervised fine-tuning of Qwen3-VL 8B and Qwen3-VL 32B, respectively. The architectures retain a two-stage VLM backbone: a frozen vision encoder producing patch embeddings, followed by an LLM-style decoder; an MLP projector maps image tokens into the LLM's token space. During adaptation for code generation, the decoder's existing output vocabulary and positional embedding layers are reused to produce HTML tokens directly, with no new attention mechanisms added. Cross-modal attention between image patches and text naturally enables the model to reason jointly about layout, typography, and semantics. The vision encoder parameters remain frozen during fine-tuning; only the decoder and the lightweight MLP projector are updated. The 8B/32B labels denote total fine-tunable parameter counts. On modern 4 × H200 GPU setups, throughput reaches approximately 20,000 tokens/sec for gWorld 8B (0.25 s per state) and 5,000 tokens/sec for gWorld 32B (1 s per state), supporting high-throughput simulation.
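
The reported throughput and latency figures are mutually consistent: at both model sizes they imply roughly 5,000 generated tokens per predicted state. A minimal sanity-check sketch (the per-state token count is an inference from the reported numbers, not stated explicitly in the source):

```python
# Back-of-envelope check of the reported throughput figures.
# TOKENS_PER_STATE is an assumed average length of one generated
# HTML+CSS state, inferred from 20,000 tok/s x 0.25 s/state.
TOKENS_PER_STATE = 5_000

def seconds_per_state(tokens_per_sec: float,
                      tokens_per_state: int = TOKENS_PER_STATE) -> float:
    """Latency to emit one full next-state prediction."""
    return tokens_per_state / tokens_per_sec

print(seconds_per_state(20_000))  # gWorld 8B  -> 0.25 s per state
print(seconds_per_state(5_000))   # gWorld 32B -> 1.0 s per state
```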

2. Code-Generation Paradigm

Unlike traditional pixel-generation WMs that estimate $p_\theta(S_{t+1} \mid S_t, A_t)$ in image space, gWorld models $p_\theta(\mathrm{code}(S_{t+1}) \mid S_t, A_t)$, directly predicting HTML + CSS code that, when rendered, reproduces the screenshot of the predicted GUI state. The supervised fine-tuning loss is expressed as:

$$\mathcal{L}_{\mathrm{SFT}} = -\sum_{t} \log p_{\theta}\bigl(R_t, S_{t+1}^{\mathrm{code}} \mid S_t^{\mathrm{image}}, A_t\bigr) \approx -\mathbb{E}_t\bigl[\log p_\theta(R_t \mid S_t, A_t)\bigr] - \mathbb{E}_t\bigl[\log p_\theta(S_{t+1}^{\mathrm{code}} \mid R_t, S_t, A_t)\bigr]$$

where $R_t$ is an intermediate natural-language reasoning trace describing semantic state transitions. This chain-of-thought decomposition stabilizes learning. Code tokens, by explicitly encoding text content, typography, and element positions, systematically eliminate errors such as illegible text or distorted GUI layouts that are common in pixel-generation approaches.
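
The two-term factorization above can be sketched at token level as a sum of negative log-likelihoods over the reasoning-trace tokens and the code tokens. The per-token probabilities below are hypothetical stand-ins for $p_\theta(\text{token} \mid \text{context})$:

```python
import math

def sft_loss(reasoning_token_probs, code_token_probs):
    """Two-term SFT objective: NLL of the reasoning trace R_t plus
    NLL of the code tokens for S_{t+1}, conditioned on R_t."""
    nll_reasoning = -sum(math.log(p) for p in reasoning_token_probs)
    nll_code = -sum(math.log(p) for p in code_token_probs)
    return nll_reasoning + nll_code

# Toy example: a two-token reasoning trace followed by three code tokens.
loss = sft_loss([0.9, 0.8], [0.95, 0.7, 0.85])
```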

3. Data Generation and Repurposing Pipeline

The gWorld data generation framework synthesizes large-scale, high-quality training data as follows:

  • Trajectory Repurposing: Transition tuples $(S_t, A_t, S_{t+1})$ are extracted from existing offline policy datasets (AndroidInTheWild, GUIOdyssey, AndroidControl, AMEX), forming pairs $(S_t, A_t) \rightarrow S_{t+1}$.
  • Cross-modal Relabeling: A large frontier model, Gemini 3 Flash, is prompted to translate each next-state screenshot into semantically equivalent, renderable web code. This step yields $S_{t+1}^{\mathrm{code}}$.
  • Reasoning Synthesis: The same model, conditioned on both $S_t$ and $S_{t+1}$, generates $R_t$, a textual state-change summary.
  • Scale: The initial gWorld suite comprises 260,000 annotated samples. Empirical scaling studies suggest continued gains as data scale increases toward the 3.7 million available transitions, following a fitted power law ($R^2 \geq 0.94$).
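
The three pipeline stages can be sketched as follows. Here `relabel_to_code` and `synthesize_reasoning` are hypothetical stubs standing in for the prompted Gemini 3 Flash calls, not real APIs:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    s_t: str        # current-state screenshot (path/bytes in practice)
    a_t: str        # action, kept in coordinate form
    reasoning: str  # synthesized state-change summary R_t
    next_code: str  # HTML+CSS for S_{t+1}

def relabel_to_code(next_screenshot: str) -> str:
    # Stub for "translate screenshot -> renderable web code".
    return f"<html><!-- rendered from {next_screenshot} --></html>"

def synthesize_reasoning(s_t: str, s_next: str) -> str:
    # Stub for "describe the semantic state change".
    return f"transition from {s_t} to {s_next}"

def build_dataset(transitions):
    """transitions: iterable of (S_t, A_t, S_{t+1}) tuples
    repurposed from offline policy trajectories."""
    return [
        Sample(s_t, a_t, synthesize_reasoning(s_t, s_next),
               relabel_to_code(s_next))
        for (s_t, a_t, s_next) in transitions
    ]
```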

4. Evaluation Metrics, Benchmarks, and Baselines

Evaluation is conducted via MWMBench, which consists of six next-state prediction tasks: four in-distribution (AitW, GUIOdyssey, AndroidControl, AMEX) and two out-of-distribution splits (AndroidWorld, KApps, the latter being Korean-language GUIs). Actions are retained in coordinate form to preserve execution semantics.

The primary metrics are:

  • Instruction Accuracy (IAcc.): The proportion of step predictions judged correct by an ensemble of VLM-judges (GPT-5 Mini, Claude 4.5 Haiku, Gemini 3 Flash).
  • Similarity: The average cosine similarity between DINO v1/v2 image embeddings of rendered outputs and target screenshots.
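
The Similarity metric can be sketched as plain cosine similarity between embedding vectors; the vectors below are hypothetical placeholders for DINO features of the rendered output and the target screenshot:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors
    (in practice, DINO features of rendered vs. target screenshots)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

pred_emb = [0.2, 0.9, 0.1]    # hypothetical embedding of rendered output
target_emb = [0.25, 0.85, 0.15]  # hypothetical embedding of ground truth
sim = cosine_similarity(pred_emb, target_emb)
```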

Baselines include both pixel-generating models (Qwen-Image-Edit 20B, Emu3.5 34B) and large code-generation–capable VLMs (Llama 4 109B/402B, Qwen3-VL 8B/32B/235B, GLM-4.6V 106B).

| Model | Param. Count | IAcc. (%) | Render-Fail Rate (%) |
|---|---|---|---|
| gWorld 8B | 8B | 74.9 | <1 |
| gWorld 32B | 32B | 79.6 | <1 |
| Llama 4 402B | 402B | ~59 | n/a |

gWorld models surpass baseline models far larger in parameter count by more than 20 percentage points in IAcc. Average render-fail rates are maintained below 1%.

5. Ablation Studies

Ablation and analysis clarify the key contributors to gWorld’s performance:

  • Image-Generation Limitations: Pixel models (e.g., diffusion architectures) exhibit a high correlation ($\rho > 0.7$) between input–output similarity $\mathrm{Sim}(S_t, S_{t+1})$ and prediction similarity $\mathrm{Sim}(\hat{S}_{t+1}, S_{t+1})$, indicative of near-identity copying and limited capacity to model actual transition dynamics. Conversely, gWorld's $\rho \approx 0.4$ demonstrates more effective modeling of non-trivial state transitions.
  • Data Scaling: Sequential increases in training set size (37K ⇒ 77K ⇒ 129K ⇒ 240K samples) yield monotonic IAcc. improvements across all test splits. The results follow a power law: scaling the data to the full 3.7M samples is predicted to yield non-saturating improvements.
  • Code-Generation Ablation: Cross-modal relabeling (frontier model–generated code from screenshots) improves IAcc. by +5.4 percentage points versus naïve image→code prompting and raises renderable code rates from 97% to 100%.
  • Reasoning Trace: Providing the look-ahead synthesized $R_t$ boosts IAcc. by up to 3 percentage points when training on smaller datasets (37K samples).
  • Downstream Policy Gains: Integrating gWorld 8B in an M3A agent’s rollout-and-value planning yields an average step-wise accuracy improvement of +22 percentage points compared to a Qwen3-VL 8B value baseline.
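
Rollout-and-value planning of this kind can be sketched as follows; `world_model` and `value_fn` are hypothetical callables standing in for gWorld and a value model, not the actual M3A interfaces:

```python
def plan_step(state, candidate_actions, world_model, value_fn):
    """For each candidate action, predict the next state with the
    world model, score it with a value function, and return the
    highest-scoring action."""
    scored = [(value_fn(world_model(state, a)), a)
              for a in candidate_actions]
    _best_value, best_action = max(scored)
    return best_action

# Toy usage with stand-in model and scorer:
wm = lambda s, a: f"{s}->{a}"   # pretend next-state code
vf = lambda code: len(code)     # trivial scorer for illustration
best = plan_step("home", ["tap", "scroll_down"], wm, vf)
```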

6. Implications and Significance

gWorld’s code-as-world-modeling approach collapses high-fidelity rendering, semantic text precision, and action-conditional temporal dynamics into a single, self-contained VLM, avoiding the multi-stage dependencies of prior visual WMs. The open-weight architecture and data repurposing pipeline facilitate both supervised and reinforcement learning applications, enabling interactive, high-throughput GUI simulators. Empirical evidence suggests that continued data scaling, further refinements in code synthesis, and more sophisticated reasoning supervision will yield persistent gains, advancing the frontier in mobile GUI agent simulation and policy learning (Koh et al., 2 Feb 2026).
