
GUI Grounding Pre-training

Updated 13 February 2026
  • GUI Grounding Pre-training is a methodology where multimodal models learn to link natural language instructions with GUI elements through annotated screenshot corpora.
  • It leverages innovative data curation, modular transformer architectures, and techniques like LoRA and tuning-free attention to achieve high localization accuracy.
  • The approach incorporates multi-stage training with curriculum and reinforcement learning strategies to significantly boost GUI automation and reasoning performance.

Graphical User Interface (GUI) grounding pre-training refers to a set of methodologies by which multimodal language and vision models acquire the capability to accurately associate natural-language instructions with actionable visual interface elements, such as buttons, icons, or text boxes, within screenshots or rendered GUI images. This capability is foundational for visual GUI agents that automate user interactions across platforms—including desktop, web, and mobile—without access to structured data sources like DOM trees or accessibility overlays. The pre-training phase imparts element localization and alignment skills, generally leveraging large annotated screenshot corpora and a variety of surrogate objectives, and has been empirically shown to strongly mediate downstream success on GUI automation and reasoning tasks. Recent research advances span data engineering, architectural innovations, loss design, curriculum and reinforcement-based strategies, and even tuning-free approaches that exploit pretrained attention without additional gradient updates.

1. Foundations of GUI Grounding Pre-training

GUI grounding is formally modeled as a perception task: the system receives a screenshot $S \in \mathbb{R}^{H \times W \times 3}$ and a textual query $x$ ("Click the clock icon"), and is required to predict a corresponding location $y$ (typically a bounding box or point in normalized coordinates), i.e., to maximize $P(y \mid S, x)$ (Hui et al., 27 Jan 2025, Cheng et al., 2024). An alternative "reverse grounding" task also emerges, where, given a region $y$ on $S$, the model predicts the associated semantic description $x$. These bidirectional associations serve both as supervised objectives and as coordination signals to bridge the model's visual and linguistic representations.
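As a concrete illustration, the forward and reverse grounding objectives can be sketched as paired training records. The field names and prompt templates below are illustrative assumptions, not an interface defined in the cited papers:

```python
# Sketch of forward (instruction -> location) and reverse (location -> description)
# grounding examples. Record fields and prompt wording are illustrative only.

def make_forward_example(screenshot, query, point):
    """Forward grounding: predict a normalized (x, y) point for a query."""
    return {
        "image": screenshot,
        "prompt": f"Where is the element described by: '{query}'?",
        "target": point,  # e.g. (0.42, 0.17), normalized to [0, 1]
    }

def make_reverse_example(screenshot, point, description):
    """Reverse grounding: describe the element at a given location."""
    return {
        "image": screenshot,
        "prompt": f"What is at location {point}?",
        "target": description,
    }

fwd = make_forward_example("shot.png", "the clock icon", (0.42, 0.17))
rev = make_reverse_example("shot.png", (0.42, 0.17), "the clock icon")
```

Training on both directions from the same annotated (description, region) pairs is what makes the association bidirectional at no extra annotation cost.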

Historically, grounding has been treated as a precursor and bottleneck for higher-level GUI reasoning and planning in agents. Effective pre-training for GUI grounding yields significant lifts in clicking accuracy, element retrieval, and even multi-step automation performance on challenging agent benchmarks. Models such as SeeClick, WinClick, ZonUI-3B (Qwen-GUI-3B), UI-Ins, POINTS-GUI-G, and recent Tuning-free Attention-driven Grounding (TAG) methods mark successive advances in this space (Hsieh et al., 30 Jun 2025, Chen et al., 23 Oct 2025, Xu et al., 2024).

2. Data Engineering: Cross-Platform, Diversity, and Automated Alignment

High-quality data curation is critical to GUI grounding pre-training. Leading pipelines rely on heterogeneous screenshot corpora spanning web, desktop, and mobile interfaces, often sourced from publicly available datasets, vendor APIs, or synthetic renders:

  • Corpus Construction and Diversity: ZonUI-3B uses a balanced 24.1K example corpus drawn from four sources (ShowUI-Web, UGround-WebHybrid, AMEX, ShowUI-Desktop), evenly divided between platform classes and spanning resolutions from 448×448 to 1344×1344 pixels (Hsieh et al., 30 Jun 2025). POINTS-GUI-G unifies 13 public datasets and several synthetic collections, with all region coordinates normalized to [0,1]—essential for network invariance to image size (Zhao et al., 6 Feb 2026).
  • Redundancy Reduction and Sampling: Empirically, random sampling is often sufficient to prune duplicated GUI patterns. ZonUI-3B demonstrates that subsampling from 120K to 16.1K yields near-identical accuracy (82.8% vs 82.9% on ScreenSpot), substantially reducing compute without manual annotation (Hsieh et al., 30 Jun 2025).
  • Automated Region Alignment: WinClick employs both a ViT-BERT region proposal model and large vision-LLMs (e.g., GPT-4o) to label screenshots with high-fidelity (description, region) pairs entirely without DOM/HTML access (Hui et al., 27 Jan 2025). Synthetic instruction rewriting (UI-Ins) and GPT-based verification further raise instruction quality and diversity, reducing the measured instruction flaw rate from roughly 23% (Chen et al., 23 Oct 2025).
  • Difficulty Grading: The POINTS-GUI-G pipeline introduces layout entropy as a quantitative measure of spatial complexity, enabling curriculum schedules and targeted augmentation for rare “hard” cases, such as high-density or occluded desktop windows (Zhao et al., 6 Feb 2026).
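As a minimal illustration of the coordinate normalization used by pipelines such as POINTS-GUI-G, the sketch below maps pixel-space boxes into [0, 1] coordinates; the function name is an assumption for illustration, not code from the paper:

```python
def normalize_bbox(bbox_px, width, height):
    """Map a pixel-space box (x1, y1, x2, y2) to [0, 1] coordinates,
    so the target representation is invariant to input resolution."""
    x1, y1, x2, y2 = bbox_px
    return (x1 / width, y1 / height, x2 / width, y2 / height)

# The same relative layout yields the same target regardless of
# whether the screenshot was rendered at 448x448 or 1344x1344.
box = normalize_bbox((336, 112, 672, 224), width=1344, height=1344)
# box is approximately (0.25, 0.083, 0.5, 0.167)
```

Normalization is what lets a single model train across the resolution range cited above without re-scaling its output space.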

3. Model Architectures and Efficient Parameterization

Modern GUI grounding models are based on vision-language transformers capable of multi-resolution input and direct coordinate prediction, typically instantiated as large or medium-scale multimodal LLMs (e.g., Qwen2.5-VL-3B/7B/32B, Phi3-Vision, MiniCPM-Llama3-V 2.5):

  • Modular Fusion: Visual encoders (ViT or similar) produce patchwise embeddings, which are then fused with encoded instructions via dedicated cross-attention modules or concatenated in the transformer input (Hui et al., 27 Jan 2025, Cheng et al., 2024, Xu et al., 2024).
  • Lightweight Fine-tuning: LoRA (Low-Rank Adaptation) modules permit parameter-efficient pre-training. In ZonUI-3B, all transformer weights are frozen and LoRA adapters with rank $r=8$ and scaling factor $\alpha=16$ are injected, adding only $2rd$ parameters per $d \times d$ weight matrix (Hsieh et al., 30 Jun 2025, Cheng et al., 2024).
  • Continuous Vision Encoder Adaptation: POINTS-GUI-G finds that unfreezing and fine-tuning the vision encoder yields a 4–7 point accuracy gain across benchmarks, especially for perception-intensive scenarios (Zhao et al., 6 Feb 2026).
  • Tuning-free Grounding: TAG demonstrates that, in sufficiently pretrained MLLMs, spatial attention maps can be directly extracted and aggregated to yield state-of-the-art grounding performance without any further updates to weights (Xu et al., 2024).
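To see why LoRA keeps pre-training lightweight, the following sketch counts the parameters an adapter adds to a single weight matrix; it is a generic illustration of the LoRA parameterization, not code from ZonUI-3B:

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by a LoRA adapter W + B @ A,
    with A of shape (r, d_in) and B of shape (d_out, r)."""
    return r * d_in + r * d_out  # equals 2*r*d for a square d x d matrix

d, r = 4096, 8                       # hypothetical hidden size and rank
added = lora_param_count(d, d, r)    # 2 * 8 * 4096 = 65536 trainable params
full = d * d                         # 16,777,216 frozen weights
print(f"LoRA adds {added} params ({added / full:.3%} of the frozen matrix)")
```

At rank 8, the adapter trains well under 1% of each matrix's parameters, which is why the base model can stay entirely frozen.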

4. Training Strategies: Curriculum, Multi-stage, and RL

Supervised fine-tuning remains the predominant strategy, often enhanced with carefully considered curricula and specialization phases:

  • Two-stage Fine-tuning: ZonUI-3B’s protocol consists of an initial cross-platform pre-training, followed by specialization on high-resolution (mainly desktop) data, improving ScreenSpot accuracy from 82.8% to 84.9% and yielding a 3.3 percentage point boost for desktop use cases (Hsieh et al., 30 Jun 2025).
  • Curriculum Schedules: Both POINTS-GUI-G and UI-Ins present “easy” to “hard” examples sequentially within mini-batches, leveraging entropy-derived or perspective-based grading (Zhao et al., 6 Feb 2026, Chen et al., 23 Oct 2025).
  • Reinforcement Learning with Verifiable Rewards: RL objectives are used in POINTS-GUI-G and UI-Ins not for high-level reasoning, but for spatial precision in grounding. Binary point-in-box rewards are directly computable from screenshots, enabling Group Relative Policy Optimization (GRPO) with robust, stable curriculum learning (Zhao et al., 6 Feb 2026, Chen et al., 23 Oct 2025). RL provides an additional uplift following SFT, especially when initialized with diverse, instruction-as-reasoning pathways.
  • Multi-task and Pivot Approaches: To close the gap between coordinate-oriented grounding and action-oriented reasoning, query-inference “pivot” tasks are integrated: the model is trained not just to map instruction→coordinate, but also coordinate→instruction, aligning with downstream reasoning objectives at minimal data cost (<0.1% of the data of OS-Atlas) (Wu et al., 1 Mar 2025).
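A minimal sketch of the verifiable point-in-box reward and a group-relative advantage of the kind GRPO uses is given below; the full algorithm's policy-ratio clipping and KL regularization are omitted, and the function names are assumptions:

```python
def point_in_box_reward(point, box):
    """Binary verifiable reward: 1 if the predicted normalized point
    lands inside the gold bounding box, else 0."""
    x, y = point
    x1, y1, x2, y2 = box
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0

def group_relative_advantages(rewards):
    """GRPO-style advantage: standardize rewards within a group of
    rollouts sampled for the same (screenshot, instruction) pair."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0:
        return [0.0] * n  # all rollouts tied: no learning signal
    return [(r - mean) / std for r in rewards]

gold = (0.40, 0.10, 0.50, 0.20)
preds = [(0.42, 0.17), (0.90, 0.90), (0.45, 0.15), (0.05, 0.05)]
rewards = [point_in_box_reward(p, gold) for p in preds]
advs = group_relative_advantages(rewards)
```

Because the reward is computed directly from the gold box, no learned reward model is needed, which is what makes the signal "verifiable" and stable.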

5. Loss Functions, Objectives, and Multi-Perspective Reasoning

  • Supervised Loss Design: Localization is optimized via cross-entropy on discretized coordinate bins or $\ell_1$/$\ell_2$ regression for coordinates, sometimes with additional losses on region classification (element ID prediction) or bounding-box overlap (IoU-based) (Hsieh et al., 30 Jun 2025, Cheng et al., 2024, Hui et al., 27 Jan 2025).
  • Reverse and Generation Losses: Reverse grounding (region→description), OCR, and summarization tasks are trained jointly via standard sequence-level cross-entropy, sometimes weighted to encourage task diversity (Hui et al., 27 Jan 2025, Cheng et al., 2024).
  • Multi-task and Semantic Integration: Multi-perspective or instruction-as-reasoning frameworks (UI-Ins) compose and select among multiple analytic instruction pathways (appearance, function, spatial, and intent), improving grounding accuracy and robustness to ambiguous user queries (Chen et al., 23 Oct 2025).
  • Loss Aggregation: Combined objectives generally take the form:

$$\mathcal{L}_{\text{total}} = \lambda_{\text{loc}} \mathcal{L}_{\text{loc}} + \lambda_{\text{gen}} \mathcal{L}_{\text{gen}} + \ldots$$

where the task weights $\lambda$ are tuned for empirical balance or left equal in multi-task scenarios.
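The weighted aggregation above can be sketched in a few lines; this is a generic illustration, and the weight values are placeholders rather than published settings:

```python
def total_loss(losses, weights=None):
    """Weighted sum of per-task losses; equal weights when none are given."""
    if weights is None:
        weights = {k: 1.0 for k in losses}
    return sum(weights[k] * v for k, v in losses.items())

# Hypothetical per-task loss values and weights:
L = total_loss({"loc": 0.8, "gen": 1.2}, weights={"loc": 1.0, "gen": 0.5})
# L = 1.0*0.8 + 0.5*1.2 = 1.4
```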

6. Benchmarks, Quantitative Results, and Ablation Insights

The impact of GUI grounding pre-training is established via a suite of new and established benchmarks. Common metrics are click accuracy (point-in-box), bounding box IoU, and agentic task completion rate:

  • ScreenSpot and ScreenSpot-Pro: Serve as realistic, multi-platform element localization evaluations. ZonUI-3B attains 84.9% (ScreenSpot) and 86.4% (ScreenSpot-v2) (Hsieh et al., 30 Jun 2025). UI-Ins-32B achieves 57.0% on the more challenging ScreenSpot-Pro (Chen et al., 23 Oct 2025).
  • WinSpot: For Windows-specific GUI grounding, pre-training boosts Phi3-Vision from 5.9% to 56.2% click accuracy (full fine-tuning) (Hui et al., 27 Jan 2025).
  • Empirical Effects of Training Choices: Ablations in ZonUI-3B and SeeClick quantify the roles of data diversity, LoRA adaptation, and curriculum. Balanced platform sampling gives +1.3% for desktop localization; LoRA fine-tuning of the vision encoder provides +3% (Cheng et al., 2024, Hsieh et al., 30 Jun 2025).
  • Correlation with Downstream Automation: Improvements in grounding map nearly linearly (Pearson's $r \approx 0.9$) to task performance in agentic benchmarks (MiniWob, Mind2Web, AITW) (Cheng et al., 2024).
  • Tuning-free Baselines: TAG outperforms fine-tuned SeeClick on text localization and rivals supervised agents on objective benchmarks, revealing that substantial grounding capacity is latent in large pre-trained MLLMs with robust OCR pre-training (Xu et al., 2024).
| Model/Method | ScreenSpot (%) | ScreenSpot-Pro (%) | WinSpot (%) | Notable Features |
|---|---|---|---|---|
| ZonUI-3B (Qwen-GUI-3B) | 84.9 | --- | --- | 2-stage SFT, LoRA, cross-platform corpus |
| UI-Ins-32B | --- | 57.0 | --- | Multi-perspective SFT+RL, dynamic reasoning |
| WinClick (Full FT) | --- | --- | 56.2 | Bidirectional grounding, GPT-based alignments |
| SeeClick | 53.4 | --- | 15.7 | Cross-attention fusion, massive data pipeline |
| POINTS-GUI-G | 95.7 (ScreenSpot-v2) | 59.9 | --- | RL+curriculum, synthetic overlays |
| TAG (tuning-free) | 54.8 | --- | --- | Attention map aggregation, no fine-tuning |
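The point-in-box and IoU metrics used by these benchmarks can be sketched as follows; this is a generic illustration of the standard definitions, not official benchmark code:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def click_accuracy(preds, gold_boxes):
    """Fraction of predicted points that fall inside their gold box."""
    hits = sum(
        1 for (x, y), (x1, y1, x2, y2) in zip(preds, gold_boxes)
        if x1 <= x <= x2 and y1 <= y <= y2
    )
    return hits / len(preds)

acc = click_accuracy([(0.45, 0.15), (0.9, 0.9)],
                     [(0.4, 0.1, 0.5, 0.2), (0.4, 0.1, 0.5, 0.2)])
# acc == 0.5: one of two predicted points lands in its gold box
```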

7. Emerging Directions and Open Challenges

  • Fine-grained and Icon Localization: While text localization is now highly accurate (e.g. TAG: 88.3% on mobile text (Xu et al., 2024)), icon and widget localization remain challenging for both fine-tuned and tuning-free approaches.
  • Economy and Scalability: Methods that maximize learning from small, high-quality datasets (e.g. query inference, instruction diversity) have demonstrated near-equivalence to much larger models trained on orders-of-magnitude more data (Wu et al., 1 Mar 2025, Hsieh et al., 30 Jun 2025, Chen et al., 23 Oct 2025).
  • Unified Multimodal Agentic Reasoning: Alignment between grounding (coordinate-based) and reasoning (action/policy-based) output spaces remains a frontier. Query-pivot and instruction-as-reasoning paradigms explicitly tackle this gap and provide measurable gains with minimal annotation (Wu et al., 1 Mar 2025, Chen et al., 23 Oct 2025).
  • Beyond Supervised Pre-training: TAG and related attention-driven techniques may usher in a regime where much grounding ability can be exploited, and even pseudo-labeled, using only generic MLLMs and prompt engineering, suggesting hybrid self-supervised schemes for future pre-training (Xu et al., 2024).
  • Instruction Quality and Diversity: Empirical analysis reveals a high flaw rate in open-source grounding instructions (23.3%), and that diversity (generating appearance, function, spatial, and intent paraphrases) is critical—yielding up to 76% relative gain when the best instruction pathway is chosen per instance (Chen et al., 23 Oct 2025).

A plausible implication is that efficient and portable GUI agents will increasingly rely on lightweight, platform-agnostic grounding pre-training, with future research focused not solely on model scaling, but on principled data selection, unsupervised region-word alignment, and mechanisms for semantic disambiguation in ambiguous or novel GUI contexts.
