
AFRAgent: High-Resolution GUI Automation

Updated 7 December 2025
  • AFRAgent is an advanced multimodal agent for mobile GUI automation that leverages adaptive feature renormalization to fuse low- and high-resolution visual cues.
  • It integrates an instruct-BLIP backbone with Vision Encoder, Q-Former, and AFR blocks to efficiently enhance spatial awareness and precise GUI element localization.
  • Empirical results show AFRAgent achieves state-of-the-art performance with 45% fewer parameters and 60% lower GPU memory usage than comparable models.

AFRAgent is a multimodal agent architecture for high-resolution-aware graphical user interface (GUI) automation, introducing an adaptive feature renormalization mechanism for improved spatial understanding and efficiency in real-world mobile interface tasks. AFRAgent leverages an instruct-BLIP backbone along with a novel feature fusion strategy, setting a new state-of-the-art for mobile smartphone automation on leading benchmarks while using substantially fewer parameters and compute than prior approaches (Anand et al., 30 Nov 2025).

1. Architectural Overview

AFRAgent employs an instruct-BLIP-based architecture composed of four main modules: a Vision Encoder (E), a Q-Former (Q), Adaptive Feature Renormalization (AFR) blocks, and an LLM head (L). The pipeline processes an input screenshot S_t as follows:

  1. Patch Embedding: The input screenshot S_t is encoded into low-resolution patch embeddings I_t = [i_1, …, i_N] = E(S_t), with each i_n ∈ ℝ^{d_I}.
  2. Token Concatenation: Learnable query tokens T_q = [t^q_1, …, t^q_M] and instruction/history tokens g(T, H_t) are concatenated into X_Q = [T_q; g(T, H_t)] ∈ ℝ^{(M+L_T)×d_Q}.
  3. Q-Former Processing: X_Q and I_t are processed through Z layers of self- and cross-attention, yielding E^{(Z)} ∈ ℝ^{M×d_Q}.
  4. Low- and High-Resolution AFR:
    • Low-res AFR combines E^{(Z)} and I_t to enrich features.
    • High-res AFR further enriches the result by processing C high-resolution crops of S_t.
  5. LLM Preparation: The final enriched features E_final are projected into the LLM input space and concatenated with g(T, H_t): X_LLM = [E_final; g(T, H_t)].
  6. Action Prediction: The frozen LLM L predicts the next action â_{t+1} via a standard autoregressive cross-entropy loss.

All multimodal fusion is achieved in the Q-Former/AFR pipeline, with the LLM operating in a modality-agnostic, frozen capacity. No cross-attention or image-text fusion occurs inside the LLM itself.
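The six steps above can be sketched end to end. This is a toy NumPy stand-in for the data flow only: the dimensions, random weights, single-layer Q-Former, and naive token alignment in the AFR step are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, L_T, d = 16, 8, 4, 32          # patches, queries, text tokens, width

def vision_encoder(screenshot):       # E: screenshot -> N patch embeddings
    return rng.standard_normal((N, d))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def q_former(x_q, patches):           # Q: one cross-attention layer stand-in
    scores = softmax(x_q @ patches.T / np.sqrt(d))
    return (scores @ patches)[:M]     # keep the M query-token outputs

def afr(f_enrich, f_target, w_a, w_b):  # adaptive feature renormalization
    return (f_enrich @ w_a) * f_target + (f_enrich @ w_b)

I_t = vision_encoder(None)                          # 1. patch embedding
T_q = rng.standard_normal((M, d))                   # learnable queries
g = rng.standard_normal((L_T, d))                   # instruction/history tokens
X_Q = np.concatenate([T_q, g])                      # 2. token concatenation
E_Z = q_former(X_Q, I_t)                            # 3. Q-Former
w_a, w_b = rng.standard_normal((2, d, d)) * 0.1
E_final = afr(I_t[:M], E_Z, w_a, w_b)               # 4. low-res AFR (naive
                                                    #    token alignment here)
X_LLM = np.concatenate([E_final, g])                # 5. LLM input sequence
assert X_LLM.shape == (M + L_T, d)                  # 6. frozen LLM would decode
```

In the real model the LLM at step 6 is frozen, so only the encoder-side modules above carry trainable fusion parameters.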

2. Adaptive Feature Renormalization (AFR)

AFR introduces a token-level affine transformation mechanism for feature fusion that improves context and precision in GUI element localization. Previous fusion strategies (LLaVA-style direct projection, Q-Former-only, Soft Mixture-of-Experts, BLIVA) suffered from limited visual grounding or excessive token/computation costs. AFR is inspired by affine modulation techniques such as AdaIN and SPADE, but conditions the affine parameters on features at both low and high resolutions, producing token-specific scale and shift vectors per input instance.

Given enriching features F_enrich ∈ ℝ^{M×d} and target features F_target, two parameter heads produce a scale α = FFN_α(F_enrich) and a shift β = FFN_β(F_enrich), resulting in

F_enriched = α ⊙ F_target + β.

For each token e_i ∈ ℝ^d, the renormalization is

e_i' = α_i ⊙ e_i + β_i,

with α_i, β_i ∈ ℝ^d. Grad-CAM visualizations indicate that this approach yields sharper, more task-directed attention heatmaps focused on relevant GUI elements.
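A minimal sketch of the AFR operation, assuming the FFN_α and FFN_β heads are reduced to single linear maps (an illustrative simplification); shapes and weights are toy values:

```python
import numpy as np

rng = np.random.default_rng(1)
M, d = 6, 8                           # toy token count and feature width

W_alpha = rng.standard_normal((d, d)) * 0.1   # stand-in for FFN_alpha
W_beta = rng.standard_normal((d, d)) * 0.1    # stand-in for FFN_beta

def afr(f_enrich, f_target):
    alpha = f_enrich @ W_alpha        # token-specific scale, shape (M, d)
    beta = f_enrich @ W_beta          # token-specific shift, shape (M, d)
    return alpha * f_target + beta    # e_i' = alpha_i * e_i + beta_i

f_enrich = rng.standard_normal((M, d))
f_target = rng.standard_normal((M, d))
out = afr(f_enrich, f_target)
assert out.shape == (M, d)
# unlike a single global affine pair, every token i receives its own
# (alpha_i, beta_i) vector in R^d, conditioned on the enriching features
```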

3. High-Resolution Feature Fusion

AFRAgent employs a two-stage enrichment pipeline:

3.1 Low-Resolution Enrichment

Patch embeddings I_t and the Q-Former output E^{(Z)} are fused using AFR:

α_img, β_img = FFN_α(I_t), FFN_β(I_t),

E_img = α_img ⊙ E^{(Z)} + β_img.

3.2 High-Resolution Enrichment

  • S_t is upsampled and split into C crops {s^{(c)}}, each encoded into a patch sequence I^{(c)}.
  • New learnable queries T̃_q are introduced, and [T̃_q; g(T, H_t)] together with all I^{(c)} are fed through the Q-Former.
  • High-resolution AFR parameters are then computed:

Ẽ^{(Z)} = Q([T̃_q; g(T, H_t)], {I^{(c)}}),

α_high, β_high = FFN_α(Ẽ^{(Z)}), FFN_β(Ẽ^{(Z)}),

E_high = α_high ⊙ E_img + β_high.

This hierarchical fusion allows AFRAgent to incorporate both global and local high-resolution visual cues, addressing the challenge of poor widget localization in prior VLM-based GUI agents.
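The uniform crop-generation step can be sketched as follows; the 2× nearest-neighbour upsampling and the 2×2 grid are illustrative assumptions rather than the paper's stated settings:

```python
import numpy as np

def split_into_crops(image, grid=(2, 2)):
    """Upsample a screenshot, then tile it into C = grid[0] * grid[1] crops."""
    up = np.kron(image, np.ones((2, 2)))           # naive 2x upsample
    rows = np.array_split(up, grid[0], axis=0)
    return [c for row in rows for c in np.array_split(row, grid[1], axis=1)]

screen = np.arange(16.0).reshape(4, 4)             # toy 4x4 "screenshot"
crops = split_into_crops(screen)                   # C = 4 crops
assert len(crops) == 4 and all(c.shape == (4, 4) for c in crops)
```

Each crop would then be encoded into its patch sequence I^{(c)} before entering the high-res Q-Former pass.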

4. Computational Efficiency and Model Size

AFRAgent comprises 4.03 billion parameters, making it significantly smaller than contemporary architectures such as CogAgent (18.3B), CoCo-Agent (7.3B), and SphAgent (7B), and comparable to the InstructBLIP baseline (4B). Compute and GPU requirements are summarized below:

| Model | Params (B) | TFLOPS | Inference Time (s) | Relative GPU Memory |
|---|---|---|---|---|
| CogAgent | 18.3 | 11.86 | 3.42 | 1.0× |
| CoCo-Agent | 7.3 | – | – | – |
| SphAgent | 7.0 | – | – | – |
| InstructBLIP | 4.0 | 3.19 | 0.63 | – |
| AFRAgent (Low-Res) | 4.0 | 3.20 | 0.78 | ~0.4× |
| AFRAgent (High-Res) | 4.0 | 5.47 | 1.24 | ~0.4× |

AFRAgent achieves approximately 60% lower GPU memory usage per inference step compared to CogAgent. High-resolution AFR increases FLOPS by approximately 70% relative to low-res AFR, but remains substantially more efficient than approaches with full-decoder cross-attention such as CogAgent.
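The quoted ratios follow directly from the table; a quick check:

```python
# TFLOPS figures taken from the efficiency table above
low_res, high_res, cogagent = 3.20, 5.47, 11.86

flops_increase = high_res / low_res - 1        # ~0.71 -> the "~70%" figure
vs_cogagent = high_res / cogagent              # ~0.46 of CogAgent's FLOPS

print(f"high-res vs low-res AFR: +{flops_increase:.0%} FLOPS")
print(f"high-res AFRAgent: {vs_cogagent:.0%} of CogAgent's FLOPS")
```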

5. Empirical Evaluation and Ablations

5.1 Meta-GUI Benchmark

AFRAgent establishes a new state-of-the-art on the Meta-GUI benchmark (1k episodes, 18k steps):

| Method | Params (B) | Action Acc. (Item / Dir.) | Input Text F1 | CR (%) |
|---|---|---|---|---|
| LayoutLM | 0.34 | 82.22 / 71.98 | 50.43 | 67.76 |
| BERT | 0.34 | 87.52 / 82.84 | 62.19 | 78.42 |
| CoCo-Agent | 7.3 | 92.59 / 91.72 | 65.90 | 88.27 |
| AFRAgent | 4.0 | 93.28 / 95.06 | 67.60 | 90.83 |

AFRAgent achieves a +2.56 percentage point gain in completion rate and a +3.34 percentage point gain in direction accuracy over CoCo-Agent, while using 45% fewer parameters.

5.2 AITW Benchmark

Results on AITW in the structured layout setting:

| Method | Params (B) | General | Install | GoogleApps | Single | WebShop | Overall |
|---|---|---|---|---|---|---|---|
| CoCo-Agent | 7.3 | 70.96 | 81.46 | 76.45 | 91.41 | 75.00 | 79.05 |
| AFRAgent | 4.0 | 71.62 | 80.81 | 76.26 | 90.78 | 75.10 | 78.92 |

And in the pure multimodal setting:

| Method | Params (B) | General | Install | GoogleApps | Single | WebShop | Overall |
|---|---|---|---|---|---|---|---|
| Auto-GUI | 4.5 | 68.24 | 76.89 | 71.37 | 84.58 | 70.26 | 74.27 |
| LLaVA | 7.3 | 58.93 | 72.41 | 70.81 | 83.73 | 65.98 | 70.37 |
| InstructBLIP | 4.0 | 70.66 | 79.59 | 73.05 | 84.99 | 72.26 | 76.11 |
| CogAgent | 18.3 | 65.38 | 78.86 | 74.95 | 93.49 | 71.73 | 76.88 |
| AFRAgent | 4.0 | 70.67 | 80.89 | 74.16 | 91.06 | 73.27 | 78.01 |

5.3 Fusion Strategies and Ablation Analysis

Ablation studies on AITW (General/Single) indicate that AFR outperforms residual and Mixture-of-Experts fusion at comparable or lower computational cost, and that high-res AFR yields a further +0.85 percentage point improvement over low-res at +70% FLOPS.

| Method | FLOPS (T) | Params (B) | Fusion | General | Single | Overall |
|---|---|---|---|---|---|---|
| Residual | 3.20 | 4.03 | add | 69.75 | 85.13 | 77.44 |
| Soft-MoE | 3.54 | 4.03 | MoE | 70.20 | 85.25 | 77.73 |
| highResProj | 17.08 | 8.29 | direct proj | 70.91 | 86.17 | 78.54 |
| AFR (Low-Res) | 3.20 | 4.03 | AFR | 70.15 | 85.37 | 77.76 |
| AFR (High-Res) | 5.47 | 4.03 | AFR | 70.91 | 86.30 | 78.61 |
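Recomputing the quoted deltas from the ablation numbers:

```python
# Overall accuracy and FLOPS taken from the ablation table above
overall = {"Residual": 77.44, "Soft-MoE": 77.73, "highResProj": 78.54,
           "AFR Low-Res": 77.76, "AFR High-Res": 78.61}
flops = {"highResProj": 17.08, "AFR Low-Res": 3.20, "AFR High-Res": 5.47}

gain = overall["AFR High-Res"] - overall["AFR Low-Res"]          # +0.85 pp
extra_flops = flops["AFR High-Res"] / flops["AFR Low-Res"] - 1   # ~ +70%
# high-res AFR reaches highResProj-level accuracy at roughly a third
# of highResProj's FLOPS and half its parameters
flops_ratio = flops["AFR High-Res"] / flops["highResProj"]       # ~0.32
```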

Grad-CAM overlays highlight AFR’s superior spatial focus on interactable GUI elements.

6. Limitations and Future Directions

AFR high-resolution processing incurs approximately 70% more computational cost than the low-res variant; dynamic or content-aware crop selection could ameliorate this. Crop generation is currently uniform; learnable strategies might further improve the accuracy and computation trade-off. Measurements of on-device latency are lacking and require future empirical profiling. The model’s reliance on InstructBLIP pretraining limits its ability to fully exploit GUI-specific grounding; end-to-end GUI pretraining is noted as a key avenue for improvement. Extending AFR to integrate heterogeneous modalities, such as OCR or accessibility tokens, presents an additional research opportunity for enhanced robustness (Anand et al., 30 Nov 2025).

7. Context and Relevance in GUI Automation Research

The development of AFRAgent responds to the need for robust, device-independent, visually grounded mobile GUI automation. Its design addresses the spatial awareness limitations inherent to low-resolution patch features and the prohibitive compute burden of prior VLM-based agents. Through adaptive token-level feature renormalization, AFRAgent achieves stronger visual-textual grounding and efficiency, setting a new baseline on multiple established benchmarks. This architecture provides a blueprint for subsequent work in efficient, high-resolution multimodal agents for real-world human-computer interaction contexts, and marks a convergence of scalable vision-language modeling and advanced feature modulation (Anand et al., 30 Nov 2025).
