AFRAgent: High-Resolution GUI Automation
- AFRAgent is an advanced multimodal agent for mobile GUI automation that leverages adaptive feature renormalization to fuse low- and high-resolution visual cues.
- It integrates an instruct-BLIP backbone with Vision Encoder, Q-Former, and AFR blocks to efficiently enhance spatial awareness and precise GUI element localization.
- Empirical results show AFRAgent achieves state-of-the-art performance with 45% fewer parameters and 60% lower GPU memory usage than comparable models.
AFRAgent is a multimodal agent architecture for high-resolution-aware graphical user interface (GUI) automation, introducing an adaptive feature renormalization mechanism for improved spatial understanding and efficiency in real-world mobile interface tasks. AFRAgent leverages an instruct-BLIP backbone along with a novel feature fusion strategy, setting a new state of the art for smartphone GUI automation on leading benchmarks while using substantially fewer parameters and compute than prior approaches (Anand et al., 30 Nov 2025).
1. Architectural Overview
AFRAgent employs an instruct-BLIP-based architecture comprising four main modules: a Vision Encoder (E), a Q-Former (Q), Adaptive Feature Renormalization (AFR) blocks, and an LLM head (L). The pipeline processes an input screenshot as follows:
- Patch Embedding: The input screenshot is encoded by the Vision Encoder into low-resolution patch embeddings $V = \{v_1, \dots, v_N\}$, with each $v_i \in \mathbb{R}^{d}$.
- Token Concatenation: Learnable query tokens $Q$, instruction tokens $T$, and history tokens $H$ are combined into $X = [Q; T; H]$.
- Q-Former Processing: $X$ and $V$ are processed via layers of self- and cross-attention, yielding query features $Z$.
- Low- and High-Resolution AFR:
  - Low-res AFR combines $V$ and $Z$ to enrich the query features.
  - High-res AFR further enriches them by processing high-resolution crops of the input screenshot.
- LLM Preparation: The final enriched features are projected into the LLM input space and concatenated with the instruction and history token embeddings to form the LLM prompt.
- Action Prediction: The frozen LLM predicts the next action via a standard autoregressive cross-entropy loss.
All multimodal fusion is achieved in the Q-Former/AFR pipeline, with the LLM operating in a modality-agnostic, frozen capacity. No cross-attention or image-text fusion occurs inside the LLM itself.
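The pipeline above can be sketched end-to-end with crude numpy stand-ins. Random projections replace the trained Vision Encoder, Q-Former, AFR heads, and LLM projection; all shapes, dimensions, and the pooled conditioning in `afr` are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32                        # feature dim (illustrative)
N = 64                        # number of image patches
n_q, n_t, n_h = 8, 16, 16     # query / instruction / history tokens

# 1. Patch embedding: screenshot -> low-res patch features V (Vision Encoder stand-in)
V = rng.standard_normal((N, d))

# 2. Token concatenation: learnable queries, instruction and history tokens
X = np.concatenate([rng.standard_normal((n, d)) for n in (n_q, n_t, n_h)])

# 3. Q-Former stand-in: a single cross-attention of X over V
def cross_attention(q, kv):
    logits = q @ kv.T / np.sqrt(q.shape[1])
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ kv

Z = cross_attention(X, V)

# 4. AFR stand-in: affine renormalization of Z conditioned on (pooled) V
def afr(target, enrich, Wg, Wb, eps=1e-5):
    cond = enrich.mean(axis=0)            # pooled enriching features (simplification)
    gamma, beta = cond @ Wg, cond @ Wb    # conditioned scale and shift
    mu = target.mean(axis=1, keepdims=True)
    sd = target.std(axis=1, keepdims=True)
    return gamma * (target - mu) / (sd + eps) + beta

Wg, Wb = rng.standard_normal((d, d)) * 0.1, rng.standard_normal((d, d)) * 0.1
Z_low = afr(Z, V, Wg, Wb)

# 5. Project enriched features to LLM space, prepend to text tokens; a frozen
#    LLM would then autoregressively decode the next action from this sequence
W_proj = rng.standard_normal((d, d))
llm_input = np.concatenate([Z_low @ W_proj, X[n_q:]])
print(llm_input.shape)
```

Note that fusion happens entirely before the LLM, matching the modality-agnostic, frozen role the LLM plays in the architecture.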
2. Adaptive Feature Renormalization (AFR)
AFR introduces a token-level affine transformation mechanism for feature fusion that improves context and precision in GUI element localization. Previous fusion strategies (LLaVA-style direct projection, Q-Former-only, Soft Mixture-of-Experts, BLIVA) suffered from limited visual grounding or excessive token/computation costs. AFR is inspired by affine modulation techniques such as AdaIN and SPADE, but conditions the affine parameters on features at both low and high resolutions, producing token-specific scale and shift vectors per input instance.
Given enriching features $F_e$ and target features $F_t$, two parameter heads produce scale $\gamma = f_{\gamma}(F_e)$ and shift $\beta = f_{\beta}(F_e)$. For each token $i$, the renormalization is

$$\hat{f}_{t,i} = \gamma_i \odot \mathrm{Norm}(f_{t,i}) + \beta_i,$$

with $\gamma_i, \beta_i \in \mathbb{R}^{d}$ and $\odot$ denoting element-wise multiplication. Grad-CAM visualizations indicate that this approach yields sharper, more task-directed attention heatmaps focused on relevant GUI elements.
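A minimal numpy sketch of the token-level affine renormalization, assuming linear parameter heads, per-token feature normalization, and token-aligned enriching features (all simplifications of the actual AFR blocks):

```python
import numpy as np

def afr(target, enrich, Wg, Wb, eps=1e-5):
    """Adaptive Feature Renormalization (sketch).

    target: (T, d) features to renormalize
    enrich: (T, d) enriching features, assumed token-aligned with `target`
    Wg, Wb: (d, d) linear parameter heads producing per-token scale/shift
    """
    gamma = enrich @ Wg                          # token-specific scale  (T, d)
    beta = enrich @ Wb                           # token-specific shift  (T, d)
    mu = target.mean(axis=-1, keepdims=True)
    sd = target.std(axis=-1, keepdims=True)
    normed = (target - mu) / (sd + eps)          # per-token normalization
    return gamma * normed + beta                 # AdaIN/SPADE-style modulation

rng = np.random.default_rng(1)
T, d = 8, 16
t, e = rng.standard_normal((T, d)), rng.standard_normal((T, d))
out = afr(t, e, np.eye(d), np.zeros((d, d)))
print(out.shape)  # (8, 16)
```

Unlike AdaIN, which uses one scale/shift per channel for a whole image, the parameters here vary per token, which is what lets the modulation sharpen attention on individual GUI elements.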
3. High-Resolution Feature Fusion
AFRAgent employs a two-stage enrichment pipeline:
3.1 Low-Resolution Enrichment
Patch embeddings $V$ and the Q-Former output $Z$ are fused using AFR, producing low-resolution-enriched query features $Z_{\mathrm{low}} = \mathrm{AFR}(Z;\, V)$.
3.2 High-Resolution Enrichment
- The input screenshot is upsampled and split into crops $\{C_k\}$, each encoded into its own patch sequence $V_k$.
- New learnable queries $Q_{\mathrm{hr}}$ are introduced and, along with all $V_k$, are fed through the Q-Former, yielding high-resolution query features $Z_{\mathrm{hr}}$.
- High-resolution AFR parameters (scale and shift) are computed from $Z_{\mathrm{hr}}$ and applied to the low-resolution-enriched features from the previous stage.
This hierarchical fusion allows AFRAgent to incorporate both global and local high-resolution visual cues, addressing the challenge of poor widget localization in prior VLM-based GUI agents.
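The crop-and-encode step of the high-resolution stage can be illustrated as follows; the 2x upsampling factor, 2x2 crop grid, 4x4 patching, and mean-pooled "attention" are all illustrative assumptions standing in for the real encoder and Q-Former:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_q = 16, 8
W_enc = rng.standard_normal((16, d)) * 0.1   # shared patch projection (encoder stand-in)

def encode(crop):
    # split a 32x32 crop into 4x4 patches and project each patch to d dims
    H, W = crop.shape
    patches = crop.reshape(H // 4, 4, W // 4, 4).transpose(0, 2, 1, 3).reshape(-1, 16)
    return patches @ W_enc

# Upsample the screenshot 2x and split it into a 2x2 grid of crops C_k
img = rng.standard_normal((32, 32))
up = np.kron(img, np.ones((2, 2)))           # nearest-neighbour upsampling -> 64x64
crops = [up[i:i + 32, j:j + 32] for i in (0, 32) for j in (0, 32)]

# Each crop is encoded into its own patch sequence V_k
V_k = [encode(c) for c in crops]

# New learnable high-res queries attend over all crop features
# (mean-pooling is a crude stand-in for Q-Former cross-attention)
Q_hr = rng.standard_normal((n_q, d))
Z_hr = Q_hr + np.concatenate(V_k).mean(axis=0)

print(len(crops), Z_hr.shape)
```

The resulting $Z_{\mathrm{hr}}$ then conditions the high-resolution AFR scale and shift, so local crop detail modulates the globally-derived features rather than being appended as extra tokens.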
4. Computational Efficiency and Model Size
AFRAgent comprises 4.03 billion parameters, making it significantly smaller than contemporary architectures such as CogAgent (18.3B), CoCo-Agent (7.3B), and SphAgent (7B), and comparable to the InstructBLIP baseline (4B). The computational and GPU requirements are summarized below:
| Model | Params (B) | TFLOPS | Inference Time (s) | Relative GPU Memory Use |
|---|---|---|---|---|
| CogAgent | 18.3 | 11.86 | 3.42 | 1.0× |
| CoCo-Agent | 7.3 | – | – | – |
| SphAgent | 7.0 | – | – | – |
| InstructBLIP | 4.0 | 3.19 | 0.63 | – |
| AFRAgent (Low-Res) | 4.0 | 3.20 | 0.78 | ~0.4× |
| AFRAgent (High-Res) | 4.0 | 5.47 | 1.24 | ~0.4× |
AFRAgent achieves approximately 60% lower GPU memory usage per inference step compared to CogAgent. High-resolution AFR increases FLOPS by approximately 70% relative to low-res AFR, but remains substantially more efficient than approaches with full-decoder cross-attention such as CogAgent.
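The relative-cost claims follow directly from the table's TFLOPS figures, as a quick check shows:

```python
# Figures taken from the table above
low_flops, high_flops = 3.20, 5.47     # AFRAgent low-/high-res TFLOPS
cog_flops = 11.86                      # CogAgent TFLOPS

extra = (high_flops / low_flops - 1) * 100   # overhead of high-res AFR
ratio = high_flops / cog_flops               # fraction of CogAgent's compute
print(f"high-res AFR adds ~{extra:.0f}% FLOPS; {ratio:.2f}x CogAgent's FLOPS")
```

So even the high-resolution variant runs at under half of CogAgent's per-inference compute.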
5. Empirical Evaluation and Ablations
5.1 Meta-GUI Benchmark
AFRAgent establishes a new state-of-the-art on the Meta-GUI benchmark (1k episodes, 18k steps):
| Method | Params (B) | Action Acc. (Item/Dir.) | Input Text F1 | CR (%) |
|---|---|---|---|---|
| LayoutLM | 0.34 | 82.22 / 71.98 | 50.43 | 67.76 |
| BERT | 0.34 | 87.52 / 82.84 | 62.19 | 78.42 |
| CoCo-Agent | 7.3 | 92.59 / 91.72 | 65.90 | 88.27 |
| AFRAgent | 4.0 | 93.28 / 95.06 | 67.60 | 90.83 |
AFRAgent improves over CoCo-Agent by +2.56 percentage points in completion rate, +0.69 in item accuracy, and +3.34 in direction accuracy, while using 45% fewer parameters.
5.2 AITW Benchmark
Results are reported in two settings. Structured-layout setting:
| Method | Params (B) | General | Install | GoogleApps | Single | WebShop | Overall |
|---|---|---|---|---|---|---|---|
| CoCo-Agent | 7.3 | 70.96 | 81.46 | 76.45 | 91.41 | 75.00 | 79.05 |
| AFRAgent | 4.0 | 71.62 | 80.81 | 76.26 | 90.78 | 75.10 | 78.92 |
Pure multimodal setting:

| Method | Params (B) | General | Install | GoogleApps | Single | WebShop | Overall |
|---|---|---|---|---|---|---|---|
| Auto-GUI | 4.5 | 68.24 | 76.89 | 71.37 | 84.58 | 70.26 | 74.27 |
| LLaVA | 7.3 | 58.93 | 72.41 | 70.81 | 83.73 | 65.98 | 70.37 |
| InstructBLIP | 4.0 | 70.66 | 79.59 | 73.05 | 84.99 | 72.26 | 76.11 |
| CogAgent | 18.3 | 65.38 | 78.86 | 74.95 | 93.49 | 71.73 | 76.88 |
| AFRAgent | 4.0 | 70.67 | 80.89 | 74.16 | 91.06 | 73.27 | 78.01 |
5.3 Fusion Strategies and Ablation Analysis
Ablation studies on AITW (General/Single) indicate that AFR outperforms residual and Mixture-of-Experts fusion at comparable or lower computational cost, and that high-res AFR yields a further +0.85 percentage point improvement over low-res at +70% FLOPS.
| Method | FLOPS (T) | Params (B) | Fusion | General | Single | Overall |
|---|---|---|---|---|---|---|
| Residual | 3.20 | 4.03 | add | 69.75 | 85.13 | 77.44 |
| Soft-MoE | 3.54 | 4.03 | MoE | 70.20 | 85.25 | 77.73 |
| highResProj | 17.08 | 8.29 | direct proj | 70.91 | 86.17 | 78.54 |
| AFR Low-Res | 3.20 | 4.03 | AFR | 70.15 | 85.37 | 77.76 |
| AFR High-Res | 5.47 | 4.03 | AFR | 70.91 | 86.30 | 78.61 |
Grad-CAM overlays highlight AFR’s superior spatial focus on interactable GUI elements.
6. Limitations and Future Directions
AFR high-resolution processing incurs approximately 70% more computational cost than the low-res variant; dynamic or content-aware crop selection could ameliorate this. Crop generation is currently uniform; learnable strategies might further improve the accuracy and computation trade-off. Measurements of on-device latency are lacking and require future empirical profiling. The model’s reliance on InstructBLIP pretraining limits its ability to fully exploit GUI-specific grounding; end-to-end GUI pretraining is noted as a key avenue for improvement. Extending AFR to integrate heterogeneous modalities, such as OCR or accessibility tokens, presents an additional research opportunity for enhanced robustness (Anand et al., 30 Nov 2025).
7. Context and Relevance in GUI Automation Research
The development of AFRAgent responds to the need for robust, device-independent, visually grounded mobile GUI automation. Its design addresses the spatial awareness limitations inherent to low-resolution patch features and the prohibitive compute burden of prior VLM-based agents. Through adaptive token-level feature renormalization, AFRAgent achieves stronger visual-textual grounding and efficiency, setting a new baseline on multiple established benchmarks. This architecture provides a blueprint for subsequent work in efficient, high-resolution multimodal agents for real-world human-computer interaction contexts, and marks a convergence of scalable vision-language modeling and advanced feature modulation (Anand et al., 30 Nov 2025).