Decoupled Atomic Training for GUI Agents
- Decoupled Atomic Training is a reinforcement learning framework that decomposes training into four independent modules for asynchronous, non-blocking GUI task handling.
- Its architecture leverages an environment cluster, rollout service, data manager, and trainer to maximize hardware use and double throughput compared to traditional methods.
- Empirical results on the OSWorld benchmark show significant gains in task success and GPU utilization, demonstrating the framework’s adaptive, efficient design.
Decoupled Atomic Training (DART) is a reinforcement learning (RL) framework for vision-language model (VLM) based graphical user interface (GUI) agents, characterized by a highly modular asynchronous architecture and adaptive data curation to address the challenges of long-horizon, sparse-reward GUI tasks. DART decomposes the RL pipeline into four independently operating modules, enabling non-blocking communication and significantly improved resource utilization. This strategy addresses inefficiencies endemic to conventional, tightly coupled RL pipelines and underpins state-of-the-art results on the OSWorld benchmark (Li et al., 28 Sep 2025).
1. Motivation and Core Principles
Conventional RL pipelines for GUI agents involve rollout generation, data buffering, and model updates proceeding in strict sequence. In such a coupled system, GPU resources and environments frequently remain idle, either awaiting environment responses or model updates. This inefficiency is exacerbated by the intrinsically long-horizon and delayed-feedback nature of GUI tasks, which commonly require tens of interaction steps and yield sparse rewards.
Decoupled Atomic Training breaks this paradigm by refactoring the RL system into four atomic, asynchronously communicating modules:
- Environment Cluster: Hundreds of real or simulated desktops, each running a separate GUI task instance.
- Rollout Service: A GPU-backed model-inference pool supporting asynchronous inference requests.
- Data Manager: A centralized, lightweight MySQL-backed trajectory store, also responsible for adaptive data curation logic.
- Trainer: An independently scheduled step-wise GRPO (Group Relative Policy Optimization) trainer on its own GPU cluster.
All inter-module communication proceeds via non-blocking message passing. This architecture ensures that no single component idles awaiting the completion of another's task, maximizing throughput and hardware utilization.
2. Asynchronous System Architecture
The DART system design is best summarized by its atomic message-passing schedule: the environment cluster streams task observations to the rollout service, the rollout service returns actions and forwards completed trajectories to the data manager, and the trainer pulls curated batches from the data manager while pushing updated weights back to individual rollout workers. No stage blocks on the completion of any other.
Each module proceeds independently, with only per-worker model synchronization during weight updates, avoiding global "stop-the-world" stalls.
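This schedule can be sketched in plain Python with threads and queues. The sketch below is illustrative only: the module internals, message payloads, and the `NUM_TASKS` constant are assumptions, not the paper's implementation.

```python
import queue
import threading

NUM_TASKS = 4  # illustrative; the paper uses hundreds of desktop environments

obs_q = queue.Queue()    # environment cluster -> rollout service
traj_q = queue.Queue()   # rollout service    -> data manager
batch_q = queue.Queue()  # data manager       -> trainer

results = []             # trainer-side record of processed batches

def environment_cluster():
    # Each task instance emits an observation independently.
    for task_id in range(NUM_TASKS):
        obs_q.put({"task": task_id, "obs": f"screen_{task_id}"})
    obs_q.put(None)  # sentinel: no more observations

def rollout_service():
    # Inference pool: turn observations into trajectories without
    # waiting on the trainer.
    while (msg := obs_q.get()) is not None:
        traj_q.put({"task": msg["task"], "steps": 3, "reward": 1.0})
    traj_q.put(None)

def data_manager():
    # Trajectory store; curation is a pass-through in this sketch.
    while (msg := traj_q.get()) is not None:
        batch_q.put(msg)
    batch_q.put(None)

def trainer():
    # Step-wise updates from whatever batches have arrived so far.
    while (msg := batch_q.get()) is not None:
        results.append(msg["task"])

threads = [threading.Thread(target=f) for f in
           (environment_cluster, rollout_service, data_manager, trainer)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # every task flowed through all four modules
```

Because every hand-off is a queue operation, each module runs at its own pace; a real deployment would replace the in-process queues with network RPCs and a MySQL-backed store, as the architecture description above indicates.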
3. Mathematical Framework: Rollout Sampling and Adaptive Curation
3.1 Dynamic Rollout and Trajectory Length
Rollout allocation for each task is dynamically determined by its recent success rate, so that the rollout budget concentrates on tasks the policy has not yet mastered. The maximum trajectory length per task is capped at the length of the longest successful trajectory observed for that task.
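The paper's exact allocation formulas are not reproduced here; the following is a minimal sketch assuming a simple inverse-success-rate interpolation between hypothetical per-task budgets `n_min` and `n_max`, with the length cap taken directly from the longest successful trajectory.

```python
# Illustrative sketch of dynamic rollout allocation and length capping.
# n_min, n_max, and default_cap are assumed hyperparameters, not values
# from the paper.

def rollouts_for_task(success_rate, n_min=4, n_max=16):
    """Allocate more rollouts to tasks the policy currently fails on."""
    assert 0.0 <= success_rate <= 1.0
    return round(n_min + (1.0 - success_rate) * (n_max - n_min))

def max_length_for_task(successful_lengths, default_cap=100):
    """Cap trajectory length at the longest successful trajectory seen."""
    return max(successful_lengths) if successful_lengths else default_cap

print(rollouts_for_task(0.9))            # easy task: close to n_min
print(rollouts_for_task(0.1))            # hard task: close to n_max
print(max_length_for_task([12, 30, 22])) # cap follows the longest success
```

The design intuition is that saturated tasks yield little gradient signal, so their rollout budget is better spent on tasks with low but non-zero success rates.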
3.2 Multi-level Adaptive Data Curation
- Experience Pool Supplementation: If all recent online rollouts for a task achieve zero reward, a successful trajectory is drawn from a pre-collected experience pool, ensuring at least one positive example per batch.
- High-Entropy Step Selection: For each step, the entropy of the policy's action distribution is computed, and only the top 80% highest-entropy steps participate in the GRPO update.
- Truncated Importance Sampling: The importance ratio between the current training policy and the rollout policy is truncated from above. This accounts for the mismatch between rollout and training policies and bounds off-policy variance.
The step-wise GRPO objective averages the truncated, importance-weighted advantages over each rollout group:

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|\tau_i|}\sum_{t=1}^{|\tau_i|} \hat{\rho}_{i,t}\,\hat{A}_i\right],$$

where $\hat{\rho}_{i,t}$ is the truncated importance ratio at step $t$ of trajectory $\tau_i$, and the group-normalized advantage is obtained from the rewards as

$$\hat{A}_i = \frac{R_i - \operatorname{mean}\left(\{R_j\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{R_j\}_{j=1}^{G}\right)}.$$
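The curation ingredients above can be sketched in a few small functions. The 80% keep fraction matches the text; the clip bound `c` and the normalization epsilon are illustrative assumptions.

```python
import math

def policy_entropy(probs):
    """Shannon entropy of an action distribution at one step."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def top_entropy_steps(entropies, keep_frac=0.8):
    """Indices of the highest-entropy steps that enter the GRPO update."""
    k = max(1, int(len(entropies) * keep_frac))
    order = sorted(range(len(entropies)),
                   key=lambda i: entropies[i], reverse=True)
    return sorted(order[:k])

def truncated_ratio(pi_train, pi_rollout, c=2.0):
    """Importance ratio clipped from above to limit off-policy variance."""
    return min(pi_train / pi_rollout, c)

def group_advantages(rewards):
    """Normalize rewards within a rollout group (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

For instance, a group of rewards `[0.0, 0.0, 1.0, 1.0]` yields positive advantages for the two successful rollouts and negative advantages for the failures, so the update pushes probability mass toward the successful behavior.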
4. System Efficiency Gains
Decoupled Atomic Training demonstrates significant efficiency improvements over coupled baselines. The following metrics are reported on identical hardware (Li et al., 28 Sep 2025):
| Metric | Baseline | DART (Ours) | Improvement |
|---|---|---|---|
| Training Throughput | 22.6 actions/min | 43.6 actions/min | 1.9× |
| Env Utilization | 12.2% | 67.7% | 5.5× |
| GPU Utilization | 29.6% | 46.7% | 1.6× |
These gains arise primarily from rollout-wise environment scheduling, fine-grained trajectory assignment, and per-worker model synchronization that eliminates global stalls.
5. Empirical Results on the OSWorld Benchmark
DART-GUI-7B, initialized from UI-TARS-1.5-7B and trained using the decoupled atomic approach, was evaluated on the 203-task OSWorld-Verified suite with the following results:
- Overall Task Success Rate: 42.13%
- +14.61% absolute over the UI-TARS-1.5-7B baseline (27.52%)
- +7.34% over the prior open-source state-of-the-art (OpenCUA-32B, 34.79%)
- Per-task Examples:
- OS tasks: 31.25% → 62.50% (+31.25%)
- LibreOffice Writer: 39.13% → 60.86% (+21.73%)
- Thunderbird: 40.00% → 60.00% (+20.00%)
An ablation study (Pass@1 on 45 tasks) revealed successive component benefits:
| Configuration | Pass@1 (%) |
|---|---|
| Baseline (decoupled) | 28.67 |
| + Dynamic Rollout (DR) | 50.90 |
| + Dynamic Length (DTL) | 66.11 |
| + High-Entropy (HE) | 68.33 |
| + Distribution Alignment (DA) | 70.55 |
| All components | 72.28 |
These ablations demonstrate that both the decoupled atomic architecture and the adaptive data pipeline are essential for achieving robust performance.
6. Significance and Open-Source Availability
Decoupled Atomic Training, through its atomic module design paired with multi-level adaptive data curation, eliminates cross-module stalls and prioritizes learning from challenging, informative data, while robustly correcting for off-policy distribution mismatch. These advances directly translate to improved sample efficiency, hardware utilization, and benchmark performance in multi-turn GUI RL agent training. The complete framework, data, and model checkpoints are made available at [computer-use-agents.github.io/dart-gui], providing reproducibility and extensibility for the open-source agentic RL community (Li et al., 28 Sep 2025).