Decoupled Atomic Training for GUI Agents
- Decoupled Atomic Training is a reinforcement learning framework that decomposes training into four independent modules for asynchronous, non-blocking GUI task handling.
- Its architecture leverages an environment cluster, rollout service, data manager, and trainer to maximize hardware use and double throughput compared to traditional methods.
- Empirical results on the OSWorld benchmark show significant gains in task success and GPU utilization, demonstrating the framework’s adaptive, efficient design.
Decoupled Atomic Training (DART) is a reinforcement learning (RL) framework for vision-language model (VLM) based graphical user interface (GUI) agents, characterized by a highly modular asynchronous architecture and adaptive data curation to address the challenges of long-horizon, sparse-reward GUI tasks. DART decomposes the RL pipeline into four independently operating modules, enabling non-blocking communication and significantly improved resource utilization. This strategy addresses inefficiencies endemic to conventional, tightly coupled RL pipelines and underpins state-of-the-art results on the OSWorld benchmark (Li et al., 28 Sep 2025).
1. Motivation and Core Principles
Conventional RL pipelines for GUI agents involve rollout generation, data buffering, and model updates proceeding in strict sequence. In such a coupled system, GPU resources and environments frequently remain idle, either awaiting environment responses or model updates. This inefficiency is exacerbated by the intrinsically long-horizon and delayed-feedback nature of GUI tasks, which commonly require tens of interaction steps and yield sparse rewards.
Decoupled Atomic Training breaks this paradigm by refactoring the RL system into four atomic, asynchronously communicating modules:
- Environment Cluster: Hundreds of real or simulated desktops, each running a separate GUI task instance.
- Rollout Service: A GPU-backed model-inference pool supporting asynchronous inference requests.
- Data Manager: A centralized, lightweight MySQL-backed trajectory store, also responsible for adaptive data curation logic.
- Trainer: An independently scheduled step-wise GRPO (Group Relative Policy Optimization) trainer on its own GPU cluster.
All inter-module communication proceeds via non-blocking message passing. This architecture ensures that no single component idles awaiting the completion of another's task, maximizing throughput and hardware utilization.
2. Asynchronous System Architecture
The DART system design is best summarized by its atomic message-passing schedule: the environment cluster streams task observations to the rollout service, the rollout service returns actions and forwards completed trajectories to the data manager, and the trainer pulls curated batches from the data manager while pushing updated weights back to individual rollout workers. No stage blocks on the completion of any other.
Each module proceeds independently, with only per-worker model synchronization during weight updates, avoiding global "stop-the-world" stalls.
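This schedule can be sketched in plain Python with threads and queues. The sketch below is illustrative only: the module internals, message payloads, and the `NUM_TASKS` constant are assumptions, not the paper's implementation.

```python
import queue
import threading

NUM_TASKS = 4  # illustrative; the paper uses hundreds of desktop environments

obs_q = queue.Queue()    # environment cluster -> rollout service
traj_q = queue.Queue()   # rollout service    -> data manager
batch_q = queue.Queue()  # data manager       -> trainer

results = []             # trainer-side record of processed batches

def environment_cluster():
    # Each task instance emits an observation independently.
    for task_id in range(NUM_TASKS):
        obs_q.put({"task": task_id, "obs": f"screen_{task_id}"})
    obs_q.put(None)  # sentinel: no more observations

def rollout_service():
    # Inference pool: turn observations into trajectories without
    # waiting on the trainer.
    while (msg := obs_q.get()) is not None:
        traj_q.put({"task": msg["task"], "steps": 3, "reward": 1.0})
    traj_q.put(None)

def data_manager():
    # Trajectory store; curation is a pass-through in this sketch.
    while (msg := traj_q.get()) is not None:
        batch_q.put(msg)
    batch_q.put(None)

def trainer():
    # Step-wise updates from whatever batches have arrived so far.
    while (msg := batch_q.get()) is not None:
        results.append(msg["task"])

threads = [threading.Thread(target=f) for f in
           (environment_cluster, rollout_service, data_manager, trainer)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # every task flowed through all four modules
```

Because every hand-off is a queue operation, each module runs at its own pace; a real deployment would replace the in-process queues with network RPCs and a MySQL-backed store, as the architecture description above indicates.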
3. Mathematical Framework: Rollout Sampling and Adaptive Curation
3.1 Dynamic Rollout and Trajectory Length
Rollout allocation for each task is dynamically determined by its recent success rate, so that the rollout budget concentrates on tasks the policy has not yet mastered. The maximum trajectory length per task is capped at the length of the longest successful trajectory observed for that task.
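The paper's exact allocation formulas are not reproduced here; the following is a minimal sketch assuming a simple inverse-success-rate interpolation between hypothetical per-task budgets `n_min` and `n_max`, with the length cap taken directly from the longest successful trajectory.

```python
# Illustrative sketch of dynamic rollout allocation and length capping.
# n_min, n_max, and default_cap are assumed hyperparameters, not values
# from the paper.

def rollouts_for_task(success_rate, n_min=4, n_max=16):
    """Allocate more rollouts to tasks the policy currently fails on."""
    assert 0.0 <= success_rate <= 1.0
    return round(n_min + (1.0 - success_rate) * (n_max - n_min))

def max_length_for_task(successful_lengths, default_cap=100):
    """Cap trajectory length at the longest successful trajectory seen."""
    return max(successful_lengths) if successful_lengths else default_cap

print(rollouts_for_task(0.9))            # easy task: close to n_min
print(rollouts_for_task(0.1))            # hard task: close to n_max
print(max_length_for_task([12, 30, 22])) # cap follows the longest success
```

The design intuition is that saturated tasks yield little gradient signal, so their rollout budget is better spent on tasks with low but non-zero success rates.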
3.2 Multi-level Adaptive Data Curation
- Experience Pool Supplementation: If all recent online rollouts for a task achieve zero reward, a successful trajectory is drawn from a pre-collected experience pool, ensuring at least one positive example per batch.
- High-Entropy Step Selection: For each step, the entropy of the policy's action distribution is computed, and only the top 80% highest-entropy steps participate in the GRPO update.
- Truncated Importance Sampling: The importance ratio between the current training policy and the rollout policy is truncated from above. This accounts for the mismatch between rollout and training policies and bounds off-policy variance.
The step-wise GRPO objective averages the truncated, importance-weighted advantages over each rollout group:

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|\tau_i|}\sum_{t=1}^{|\tau_i|} \hat{\rho}_{i,t}\,\hat{A}_i\right],$$

where $\hat{\rho}_{i,t}$ is the truncated importance ratio at step $t$ of trajectory $\tau_i$, and the group-normalized advantage is obtained from the rewards as

$$\hat{A}_i = \frac{R_i - \operatorname{mean}\left(\{R_j\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{R_j\}_{j=1}^{G}\right)}.$$
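The curation ingredients above can be sketched in a few small functions. The 80% keep fraction matches the text; the clip bound `c` and the normalization epsilon are illustrative assumptions.

```python
import math

def policy_entropy(probs):
    """Shannon entropy of an action distribution at one step."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def top_entropy_steps(entropies, keep_frac=0.8):
    """Indices of the highest-entropy steps that enter the GRPO update."""
    k = max(1, int(len(entropies) * keep_frac))
    order = sorted(range(len(entropies)),
                   key=lambda i: entropies[i], reverse=True)
    return sorted(order[:k])

def truncated_ratio(pi_train, pi_rollout, c=2.0):
    """Importance ratio clipped from above to limit off-policy variance."""
    return min(pi_train / pi_rollout, c)

def group_advantages(rewards):
    """Normalize rewards within a rollout group (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

For instance, a group of rewards `[0.0, 0.0, 1.0, 1.0]` yields positive advantages for the two successful rollouts and negative advantages for the failures, so the update pushes probability mass toward the successful behavior.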
4. System Efficiency Gains
Decoupled Atomic Training demonstrates significant efficiency improvements over coupled baselines. The following metrics are reported on identical hardware (Li et al., 28 Sep 2025):
| Metric | Baseline | DART (Ours) | Improvement |
|---|---|---|---|
| Training Throughput | 22.6 actions/min | 43.6 actions/min | 1.9× |
| Env Utilization | 12.2% | 67.7% | 5.5× |
| GPU Utilization | 29.6% | 46.7% | 1.6× |
These gains arise primarily from rollout-wise environment scheduling, fine-grained trajectory assignment, and per-worker model synchronization that eliminates global stalls.
5. Empirical Results on the OSWorld Benchmark
DART-GUI-7B, initialized from UI-TARS-1.5-7B and trained using the decoupled atomic approach, was evaluated on the 203-task OSWorld-Verified suite with the following results:
- Overall Task Success Rate: 42.13%
- +14.61% absolute over the UI-TARS-1.5-7B baseline (27.52%)
- +7.34% over the prior open-source state-of-the-art (OpenCUA-32B, 34.79%)
- Per-task Examples:
- OS tasks: 31.25% → 62.50% (+31.25%)
- LibreOffice Writer: 39.13% → 60.86% (+21.73%)
- Thunderbird: 40.00% → 60.00% (+20.00%)
An ablation study (Pass@1 on 45 tasks) revealed successive component benefits:
| Configuration | Pass@1 (%) |
|---|---|
| Baseline (decoupled) | 28.67 |
| + Dynamic Rollout (DR) | 50.90 |
| + Dynamic Length (DTL) | 66.11 |
| + High-Entropy (HE) | 68.33 |
| + Distribution Alignment (DA) | 70.55 |
| All components | 72.28 |
These ablations demonstrate that both the decoupled atomic architecture and the adaptive data pipeline are essential for achieving robust performance.
6. Significance and Open-Source Availability
Decoupled Atomic Training, through its atomic module design paired with multi-level adaptive data curation, eliminates cross-module stalls and prioritizes learning from challenging, informative data, while robustly correcting for off-policy distribution mismatch. These advances directly translate to improved sample efficiency, hardware utilization, and benchmark performance in multi-turn GUI RL agent training. The complete framework, data, and model checkpoints are made available at [computer-use-agents.github.io/dart-gui], providing reproducibility and extensibility for the open-source agentic RL community (Li et al., 28 Sep 2025).