ByteDance UI-TARS System
- ByteDance UI-TARS System is a unified native agent framework that integrates vision, reasoning, and action to autonomously interact with GUIs across desktop, mobile, and web platforms.
- It employs a Mixture-of-Experts Transformer and vision-language model to process screenshots and GUI traces, enabling multi-turn reinforcement learning with explicit System-2 reasoning.
- Iterative data generation through reflective trace bootstrapping, coupled with a hybrid GUI-SDK environment and advanced RL stabilization techniques, drives its state-of-the-art performance and scalability.
ByteDance’s UI-TARS System is an end-to-end native agent framework designed for autonomous interaction with graphical user interfaces (GUIs), unifying visual perception, reasoning, action, and memory within a single policy. The system, comprising UI-TARS and its successor UI-TARS-2, represents a major milestone in scalable, multi-turn reinforcement learning (RL) for GUI agents. It departs from prompt-centric, commercial model wrappers, instead relying on a Mixture-of-Experts Transformer backbone and a vision-LLM that directly processes screenshots and GUI traces. The technical architecture leverages a unified action space, advanced System-2 reasoning, and iterative data generation through reflection and reinforcement, targeting robust performance across desktop, mobile, and web platforms (Wang et al., 2 Sep 2025, Qin et al., 21 Jan 2025).
1. Architectural Foundations and Unified Policy Loop
UI-TARS-2 deploys a native agent formulation built on a Mixture-of-Experts Transformer that incorporates a 532M-parameter vision encoder. The policy π_θ operates over multimodal observations, memory, and task instructions: each timestep t produces an internal reasoning token t_t, an action a_t (GUI or SDK operation), and a new observation o_{t+1}, with joint trajectory probability P(τ) = ∏_{t=1}^{T} π_θ(t_t, a_t | o_{1:t}, m_t, g) · P(o_{t+1} | o_{1:t}, a_t), where m_t denotes the memory state and g the task instruction.
Memory consists of high-fidelity working memory (the last N steps, kept verbatim) and episodic memory (compressed summaries of past episodes), enabling persistent context across long-horizon interactions. This architecture obviates modular action-perception pipelines, supporting unified end-to-end learning for perception, action grounding, and system-level reasoning (Wang et al., 2 Sep 2025).
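The two-tier memory described above can be sketched as a small data structure. The class and field names below are illustrative, and the one-line summary stands in for the model-based compression the real system would apply to evicted context:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Step:
    thought: str       # internal reasoning token
    action: str        # GUI or SDK operation
    observation: str   # e.g. a screenshot reference

@dataclass
class AgentMemory:
    """Two-tier memory: verbatim recent steps plus compressed episode summaries."""
    window: int = 8                               # N most recent steps kept verbatim
    working: deque = field(default_factory=deque)
    episodic: list = field(default_factory=list)  # compressed summaries

    def add(self, step: Step) -> None:
        self.working.append(step)
        if len(self.working) > self.window:
            evicted = self.working.popleft()
            # Placeholder summarization; the real system compresses with a model.
            self.episodic.append(f"{evicted.action} -> {evicted.observation[:40]}")

    def context(self) -> dict:
        """Assemble the persistent context the policy conditions on."""
        return {"episodic": self.episodic, "working": list(self.working)}
```

Evicting the oldest step into a summary keeps prompt length bounded while preserving a trace of the full episode.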
2. Data Flywheel and Iterative Trace Bootstrapping
UI-TARS introduces an iterative data flywheel, comprising continual pre-training (CT) on diverse corpora, supervised fine-tuning (SFT) from high-quality human and LLM-annotated traces, and reinforcement learning (RL) on verifiable tasks. After each training cycle n, new trajectories are validated:
- High-quality samples (those passing verification) enter the SFT dataset for cycle n+1
- Lower-quality samples are appended to the CT corpus
This mechanism ensures that both corpora grow and improve across cycles, enabling accelerated capability gains. UI-TARS further employs reflective online trace bootstrapping by deploying agents on large virtual device pools, multi-stage filtering (heuristics, VLM scoring, human review), and error correction through direct preference optimization (DPO), where both positive and negative action traces are used for fine-tuned RL (Qin et al., 21 Jan 2025, Wang et al., 2 Sep 2025).
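One flywheel routing step can be sketched as follows. A scalar `verify` score in [0, 1] stands in for the multi-stage validation pipeline (heuristics, VLM scoring, human review); the threshold value and function names are illustrative assumptions:

```python
def flywheel_cycle(trajectories, verify, quality_threshold=0.9):
    """Route validated rollouts into the SFT set or the CT corpus.

    `verify` scores a trajectory in [0, 1]; trajectories at or above the
    (illustrative) threshold become supervised fine-tuning data, the rest
    are appended to the continual pre-training corpus.
    """
    sft_additions, ct_additions = [], []
    for traj in trajectories:
        if verify(traj) >= quality_threshold:
            sft_additions.append(traj)   # high-quality -> SFT dataset
        else:
            ct_additions.append(traj)    # lower-quality -> CT corpus
    return sft_additions, ct_additions
```

Because nothing is discarded, every rollout contributes to the next cycle at the fidelity its quality supports.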
3. System-2 Reasoning and Thought-Augmented Planning
Both UI-TARS and UI-TARS-2 embed explicit System-2 reasoning mechanisms to support complex multi-step decision making. Intermediate "thought" tokens are generated to encode reasoning patterns such as task decomposition, milestone recognition, reflection, and trial-and-error. For UI-TARS, the ActRe annotation protocol generates thoughts causally aligned to actions, sampling each thought t_i conditioned on the observed context and its annotated action a_i. During thought bootstrapping, candidate thought–action pairs are sampled from an early checkpoint, and only pairs whose action matches the gold action are retained as training targets. This explicit reasoning is foundational for long-horizon generalization, error correction, and adaptive planning (Qin et al., 21 Jan 2025, Wang et al., 2 Sep 2025).
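The thought-bootstrapping selection rule amounts to rejection sampling against gold actions. In this sketch, `policy_sample` stands in for the early checkpoint, and all names and the candidate budget are illustrative:

```python
import random

def bootstrap_thoughts(policy_sample, gold_pairs, n_candidates=8, seed=0):
    """Rejection-sample thought-action pairs that match the gold action.

    `policy_sample(state, rng)` returns a (thought, action) pair from an
    early checkpoint; a candidate thought is kept only when its action
    agrees with the annotated gold action for that state.
    """
    rng = random.Random(seed)
    kept = []
    for state, gold_action in gold_pairs:
        for _ in range(n_candidates):
            thought, action = policy_sample(state, rng)
            if action == gold_action:    # thought is causally consistent
                kept.append((state, thought, gold_action))
                break
    return kept
```

Filtering on action agreement is what ties each retained thought to the behavior it is supposed to explain.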
4. Reinforcement Learning and Stability Mechanisms
UI-TARS-2 incorporates a stabilized multi-turn RL framework, centered on Proximal Policy Optimization (PPO) with several enhancements:
- Decoupled-GAE: independent GAE parameters λ_policy and λ_value for the policy and value functions to manage the bias/variance trade-off
- Length-Adaptive GAE: λ_policy = 1 − 1/(αT), where T is the trajectory length, so longer rollouts use a larger λ
- Value Pretraining: offline convergence of the critic with λ_value = 1 (Monte Carlo return targets) before joint optimization
- Clip Higher: distinct upper/lower thresholds for PPO clipping to encourage exploration
- Reward shaping and length/format penalties
Asynchronous rollouts, streaming training pools, and stateful environments further stabilize RL in long-horizon scenarios. Analysis demonstrates steadily rising reward and entropy, a reduction in reasoning tokens for GUI tasks, efficient interaction rounds, and effective value pretraining. Quantization (W4A8) achieves a 40% latency reduction (from 4.0s to 2.5s) with minimal accuracy loss (Wang et al., 2 Sep 2025).
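The decoupled and length-adaptive advantage estimation can be sketched as below. Separate λ values are used for policy and value; the specific adaptive rule λ_policy = 1 − 1/(αT) is one plausible form assumed here, and λ_value = 1 recovers the Monte Carlo return targets used for value pretraining:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Standard generalized advantage estimation over one trajectory."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0   # terminal bootstrap = 0
        delta = rewards[t] + gamma * next_v - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

def decoupled_length_adaptive_gae(rewards, values, gamma=0.99, alpha=1.0):
    """Decoupled GAE with a length-adaptive policy lambda.

    The policy uses lambda_policy = 1 - 1/(alpha * T) (illustrative form of
    the length-adaptive rule), while the value target uses lambda_value = 1,
    i.e. Monte Carlo returns, matching the value-pretraining setting.
    """
    T = len(rewards)
    lam_policy = 1.0 - 1.0 / (alpha * T)
    policy_adv = gae(rewards, values, gamma, lam_policy)
    value_targets = gae(rewards, values, gamma, lam=1.0) + np.asarray(values)
    return policy_adv, value_targets
```

Decoupling lets the critic learn from low-bias targets while the policy gradient keeps a variance level appropriate to the rollout length.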
5. Hybrid GUI-SDK Environment and Sandbox Infrastructure
UI-TARS-2’s environment design extends beyond pure GUI interaction by integrating a hybrid GUI-SDK backend. Agents can:
- Perform on-screen operations
- Access file systems via shell commands
- Invoke SDK tools (e.g., REST API, MCP tools)
- Utilize embedded development environments (terminals, VS Code/Jupyter previews, proxy URLs)
This enables workflows such as immediate post-download file processing and broadens applicability to software engineering and system administration tasks. The sandbox infrastructure uses:
- Cloud VM Sandbox: orchestrating thousands of Windows/Ubuntu/Android VMs
- Browser Sandbox: deterministic, GPU-accelerated containers for WebGL/HTML5 mini-games
Provisioning supports millions of rollouts with real-time monitoring, auto-recovery, and consistent tool APIs (Wang et al., 2 Sep 2025).
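The hybrid GUI-SDK action space described in this section can be illustrated with a toy dispatcher. The action schema (`kind`/`payload`) and the handler stubs are assumptions for illustration, not the system's actual API:

```python
import subprocess

def dispatch(action: dict):
    """Route a unified-action-space step to a GUI, shell, or SDK backend.

    Each action carries a `kind` selecting the backend and a `payload`
    with backend-specific arguments (schema is illustrative).
    """
    kind = action["kind"]
    if kind == "gui":
        # e.g. {"kind": "gui", "payload": {"op": "click", "x": 120, "y": 340}}
        return f"gui:{action['payload']['op']}"
    if kind == "shell":
        # File-system access via shell commands inside the sandbox VM.
        result = subprocess.run(action["payload"], shell=True,
                                capture_output=True, text=True)
        return result.stdout.strip()
    if kind == "sdk":
        # e.g. a REST or MCP tool invocation; stubbed here.
        return f"sdk:{action['payload']['tool']}"
    raise ValueError(f"unknown action kind: {kind}")
```

A single routing layer like this is what lets one policy mix on-screen operations with shell and SDK calls in the same trajectory.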
6. Empirical Performance across Benchmarks
UI-TARS-2 achieves state-of-the-art performance on diverse GUI and game benchmarks, consistently outperforming both prior open-source models and proprietary baselines:
| Model | OSWorld | WAA | TB† | SB† | AndroidWorld | Online-Mind2Web | BC-zh | BC-en |
|---|---|---|---|---|---|---|---|---|
| UI-TARS-1.5 | 42.5 | 42.1 | ✗ | ✗ | 64.2 | 75.8 | ✗ | ✗ |
| UI-TARS-2 | 47.5 | 50.6 | 45.3† | 68.7† | 73.3 | 88.2 | 32.1 (50.5†) | 7.0 (29.6†) |
| OpenAI o3 | 42.9 | — | ✗ | ✗ | 52.5 | 71.0 | — | — |
| Claude-4 | 43.9 | — | 39.2 | 72.7 | — | — | 22.5 | 14.7 |
† Indicates enhanced action space with GUI-SDK. WAA = WindowsAgentArena; TB = Terminal-Bench; SB = SWE-Bench; BC = BrowseComp. BrowseComp scores improve significantly (e.g., BC-en 7.0→29.6; BC-zh 32.1→50.5) with GUI-SDK extensions.
In the 15-game suite, UI-TARS-2 RL attains mean normalized scores of 59.8 (human = 100), surpassing OpenAI CUA (24.7) and Claude CU (21.6), with several games exceeding 90% human-level. On LMGame-Bench, UI-TARS-2 remains competitive with OpenAI o3 and Gemini 2.5 Pro (Wang et al., 2 Sep 2025, Qin et al., 21 Jan 2025).
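The "human = 100" convention above corresponds to per-game human-normalized scoring. A minimal sketch, assuming simple ratio normalization without per-game clipping or random-agent baselines:

```python
def mean_normalized_score(agent_scores, human_scores):
    """Mean human-normalized score: 100 * agent / human, averaged over games."""
    per_game = [100.0 * a / h for a, h in zip(agent_scores, human_scores)]
    return sum(per_game) / len(per_game)
```

Under this convention, a mean of 59.8 means the agent recovers about 60% of human performance averaged over the suite.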
7. Generalization Potential and Evolutionary Roadmap
UI-TARS-2 demonstrates robust generalization to long-horizon, open-ended tasks, notably information-seeking (web browsing) and software engineering (shell and code-bench tasks). RL fine-tuning elevates Online-Mind2Web accuracy from 83.7% (SFT) to 88.2% (RL), with similar uplifts observed on OSWorld and AndroidWorld. The system’s hybrid environment and thought-augmented policy contribute to multi-hop, API-free web search and multimodal system administration capabilities.
The UI-TARS roadmap outlines evolution from rule-based and modular architectures to fully native end-to-end agents. Stage 4 research focuses on lifelong adaptation, autonomous task proposal, robust self-supervision for novel UI patterns, and sample-efficient System-2 reasoning. This trajectory suggests continued advancement toward active agents capable of dynamic self-evaluation and adaptation in live computing environments (Qin et al., 21 Jan 2025, Wang et al., 2 Sep 2025).