
ByteDance UI-TARS System

Updated 26 January 2026
  • ByteDance UI-TARS System is a unified native agent framework that integrates vision, reasoning, and action to autonomously interact with GUIs across desktop, mobile, and web platforms.
  • It employs a Mixture-of-Experts Transformer and vision-language model to process screenshots and GUI traces, enabling multi-turn reinforcement learning with explicit System-2 reasoning.
  • Iterative data generation through reflective trace bootstrapping, coupled with a hybrid GUI-SDK environment and advanced RL stabilization techniques, drives its state-of-the-art performance and scalability.

ByteDance’s UI-TARS System is an end-to-end native agent framework designed for autonomous interaction with graphical user interfaces (GUIs), unifying visual perception, reasoning, action, and memory within a single policy. The system, comprising UI-TARS and its successor UI-TARS-2, represents a major milestone in scalable, multi-turn reinforcement learning (RL) for GUI agents. It departs from prompt-centric, commercial model wrappers, instead relying on a Mixture-of-Experts Transformer backbone and a vision-LLM that directly processes screenshots and GUI traces. The technical architecture leverages a unified action space, advanced System-2 reasoning, and iterative data generation through reflection and reinforcement, targeting robust performance across desktop, mobile, and web platforms (Wang et al., 2 Sep 2025; Qin et al., 21 Jan 2025).

1. Architectural Foundations and Unified Policy Loop

UI-TARS-2 deploys a native agent formulation built on a Mixture-of-Experts Transformer that incorporates a 532M-parameter vision encoder. The policy $\pi_\theta$ operates over multimodal observations, memory, and task instructions, $\pi_\theta(r_t, a_t \mid \mathrm{instr}, \mathcal{W}_t, o_t, \mathcal{E}_t)$, where each timestep $t$ includes an internal reasoning token $r_t$, an action $a_t$ (a GUI or SDK operation), and a new observation $o_{t+1}$, yielding the joint trajectory

$\tau = \{(r_0, a_0, o_1), (r_1, a_1, o_2), \ldots, (r_T, a_T, o_{T+1})\}$

Memory consists of high-fidelity working memory $\mathcal{W}_t$ (the last $N$ steps) and episodic memory $\mathcal{E}_t$ (compressed summaries of past episodes), enabling persistent context across long-horizon interactions. This architecture obviates modular action–perception pipelines, supporting unified end-to-end learning for perception, action grounding, and system-level reasoning (Wang et al., 2 Sep 2025).
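The reason–act–observe loop with the two memory tiers can be sketched in Python. The window size, the compression step, and the `policy`/`env` interfaces below are illustrative assumptions, not the paper's actual API.

```python
from collections import deque
from dataclasses import dataclass, field

N_WORKING = 8  # assumed working-memory window (the last N steps)

@dataclass
class Memory:
    # High-fidelity working memory W_t: raw recent steps, bounded length.
    working: deque = field(default_factory=lambda: deque(maxlen=N_WORKING))
    # Episodic memory E_t: compressed summaries of completed episodes.
    episodic: list = field(default_factory=list)

    def observe(self, step):
        self.working.append(step)

    def summarize_episode(self):
        # Placeholder compression: keep only (action, observation) pairs.
        self.episodic.append([(a, o) for (r, a, o) in self.working])
        self.working.clear()

def run_episode(policy, env, instr, memory, max_steps=20):
    """One rollout of the unified loop: reason -> act -> observe."""
    obs = env.reset()
    trajectory = []
    for t in range(max_steps):
        # The policy jointly emits a reasoning token r_t and an action a_t,
        # conditioned on the instruction, both memory tiers, and o_t.
        r_t, a_t = policy(instr, list(memory.working), obs, memory.episodic)
        obs, done = env.step(a_t)
        step = (r_t, a_t, obs)
        trajectory.append(step)
        memory.observe(step)
        if done:
            break
    memory.summarize_episode()
    return trajectory
```

On episode end, the raw working window is compressed into one episodic entry, so context persists across episodes without the full step history.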

2. Data Flywheel and Iterative Trace Bootstrapping

UI-TARS introduces an iterative data flywheel comprising continual pre-training (CT) on diverse corpora, supervised fine-tuning (SFT) on high-quality human- and LLM-annotated traces, and reinforcement learning (RL) on verifiable tasks. After each training cycle $t$, new trajectories are validated:

  • High-quality samples ($V(s) = 1$) enter $D_{\mathrm{SFT}}^{(t+1)}$
  • Lower-quality samples are appended to $D_{\mathrm{CT}}^{(t+1)}$

This mechanism ensures that

$P(V(s) = 1 \mid M^{(t)}) > P(V(s) = 1 \mid M^{(t-1)})$

enabling accelerated capability gains. UI-TARS further employs reflective online trace bootstrapping: agents are deployed on large virtual device pools, traces pass multi-stage filtering (heuristics, VLM scoring, human review), and errors are corrected through direct preference optimization (DPO), in which both positive and negative action traces inform fine-tuning (Qin et al., 21 Jan 2025; Wang et al., 2 Sep 2025).
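The flywheel's routing rule, where verified trajectories feed the next SFT pool and the remainder enriches the CT corpus, can be sketched as follows; `validator` stands in for the verifier V(s) and is an assumed callable.

```python
def flywheel_step(trajectories, validator):
    """One cycle of the data flywheel: route each new trajectory by quality.

    validator(s) plays the role of V(s): it returns 1 for verified samples.
    High-quality samples go to D_SFT for the next cycle; lower-quality
    samples are appended to D_CT for continual pre-training.
    """
    d_sft, d_ct = [], []
    for s in trajectories:
        (d_sft if validator(s) == 1 else d_ct).append(s)
    return d_sft, d_ct
```

Because the model trained on cycle t's pools generates cycle t+1's trajectories, each iteration raises the expected pass rate of fresh samples.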

3. System-2 Reasoning and Thought-Augmented Planning

Both UI-TARS and UI-TARS-2 embed explicit System-2 reasoning mechanisms to support complex multi-step decision making. Intermediate "thought" tokens $t_i$ are generated to encode reasoning patterns such as task decomposition, milestone recognition, reflection, and trial-and-error. For UI-TARS, the policy factorizes as

$P(t_n, a_n \mid \mathrm{instr}, (o_{n-N}, t_{n-N}, a_{n-N}), \ldots, o_n)$

The ActRe annotation protocol generates thoughts causally aligned to actions:

$t_n = \mathrm{VLM}(\mathrm{instr}, (o_1, t_1, a_1), \ldots, o_n, a_n)$

During thought bootstrapping, candidate thought–action pairs are sampled from an early checkpoint, selecting those matching the gold action:

$\{(\hat t_{n,i}, \hat a_{n,i})\}_{i=1}^{K} = \mathrm{UI\text{-}TARS}_{\mathrm{early}}(\ldots), \quad \hat t_n = \hat t_{n,i^*} \ \text{s.t.}\ \hat a_{n,i^*} = a_n$

This explicit reasoning is foundational for long-horizon generalization, error correction, and adaptive planning (Qin et al., 21 Jan 2025; Wang et al., 2 Sep 2025).
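Thought bootstrapping amounts to rejection sampling against the gold action. A minimal sketch, assuming `early_model` is a hypothetical callable wrapping an early checkpoint that returns one (thought, action) candidate per call:

```python
def bootstrap_thought(early_model, context, gold_action, k=16):
    """Sample up to K (thought, action) candidates from an early checkpoint
    and keep a thought whose paired action matches the gold action.
    Returns None if no candidate reproduces the gold action."""
    for _ in range(k):
        thought, action = early_model(context)
        if action == gold_action:
            return thought
    return None
```

Accepted thoughts are causally consistent with the demonstrated action, so they can be spliced into SFT traces without changing the action labels.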

4. Reinforcement Learning and Stability Mechanisms

UI-TARS-2 incorporates a stabilized multi-turn RL framework, centered on Proximal Policy Optimization (PPO) with several enhancements:

  • Decoupled GAE: independent $\lambda$ for the policy and value functions to manage the bias–variance trade-off
  • Length-Adaptive GAE: $\lambda_{\mathrm{policy}} = 1 - \frac{1}{\alpha \ell}$ with $\alpha = 0.05$, so longer trajectories use $\lambda$ closer to 1
  • Value pretraining: offline convergence with $\lambda = 1.0$ using Monte Carlo targets
  • Clip Higher: distinct upper/lower clipping thresholds $\varepsilon$ in the PPO objective to encourage exploration
  • Reward shaping with length and format penalties

Asynchronous rollouts, streaming training pools, and stateful environments further stabilize RL in long-horizon scenarios. Analysis demonstrates a steady ascent of reward and entropy, a reduction in reasoning tokens for GUI tasks, efficient interaction rounds, and effective value pretraining. Quantization (W4A8) achieves a ~40% latency reduction (from 4.0 s to 2.5 s) with minimal accuracy loss (Wang et al., 2 Sep 2025).
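The Length-Adaptive GAE schedule and the asymmetric "Clip Higher" objective can be sketched as follows. Only the formula λ_policy = 1 − 1/(α·ℓ) with α = 0.05 comes from the source; the specific clipping thresholds and the clamping of λ for very short trajectories are illustrative assumptions.

```python
def length_adaptive_lambda(length, alpha=0.05):
    # λ_policy = 1 - 1/(α·ℓ): longer trajectories push λ toward 1.
    # Clamping at 0 for very short trajectories is an added safeguard.
    return max(0.0, 1.0 - 1.0 / (alpha * length))

def gae(rewards, values, gamma=1.0, lam=None):
    """Generalized Advantage Estimation over one trajectory.
    `values` has length T+1 (a bootstrap value appended at the end)."""
    if lam is None:
        lam = length_adaptive_lambda(len(rewards))
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def clip_higher_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    # Asymmetric PPO clipping: a larger upper threshold leaves more
    # headroom for raising the probability of well-rewarded actions.
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)
```

With λ = 1 and γ = 1 the estimator reduces to Monte Carlo returns minus values, matching the value-pretraining setting described above.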

5. Hybrid GUI-SDK Environment and Sandbox Infrastructure

UI-TARS-2’s environment design extends beyond pure GUI interaction by integrating a hybrid GUI-SDK backend. Agents can:

  • Perform on-screen operations
  • Access file systems via shell commands
  • Invoke SDK tools (e.g., REST API, MCP tools)
  • Utilize embedded development environments (terminals, VS Code/Jupyter previews, proxy URLs)

This enables workflows such as immediate post-download file processing and broadens applicability to software engineering and system administration tasks. The sandbox infrastructure uses:

  • Cloud VM Sandbox: orchestrating thousands of Windows/Ubuntu/Android VMs
  • Browser Sandbox: deterministic, GPU-accelerated containers for WebGL/HTML5 mini-games

Provisioning supports millions of rollouts with real-time monitoring, auto-recovery, and consistent tool APIs (Wang et al., 2 Sep 2025).
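A unified hybrid action space amounts to routing one action vocabulary across GUI primitives and SDK tools. The registry sketch below is a hypothetical illustration; the handler names and signatures are not the system's actual API.

```python
from typing import Callable, Dict

class HybridEnv:
    """Minimal sketch of a hybrid GUI-SDK action space: the agent emits a
    single named action that is routed to a GUI primitive or an SDK tool."""

    def __init__(self):
        self.handlers: Dict[str, Callable[..., str]] = {}

    def register(self, name: str, fn: Callable[..., str]):
        self.handlers[name] = fn

    def execute(self, name: str, **kwargs) -> str:
        # Unknown actions return an error observation instead of raising,
        # so the agent can recover within the episode.
        if name not in self.handlers:
            return f"error: unknown action '{name}'"
        return self.handlers[name](**kwargs)

env = HybridEnv()
env.register("click", lambda x, y: f"clicked ({x},{y})")   # GUI primitive
env.register("shell", lambda cmd: f"ran: {cmd}")           # file-system access
env.register("http_get", lambda url: f"GET {url}")         # SDK/REST-style tool
```

Because every handler returns a textual observation, GUI and SDK results flow back through the same observation channel the policy already consumes.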

6. Empirical Performance across Benchmarks

UI-TARS-2 achieves state-of-the-art performance on diverse GUI and game benchmarks, consistently outperforming both prior open-source models and proprietary baselines:

| Model | OSWorld | WAA | TB† | SB† | AndroidWorld | Online-Mind2Web | BC-zh | BC-en |
|---|---|---|---|---|---|---|---|---|
| UI-TARS-1.5 | 42.5 | 42.1 | – | – | 64.2 | 75.8 | – | – |
| UI-TARS-2 | 47.5 | 50.6 | 45.3† | 68.7† | 73.3 | 88.2 | 32.1 (50.5†) | 7.0 (29.6†) |
| OpenAI o3 | 42.9 | 52.5 | – | – | – | 71.0 | – | – |
| Claude-4 | 43.9 | 39.2 | – | – | – | 72.7 | 22.5 | 14.7 |

† Indicates enhanced action space with GUI-SDK. BrowseComp scores improve significantly (e.g., BC-en 7.0→29.6; BC-zh 32.1→50.5) with GUI-SDK extensions.

In the 15-game suite, UI-TARS-2 RL attains mean normalized scores of 59.8 (human = 100), surpassing OpenAI CUA (24.7) and Claude CU (21.6), with several games exceeding 90% human-level. On LMGame-Bench, UI-TARS-2 remains competitive with OpenAI o3 and Gemini 2.5 Pro (Wang et al., 2 Sep 2025, Qin et al., 21 Jan 2025).

7. Generalization Potential and Evolutionary Roadmap

UI-TARS-2 demonstrates robust generalization to long-horizon, open-ended tasks, notably information-seeking (web browsing) and software engineering (shell and code-bench tasks). RL fine-tuning elevates Online-Mind2Web accuracy from 83.7% (SFT) to 88.2% (RL); a similar uplift is observed on OSWorld and AndroidWorld. The system’s hybrid environment and thought-augmented policy underpin multi-hop, API-free web search and multi-modal system administration capabilities.

The UI-TARS roadmap outlines evolution from rule-based and modular architectures to fully native end-to-end agents. Stage 4 research focuses on lifelong adaptation, autonomous task proposal, robust self-supervision for novel UI patterns, and sample-efficient System-2 reasoning. This trajectory suggests continued advancement toward active agents capable of dynamic self-evaluation and adaptation in live computing environments (Qin et al., 21 Jan 2025, Wang et al., 2 Sep 2025).
