
ByteDance UI-TARS System

Updated 26 January 2026
  • ByteDance UI-TARS System is a unified native agent framework that integrates vision, reasoning, and action to autonomously interact with GUIs across desktop, mobile, and web platforms.
  • It employs a Mixture-of-Experts Transformer and vision-language model to process screenshots and GUI traces, enabling multi-turn reinforcement learning with explicit System-2 reasoning.
  • Iterative data generation through reflective trace bootstrapping, coupled with a hybrid GUI-SDK environment and advanced RL stabilization techniques, drives its state-of-the-art performance and scalability.

ByteDance’s UI-TARS System is an end-to-end native agent framework designed for autonomous interaction with graphical user interfaces (GUIs), unifying visual perception, reasoning, action, and memory within a single policy. The system, comprising UI-TARS and its successor UI-TARS-2, represents a major milestone in scalable, multi-turn reinforcement learning (RL) for GUI agents. It departs from prompt-centric, commercial model wrappers, instead relying on a Mixture-of-Experts Transformer backbone and a vision-LLM that directly processes screenshots and GUI traces. The technical architecture leverages a unified action space, advanced System-2 reasoning, and iterative data generation through reflection and reinforcement, targeting robust performance across desktop, mobile, and web platforms (Wang et al., 2 Sep 2025; Qin et al., 21 Jan 2025).

1. Architectural Foundations and Unified Policy Loop

UI-TARS-2 deploys a native agent formulation built on a Mixture-of-Experts Transformer that incorporates a 532M-parameter vision encoder. The policy $\pi_\theta$ operates over multimodal observations, memory, and task instructions, $\pi_\theta(r_t, a_t \mid \mathrm{instr}, \mathcal{W}_t, o_t, \mathcal{E}_t)$, where each timestep $t$ includes an internal reasoning token $r_t$, an action $a_t$ (a GUI or SDK operation), and a new observation $o_{t+1}$, yielding the joint trajectory

$\tau = \{(r_0, a_0, o_1), (r_1, a_1, o_2), \ldots, (r_T, a_T, o_{T+1})\}$

Memory consists of high-fidelity working memory $\mathcal{W}_t$ (the last $N$ steps) and episodic memory $\mathcal{E}_t$ (compressed summaries of past episodes), enabling persistent context across long-horizon interactions. This architecture obviates modular action–perception pipelines, supporting unified end-to-end learning for perception, action grounding, and system-level reasoning (Wang et al., 2 Sep 2025).
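The reason–act–observe loop with the two memory tiers can be sketched in Python. The window size, the compression step, and the `policy`/`env` interfaces below are illustrative assumptions, not the paper's actual API.

```python
from collections import deque
from dataclasses import dataclass, field

N_WORKING = 8  # assumed working-memory window (the last N steps)

@dataclass
class Memory:
    # High-fidelity working memory W_t: raw recent steps, bounded length.
    working: deque = field(default_factory=lambda: deque(maxlen=N_WORKING))
    # Episodic memory E_t: compressed summaries of completed episodes.
    episodic: list = field(default_factory=list)

    def observe(self, step):
        self.working.append(step)

    def summarize_episode(self):
        # Placeholder compression: keep only (action, observation) pairs.
        self.episodic.append([(a, o) for (r, a, o) in self.working])
        self.working.clear()

def run_episode(policy, env, instr, memory, max_steps=20):
    """One rollout of the unified loop: reason -> act -> observe."""
    obs = env.reset()
    trajectory = []
    for t in range(max_steps):
        # The policy jointly emits a reasoning token r_t and an action a_t,
        # conditioned on the instruction, both memory tiers, and o_t.
        r_t, a_t = policy(instr, list(memory.working), obs, memory.episodic)
        obs, done = env.step(a_t)
        step = (r_t, a_t, obs)
        trajectory.append(step)
        memory.observe(step)
        if done:
            break
    memory.summarize_episode()
    return trajectory
```

On episode end, the raw working window is compressed into one episodic entry, so context persists across episodes without the full step history.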

2. Data Flywheel and Iterative Trace Bootstrapping

UI-TARS introduces an iterative data flywheel comprising continual pre-training (CT) on diverse corpora, supervised fine-tuning (SFT) on high-quality human- and LLM-annotated traces, and reinforcement learning (RL) on verifiable tasks. After each training cycle $t$, new trajectories are validated:

  • High-quality samples ($V(s) = 1$) enter $D_{\mathrm{SFT}}^{(t+1)}$
  • Lower-quality samples are appended to $D_{\mathrm{CT}}^{(t+1)}$

This mechanism ensures that

$P(V(s) = 1 \mid M^{(t)}) > P(V(s) = 1 \mid M^{(t-1)})$

enabling accelerated capability gains. UI-TARS further employs reflective online trace bootstrapping: agents are deployed on large virtual device pools, traces pass multi-stage filtering (heuristics, VLM scoring, human review), and errors are corrected through direct preference optimization (DPO), in which both positive and negative action traces inform fine-tuning (Qin et al., 21 Jan 2025; Wang et al., 2 Sep 2025).
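The flywheel's routing rule, where verified trajectories feed the next SFT pool and the remainder enriches the CT corpus, can be sketched as follows; `validator` stands in for the verifier V(s) and is an assumed callable.

```python
def flywheel_step(trajectories, validator):
    """One cycle of the data flywheel: route each new trajectory by quality.

    validator(s) plays the role of V(s): it returns 1 for verified samples.
    High-quality samples go to D_SFT for the next cycle; lower-quality
    samples are appended to D_CT for continual pre-training.
    """
    d_sft, d_ct = [], []
    for s in trajectories:
        (d_sft if validator(s) == 1 else d_ct).append(s)
    return d_sft, d_ct
```

Because the model trained on cycle t's pools generates cycle t+1's trajectories, each iteration raises the expected pass rate of fresh samples.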

3. System-2 Reasoning and Thought-Augmented Planning

Both UI-TARS and UI-TARS-2 embed explicit System-2 reasoning mechanisms to support complex multi-step decision making. Intermediate "thought" tokens $t_i$ are generated to encode reasoning patterns such as task decomposition, milestone recognition, reflection, and trial-and-error. For UI-TARS, the policy factorizes as

$P(t_n, a_n \mid \mathrm{instr}, (o_{n-N}, t_{n-N}, a_{n-N}), \ldots, o_n)$

The ActRe annotation protocol generates thoughts causally aligned to actions:

$t_n = \mathrm{VLM}(\mathrm{instr}, (o_1, t_1, a_1), \ldots, o_n, a_n)$

During thought bootstrapping, candidate thought–action pairs are sampled from an early checkpoint, selecting those matching the gold action:

$\{(\hat t_{n,i}, \hat a_{n,i})\}_{i=1}^{K} = \mathrm{UI\text{-}TARS}_{\mathrm{early}}(\ldots), \quad \hat t_n = \hat t_{n,i^*} \ \text{s.t.}\ \hat a_{n,i^*} = a_n$

This explicit reasoning is foundational for long-horizon generalization, error correction, and adaptive planning (Qin et al., 21 Jan 2025; Wang et al., 2 Sep 2025).
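Thought bootstrapping amounts to rejection sampling against the gold action. A minimal sketch, assuming `early_model` is a hypothetical callable wrapping an early checkpoint that returns one (thought, action) candidate per call:

```python
def bootstrap_thought(early_model, context, gold_action, k=16):
    """Sample up to K (thought, action) candidates from an early checkpoint
    and keep a thought whose paired action matches the gold action.
    Returns None if no candidate reproduces the gold action."""
    for _ in range(k):
        thought, action = early_model(context)
        if action == gold_action:
            return thought
    return None
```

Accepted thoughts are causally consistent with the demonstrated action, so they can be spliced into SFT traces without changing the action labels.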

4. Reinforcement Learning and Stability Mechanisms

UI-TARS-2 incorporates a stabilized multi-turn RL framework, centered on Proximal Policy Optimization (PPO) with several enhancements:

  • Decoupled GAE: independent $\lambda$ for the policy and value functions to manage the bias–variance trade-off
  • Length-Adaptive GAE: $\lambda_{\mathrm{policy}} = 1 - \frac{1}{\alpha \ell}$ with $\alpha = 0.05$, so longer trajectories use $\lambda$ closer to 1
  • Value pretraining: offline convergence with $\lambda = 1.0$ using Monte Carlo targets
  • Clip Higher: distinct upper/lower clipping thresholds $\varepsilon$ in the PPO objective to encourage exploration
  • Reward shaping with length and format penalties

Asynchronous rollouts, streaming training pools, and stateful environments further stabilize RL in long-horizon scenarios. Analysis demonstrates a steady ascent of reward and entropy, a reduction in reasoning tokens for GUI tasks, efficient interaction rounds, and effective value pretraining. Quantization (W4A8) achieves a ~40% latency reduction (from 4.0 s to 2.5 s) with minimal accuracy loss (Wang et al., 2 Sep 2025).
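The Length-Adaptive GAE schedule and the asymmetric "Clip Higher" objective can be sketched as follows. Only the formula λ_policy = 1 − 1/(α·ℓ) with α = 0.05 comes from the source; the specific clipping thresholds and the clamping of λ for very short trajectories are illustrative assumptions.

```python
def length_adaptive_lambda(length, alpha=0.05):
    # λ_policy = 1 - 1/(α·ℓ): longer trajectories push λ toward 1.
    # Clamping at 0 for very short trajectories is an added safeguard.
    return max(0.0, 1.0 - 1.0 / (alpha * length))

def gae(rewards, values, gamma=1.0, lam=None):
    """Generalized Advantage Estimation over one trajectory.
    `values` has length T+1 (a bootstrap value appended at the end)."""
    if lam is None:
        lam = length_adaptive_lambda(len(rewards))
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def clip_higher_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    # Asymmetric PPO clipping: a larger upper threshold leaves more
    # headroom for raising the probability of well-rewarded actions.
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)
```

With λ = 1 and γ = 1 the estimator reduces to Monte Carlo returns minus values, matching the value-pretraining setting described above.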

5. Hybrid GUI-SDK Environment and Sandbox Infrastructure

UI-TARS-2’s environment design extends beyond pure GUI interaction by integrating a hybrid GUI-SDK backend. Agents can:

  • Perform on-screen operations
  • Access file systems via shell commands
  • Invoke SDK tools (e.g., REST API, MCP tools)
  • Utilize embedded development environments (terminals, VS Code/Jupyter previews, proxy URLs)

This enables workflows such as immediate post-download file processing and broadens applicability to software engineering and system administration tasks. The sandbox infrastructure uses:

  • Cloud VM Sandbox: orchestrating thousands of Windows/Ubuntu/Android VMs
  • Browser Sandbox: deterministic, GPU-accelerated containers for WebGL/HTML5 mini-games

Provisioning supports millions of rollouts with real-time monitoring, auto-recovery, and consistent tool APIs (Wang et al., 2 Sep 2025).
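A unified hybrid action space amounts to routing one action vocabulary across GUI primitives and SDK tools. The registry sketch below is a hypothetical illustration; the handler names and signatures are not the system's actual API.

```python
from typing import Callable, Dict

class HybridEnv:
    """Minimal sketch of a hybrid GUI-SDK action space: the agent emits a
    single named action that is routed to a GUI primitive or an SDK tool."""

    def __init__(self):
        self.handlers: Dict[str, Callable[..., str]] = {}

    def register(self, name: str, fn: Callable[..., str]):
        self.handlers[name] = fn

    def execute(self, name: str, **kwargs) -> str:
        # Unknown actions return an error observation instead of raising,
        # so the agent can recover within the episode.
        if name not in self.handlers:
            return f"error: unknown action '{name}'"
        return self.handlers[name](**kwargs)

env = HybridEnv()
env.register("click", lambda x, y: f"clicked ({x},{y})")   # GUI primitive
env.register("shell", lambda cmd: f"ran: {cmd}")           # file-system access
env.register("http_get", lambda url: f"GET {url}")         # SDK/REST-style tool
```

Because every handler returns a textual observation, GUI and SDK results flow back through the same observation channel the policy already consumes.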

6. Empirical Performance across Benchmarks

UI-TARS-2 achieves state-of-the-art performance on diverse GUI and game benchmarks, consistently outperforming both prior open-source models and proprietary baselines:

| Model | OSWorld | WAA | TB† | SB† | AndroidWorld | Online-Mind2Web | BC-zh | BC-en |
|---|---|---|---|---|---|---|---|---|
| UI-TARS-1.5 | 42.5 | 42.1 | – | – | 64.2 | 75.8 | – | – |
| UI-TARS-2 | 47.5 | 50.6 | 45.3† | 68.7† | 73.3 | 88.2 | 32.1 (50.5†) | 7.0 (29.6†) |
| OpenAI o3 | 42.9 | 52.5 | – | – | – | 71.0 | – | – |
| Claude-4 | 43.9 | 39.2 | – | – | – | 72.7 | 22.5 | 14.7 |

† Indicates enhanced action space with GUI-SDK. BrowseComp scores improve significantly (e.g., BC-en 7.0→29.6; BC-zh 32.1→50.5) with GUI-SDK extensions.

In the 15-game suite, UI-TARS-2 RL attains mean normalized scores of 59.8 (human = 100), surpassing OpenAI CUA (24.7) and Claude CU (21.6), with several games exceeding 90% human-level. On LMGame-Bench, UI-TARS-2 remains competitive with OpenAI o3 and Gemini 2.5 Pro (Wang et al., 2 Sep 2025, Qin et al., 21 Jan 2025).

7. Generalization Potential and Evolutionary Roadmap

UI-TARS-2 demonstrates robust generalization to long-horizon, open-ended tasks, notably information-seeking (web browsing) and software engineering (shell and code-bench tasks). RL fine-tuning elevates Online-Mind2Web accuracy from 83.7% (SFT) to 88.2% (RL); a similar uplift is observed on OSWorld and AndroidWorld. The system’s hybrid environment and thought-augmented policy underpin multi-hop, API-free web search and multi-modal system administration capabilities.

The UI-TARS roadmap outlines evolution from rule-based and modular architectures to fully native end-to-end agents. Stage 4 research focuses on lifelong adaptation, autonomous task proposal, robust self-supervision for novel UI patterns, and sample-efficient System-2 reasoning. This trajectory suggests continued advancement toward active agents capable of dynamic self-evaluation and adaptation in live computing environments (Qin et al., 21 Jan 2025, Wang et al., 2 Sep 2025).
