OpenCUA-32B Agent Model
- OpenCUA-32B Agent Model is a state-of-the-art, open-source vision-language agent that automates complex computer-use tasks with chain-of-thought reasoning.
- It uses a 64-layer decoder-only transformer with multi-image history inputs and lightweight vision-language adapters for precise UI grounding.
- The model sets new performance baselines by outperforming proprietary systems on OSWorld benchmarks using a two-stage supervised training pipeline with reflective CoT supervision.
OpenCUA-32B Agent Model is a state-of-the-art, open-source, 32-billion-parameter vision-language agent model designed to automate complex computer-use tasks through reflective chain-of-thought reasoning and robust cross-operating-system generalization. Engineered as part of the OpenCUA project, OpenCUA-32B is distinguished by its foundation in high-fidelity, cross-platform human demonstration data, a specialized prompt design explicitly encoding multi-level reasoning traces, and a performance-driven training curriculum. The model establishes new open-source baselines for computer-use agent (CUA) performance, particularly on the OSWorld-Verified benchmark, surpassing prominent closed models under comparable settings (Wang et al., 12 Aug 2025).
1. Foundation and Architecture
OpenCUA-32B leverages Qwen2.5-VL-32B as its backbone, implementing a decoder-only transformer architecture with 64 layers and 32 attention heads per layer. The model employs 1D rotary positional embeddings (RoPE) for sequence encoding and incorporates a multi-image ViT encoder as its vision frontend, tokenizing each screenshot into patch embeddings. To facilitate robust cross-modal alignment, lightweight vision-language adapters are inserted between the vision encoder and the transformer stack. OpenCUA-32B aggregates three temporal screenshots (a three-image history window) to improve UI grounding across frames.
Distinct from web-agent approaches that rely on accessibility trees or DOM parsing, OpenCUA-32B uses only full-screen screenshot inputs; the entire UI state is visually encoded. Modular mixture-of-data heads are used, with grounding-specialized and reasoning-specialized blocks emerging from data-type-specific curriculum scheduling.
| Component | Specification | Origin |
|---|---|---|
| Backbone transformer | 64 layers, 32 heads | Qwen2.5-VL-32B |
| Vision encoder | Multi-image ViT patch encoder | Qwen2.5-VL |
| Positional encoding | 1D RoPE | -- |
| Adapter modules | Vision-language adapters | OpenCUA |
| History window | 3 images | OpenCUA |
The model maintains a strictly vision-language interface for both state and action spaces, with no reliance on internal app structures or programmatic APIs.
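This screenshot-only interface can be sketched as a simple agent-step loop: push the latest frame into a bounded image history, then build a prompt from the task text and the last three screenshots alone. The class, field names, and message layout below are illustrative assumptions, not the released OpenCUA API.

```python
from collections import deque

HISTORY_WINDOW = 3  # OpenCUA-32B aggregates the last three screenshots


class ScreenshotHistory:
    """Keeps only the most recent screenshots for the model's image history."""

    def __init__(self, window=HISTORY_WINDOW):
        self.frames = deque(maxlen=window)  # older frames are dropped automatically

    def push(self, screenshot):
        self.frames.append(screenshot)

    def as_model_inputs(self):
        # Oldest-to-newest ordering; each frame is tokenized into ViT patch
        # embeddings by the vision encoder before reaching the transformer.
        return list(self.frames)


def build_step_prompt(task, history, last_action=None):
    """Assemble a purely visual prompt: task text plus screenshot history.

    No accessibility tree or DOM is included -- the full UI state must be
    recovered from pixels alone. Field names here are hypothetical.
    """
    return {
        "instruction": task,
        "images": history.as_model_inputs(),
        "previous_action": last_action,  # e.g. "pyautogui.click(512, 384)"
    }
```

A run of five captured frames leaves exactly three in the prompt, matching the three-image history window described above.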
2. Data and Annotation Infrastructure
OpenCUA-32B is supported by AgentNet, an extensive annotation infrastructure designed to capture high-fidelity human-computer interaction data. The AgentNet tool operates cross-platform (Windows, macOS, Ubuntu) and unobtrusively records full-screen video frames (2 fps keyframes), mouse/keyboard actions, and accessibility-tree (AXTree) snapshots.
The AgentNet dataset comprises over 22,000 human-annotated trajectories spanning 140 applications and 190 websites, with an average task length of 18.6 steps. Action labeling reduces raw low-level input streams to 12 atomic PyAutoGUI primitives (click, write, hotkey, etc.) using rule-based compression and state/action backtracking algorithms. Each trajectory undergoes manual and automated review (including GPT-4o pre-filtering) to enforce privacy and annotation integrity. Evaluation ground truths are constructed by providing multiple acceptable “gold” actions for each step (AgentNetBench).
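The rule-based compression step can be illustrated with two of its simplest reductions: a mousedown/mouseup pair at one location collapses to a `click`, and a run of printable keypresses collapses to a single `write`. This is a minimal sketch; the actual AgentNet rules, the backtracking algorithm, and the full 12-primitive vocabulary are more involved.

```python
def compress_events(events):
    """Collapse raw input events into atomic PyAutoGUI-style primitives.

    `events` are (timestamp_ms, kind, payload) tuples. Only two illustrative
    rules are implemented here: click coalescing and keystroke batching.
    """
    actions = []
    i = 0
    while i < len(events):
        _, kind, payload = events[i]
        if kind == "mousedown" and i + 1 < len(events) and events[i + 1][1] == "mouseup":
            # mousedown followed by mouseup -> a single click primitive
            x, y = payload
            actions.append(f"pyautogui.click({x}, {y})")
            i += 2
            continue
        if kind == "keypress":
            # consecutive printable keypresses -> one write() primitive
            chars = []
            while i < len(events) and events[i][1] == "keypress":
                chars.append(events[i][2])
                i += 1
            actions.append(f"pyautogui.write({''.join(chars)!r})")
            continue
        i += 1  # unhandled event kinds are skipped in this sketch
    return actions
```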
3. Training Pipeline and Supervision Objective
The core training paradigm is a two-stage supervised process, transforming demonstration trajectories into structured state-action pairs augmented with reflective chain-of-thought (CoT) supervision. Each pair is associated with a hierarchical L3→L2→L1 reasoning trace:
- L3 (Observation): Salient visual and contextual description.
- L2 (Thought): Reflective reasoning and planning (step-by-step progress, error detection, and correction).
- L1 (Action): Concise next-action description.
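For training, each hierarchical trace is serialized into a single supervision string in observe-then-reflect-then-act order. The section markers below are hypothetical; the released prompt templates may use different delimiters, but the L3→L2→L1 ordering matches the hierarchy above.

```python
def format_cot_target(observation, thought, action):
    """Serialize an L3 -> L2 -> L1 trace into one supervision string.

    observation: L3 salient visual/contextual description
    thought:     L2 reflective reasoning and planning
    action:      L1 concise next-action description
    """
    return (
        f"## Observation\n{observation}\n"
        f"## Thought\n{thought}\n"
        f"## Action\n{action}\n"
    )
```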
The supervised loss is the standard token-level cross-entropy over each step's reasoning trace and action, conditioned on the visual observation and interaction history:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t},\, o,\, h\right),$$

where $o$ denotes the screenshot-history observation, $h$ the prior-action context, and $y = (y^{L3}, y^{L2}, y^{L1})$ the concatenated observation, thought, and action tokens.
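Numerically, this supervised objective reduces to averaging negative log-probabilities over the supervised (CoT and action) token positions while masking out prompt and observation tokens. A dependency-free sketch:

```python
def masked_nll(token_logprobs, loss_mask):
    """Average negative log-likelihood over supervised positions only.

    token_logprobs[i] is log p(y_i | y_<i, context) under the model;
    loss_mask[i] is 1 for CoT/action tokens and 0 for prompt/observation
    tokens, which contribute no gradient.
    """
    supervised = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return sum(supervised) / max(len(supervised), 1)
```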
CoT traces are generated and verified through a pipeline involving a “reflector” (error detection), “generator” (context-conditioned trace production), and “summarizer” (language refinement and scoring). The global curriculum proceeds in two stages: grounding (focused on UI perception, 35B tokens) followed by planning (action and reasoning emphasis, 60B tokens), each with specific learning rate and batch size configurations. A global-to-local mixture, including general-domain SFT data, improves robustness.
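The three-role synthesis pipeline can be sketched as a loop over trajectory steps, with each role modeled as a callable: the reflector flags flawed state-action pairs (which are skipped), the generator produces a context-conditioned trace, and the summarizer refines and scores it. The callables below are stand-ins for the model-based roles described in the text.

```python
def synthesize_cot(trajectory, reflector, generator, summarizer):
    """Sketch of the reflector/generator/summarizer CoT pipeline.

    reflector(step) -> bool          True if the state-action pair is flawed
    generator(step, history) -> str  context-conditioned draft trace
    summarizer(draft) -> (str, float) refined trace and quality score
    """
    traces = []
    history = []
    for step in trajectory:
        if reflector(step):  # error detected: skip this pair entirely
            continue
        draft = generator(step, history)
        refined, score = summarizer(draft)
        traces.append({"step": step, "trace": refined, "score": score})
        history.append(refined)  # later steps condition on earlier traces
    return traces
```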
4. Reflective Chain-of-Thought (CoT) Reasoning and History
A key innovation is the use of multilevel, reflective CoT reasoning for every agent step. L3-L2-L1 traces are incorporated upstream in the token sequence through compact prompt templates. This explicitly encodes observer, thinker, and action labels, improving both interpretability and error recovery.
The reflection-augmentation protocol involves automatic error screening (skipping flawed action-state pairs), full-context CoT synthesis, and calibration (e.g., discarding bad steps and step-wise language summarization). Empirically, using L2 or mixed L1-L2-L3 CoT at inference increases task success rates relative to shallow (L1-only) traces (e.g., 18.5% for L2 and 17.6% for L3 CoT vs. 16.9% for L1-only on 15-step OSWorld tasks). Advanced reflective CoT yields further gains (e.g., from 11.5% to 15.3% on Qwen2-VL-7B).
History representation experiments demonstrate that stacking more screenshots increases online success rates (from 6.5% with one screenshot to 9.9% with five on OSWorld), though diminishing returns are observed beyond three.
5. Performance Evaluation
OpenCUA-32B establishes new state-of-the-art results for open-source CUA models on the principal online benchmark OSWorld-Verified, with success rates of 29.7% (15 steps), 34.1% (50 steps), and 34.8% (100 steps). This surpasses the OpenAI CUA (GPT-4o) under equivalent conditions and closes in on proprietary leaders such as Claude 4 Sonnet. Pass@3 at 50 steps reaches 45.6%, demonstrating substantial gains from iterative sampling and reranking.
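The Pass@3 figure can be computed with the standard unbiased pass@k estimator (as used in code-generation evaluation): given n sampled attempts per task of which c succeed, it estimates the probability that at least one of k draws succeeds.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate.

    n: total samples drawn per task
    c: number of those samples that succeeded
    k: budget being evaluated (e.g. k=3 for Pass@3)
    Returns 1 - C(n-c, k) / C(n, k), the probability that a random
    size-k subset of the n attempts contains at least one success.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this quantity over all benchmark tasks gives the reported Pass@N numbers.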
AgentNetBench (offline):
| Model | Coord. SR | Content SR | Func. SR | Avg. SR |
|---|---|---|---|---|
| Qwen2.5-VL-32B | 66.6% | 47.2% | 41.5% | 64.8% |
| OpenAI CUA | 71.7% | 57.3% | 80.0% | 73.1% |
| OpenCUA-7B | 79.0% | 62.0% | 44.3% | 75.2% |
| OpenCUA-32B | 81.9% | 66.1% | 55.7% | 79.1% |
OpenCUA-32B also outperforms all open-source baselines on GUI grounding (e.g., 59.6% on OSWorld-G, 93.4% on ScreenSpot-V2), indicating superior vision-based localization and action grounding.
6. Generalization, Ablation, and Insights
Experiments demonstrate that CUA performance scaling depends primarily on three factors: expanded demonstration data, CoT supervision depth, and multi-image history. Out-of-domain generalization is robust: adding out-of-domain data to the training corpus yields absolute improvements of 9% or more in target success rates. Scaling the data from 3K to 14K trajectories for Windows/Mac tasks results in a ~125% gain in average SR.
Test-time compute ablations show that increased sampling and reranking (e.g., Pass@N) yield over 100% relative headroom beyond single-sample inference. Incorporating textual history (L1 monologue) during inference is more effective for long contexts than denser L2 history.
Identified limitations include annotation scaling bottlenecks, privacy-driven bias in demonstration collection, and incomplete error recovery for extremely long-horizon tasks. Open challenge domains are automated error supervision, tighter symbolic-perceptual data integration, and multi-agent inference/ranking.
7. Research Context and Future Directions
OpenCUA-32B was developed by the OpenCUA research group as an open, reproducible foundation for computer-use agents (Wang et al., 12 Aug 2025). Its comparative achievements over proprietary CUA systems and established open baselines make it a reference point for further studies in agentic VLMs, data-to-action architectures, and chain-of-thought training protocols. The release of code, datasets, annotation tools, and models at https://opencua.xlang.ai is intended to support continued progress, comparative benchmarking, and downstream specialization (e.g., in hybrid-action or tool-augmented agents as exemplified by UltraCUA (Yang et al., 20 Oct 2025)).
A plausible implication is that future improvements will require advances in reflective self-supervision, larger and higher-quality demonstration datasets, and richer integration of symbolic GUI metadata. The field is trending toward hybrid action and tool-based generalization, as agentic capabilities surpass straightforward vision-language modeling.