
ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents

Published 19 Aug 2025 in cs.AI | (2508.14040v1)

Abstract: We introduce ComputerRL, a framework for autonomous desktop intelligence that enables agents to operate complex digital workspaces skillfully. ComputerRL features the API-GUI paradigm, which unifies programmatic API calls and direct GUI interaction to address the inherent mismatch between machine agents and human-centric desktop environments. Scaling end-to-end RL training is crucial for improvement and generalization across diverse desktop tasks, yet remains challenging due to environmental inefficiency and instability in extended training. To support scalable and robust training, we develop a distributed RL infrastructure capable of orchestrating thousands of parallel virtual desktop environments to accelerate large-scale online RL. Furthermore, we propose Entropulse, a training strategy that alternates reinforcement learning with supervised fine-tuning, effectively mitigating entropy collapse during extended training runs. We employ ComputerRL on the open models GLM-4-9B-0414 and Qwen2.5-14B, and evaluate them on the OSWorld benchmark. AutoGLM-OS-9B, based on GLM-4-9B-0414, achieves a new state-of-the-art accuracy of 48.1%, demonstrating significant improvements for general agents in desktop automation. The algorithm and framework are adopted in building AutoGLM (Liu et al., 2024a).

Summary

  • The paper introduces a hybrid API-GUI action paradigm that outperforms traditional GUI-only methods by combining API calls with direct GUI operations.
  • It employs a scalable, distributed RL infrastructure with asynchronous training and a multi-stage pipeline, including the novel Entropulse strategy, to sustain exploration.
  • The framework achieves 48.1% success on OSWorld, representing a 64% performance improvement over behavior cloning baselines and significant efficiency gains.

ComputerRL: Scalable End-to-End RL for Autonomous Computer Use Agents

Introduction

The paper introduces ComputerRL, a comprehensive framework for training autonomous agents to operate complex desktop environments using end-to-end reinforcement learning (RL). The framework addresses the inefficiencies and limitations of prior approaches—primarily behavior cloning (BC) and GUI-only RL—by integrating a hybrid API-GUI action space, a scalable distributed RL infrastructure, and a novel entropy-preserving training strategy (Entropulse). The system is evaluated on the OSWorld benchmark, where it achieves a new state-of-the-art success rate of 48.1% with the AutoGLM-OS-9B agent, demonstrating substantial improvements in both efficiency and generalization over previous models.

The ComputerRL Framework

API-GUI Action Paradigm

A central innovation is the API-GUI paradigm, which unifies programmatic API calls with direct GUI actions. This hybrid action space allows agents to bypass the inefficiencies of human-centric GUI operations by leveraging machine-oriented APIs when available, while retaining the flexibility to fall back on GUI actions for tasks not covered by APIs. The framework automates the construction of application-specific APIs using LLMs, which analyze user-provided task exemplars, generate interface definitions, implement APIs via application libraries, and validate them through automated test cases. This automation significantly reduces the manual effort required to extend agent capabilities to new applications (Figure 1).

Figure 1: The ComputerRL framework integrates API-GUI actions, large-scale parallel desktop environments, and asynchronous RL for efficient agent training.
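As a rough illustration of the hybrid action space described above, the sketch below prefers a registered API call when one exists and falls back to GUI primitives otherwise. All names here (`HybridActionSpace`, `ApiAction`, `GuiAction`, `calc.set_cell`) are hypothetical stand-ins, not the paper's actual interfaces.

```python
from collections.abc import Callable
from dataclasses import dataclass

@dataclass
class GuiAction:
    """A human-style primitive: click, type, scroll, ..."""
    kind: str   # e.g. "click", "type"
    args: dict

@dataclass
class ApiAction:
    """A machine-oriented call exposed by an application-specific API."""
    name: str   # e.g. "calc.set_cell"
    args: dict

class HybridActionSpace:
    def __init__(self) -> None:
        self.apis: dict[str, Callable[..., str]] = {}

    def register_api(self, name: str, fn: Callable[..., str]) -> None:
        self.apis[name] = fn

    def execute(self, action) -> str:
        # Prefer the programmatic path when an API covers the call;
        # otherwise fall back to GUI primitives.
        if isinstance(action, ApiAction) and action.name in self.apis:
            return self.apis[action.name](**action.args)
        if isinstance(action, GuiAction):
            return f"gui:{action.kind}"   # dispatched to the desktop driver
        return "error: unsupported action"

space = HybridActionSpace()
space.register_api("calc.set_cell", lambda cell, value: f"set {cell}={value}")
print(space.execute(ApiAction("calc.set_cell", {"cell": "A1", "value": 7})))  # set A1=7
print(space.execute(GuiAction("click", {"x": 100, "y": 200})))                # gui:click
```

The fallback branch is what keeps the agent general: any task an auto-generated API does not cover still remains reachable through ordinary GUI operations.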

Scalable Distributed RL Infrastructure

ComputerRL addresses the resource intensiveness and instability of prior desktop RL environments by introducing a containerized, multi-node Ubuntu cluster managed via Docker and gRPC. The system supports thousands of concurrent virtual desktop instances, each exposing a standardized AgentBench-compatible API. This design enables high-throughput, reproducible experimentation and efficient resource utilization, with a web-based controller for real-time monitoring and debugging.
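As a loose sketch of the pooling idea behind such a cluster (not the paper's implementation), a thread-safe pool can hand out idle virtual desktops to rollout workers and reclaim them afterward. `DesktopEnv` below is a hypothetical stand-in for a gRPC client to one containerized Ubuntu desktop.

```python
import queue

class DesktopEnv:
    """Stand-in for a gRPC client bound to one virtual desktop container."""
    def __init__(self, env_id: int):
        self.env_id = env_id

    def reset(self, task: str) -> str:
        # The real system would restore a clean desktop snapshot here.
        return f"env{self.env_id}:ready:{task}"

class EnvPool:
    """Hands out idle environments to rollout workers, thread-safely."""
    def __init__(self, size: int):
        self._idle: queue.Queue = queue.Queue()
        for i in range(size):
            self._idle.put(DesktopEnv(i))

    def acquire(self) -> DesktopEnv:
        return self._idle.get()        # blocks until an environment is free

    def release(self, env: DesktopEnv) -> None:
        self._idle.put(env)

pool = EnvPool(size=4)
env = pool.acquire()
print(env.reset("open_spreadsheet"))   # env0:ready:open_spreadsheet
pool.release(env)
```

The blocking `acquire` is the key design choice: rollout workers naturally throttle themselves to the number of healthy environments, which is what makes thousands of concurrent instances manageable.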

Fully Asynchronous RL Training

The RL training pipeline is built on the AgentRL framework, which decouples data collection from policy updates via asynchronous execution. Actors, critics, and replay buffers operate independently, with dynamic batch sizing and off-policy bias mitigation through limited buffer sizes and frequent policy synchronization. This architecture maximizes hardware utilization and accelerates convergence, supporting large-scale experiments that were previously infeasible.
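A minimal sketch of this decoupling, with illustrative names only: actor threads append trajectories to a bounded buffer while a learner consumes them, so a small buffer capacity plus frequent policy-version bumps stand in for the off-policy mitigations described above.

```python
import collections
import random
import threading
import time

BUFFER_CAP = 64
buffer = collections.deque(maxlen=BUFFER_CAP)   # old data is evicted automatically
lock = threading.Lock()
policy_version = 0

def actor(n_rollouts: int) -> None:
    # Collects trajectories continuously, tagged with the policy version seen.
    for _ in range(n_rollouts):
        traj = {"version": policy_version, "reward": random.random()}
        with lock:
            buffer.append(traj)

def learner(n_updates: int) -> None:
    # Consumes whatever is available (dynamic batch) and bumps the version,
    # which in the real system corresponds to syncing fresh weights to actors.
    global policy_version
    for _ in range(n_updates):
        with lock:
            batch = list(buffer)[-8:]
        if batch:
            policy_version += 1
        time.sleep(0.001)

threads = [threading.Thread(target=actor, args=(100,)) for _ in range(4)]
threads.append(threading.Thread(target=learner, args=(20,)))
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(buffer) <= BUFFER_CAP, policy_version > 0)
```

Because neither side waits for the other, slow environments never stall gradient updates, which is the property that makes large-scale online RL over desktop environments tractable.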

Training Methodology

Multi-Stage Training Pipeline

The training process consists of three stages:

  1. Behavior Cloning (BC) Cold Start: The agent is initialized via supervised learning on trajectories collected from multiple LLMs, ensuring a diverse and high-quality base policy.
  2. Step-Level Group Relative Policy Optimization (GRPO): RL is performed using a step-level extension of GRPO, with explicit, verifiable, rule-based rewards assigned to each action based on automated task completion checks.
  3. Entropulse: To counteract entropy collapse and premature convergence, RL alternates with supervised fine-tuning (SFT) on successful rollouts, restoring policy entropy and sustaining exploration (Figure 2).

    Figure 2: ComputerRL training pipeline: BC initialization, step-level GRPO RL, and Entropulse for entropy recovery.
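The group-relative idea behind GRPO can be sketched as follows: rule-based step rewards from a group of rollouts on the same task are summed per rollout, normalized by the group mean and standard deviation, and the normalized value is broadcast back to the steps. This is an illustrative simplification; the paper's step-level credit assignment may differ in detail.

```python
import statistics

def group_relative_advantages(step_rewards: list[list[float]]) -> list[list[float]]:
    """step_rewards[i][t] = rule-based reward of step t in rollout i.

    Returns one advantage per step, normalized against the rollout group.
    """
    returns = [sum(r) for r in step_rewards]      # rollout-level return
    mean = statistics.mean(returns)
    std = statistics.pstdev(returns) or 1.0       # guard against zero variance
    # Broadcast each rollout's normalized return to all of its steps.
    return [[(ret - mean) / std] * len(r)
            for r, ret in zip(step_rewards, returns)]

group = [[1.0, 1.0, 1.0],    # successful rollout: every check passed
         [0.0, 1.0, 0.0],    # partial success
         [0.0, 0.0, 0.0]]    # failed rollout
adv = group_relative_advantages(group)
print([round(a[0], 2) for a in adv])  # → [1.34, -0.27, -1.07]
```

Normalizing within the group means no learned value baseline is needed: rollouts are scored only relative to their peers on the same task, so successful trajectories get positive advantages and failed ones negative.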

Entropulse: Sustaining Exploration

Empirical results show that standard RL quickly leads to entropy collapse, limiting further policy improvement. Entropulse leverages the diversity of successful rollouts from different training stages to construct SFT datasets, periodically fine-tuning the policy to restore entropy and exploration capacity. This alternation enables continued performance gains in extended RL runs (Figure 3).

Figure 3: Training curves show that Entropulse (red) restores entropy and enables further reward improvement compared to reference resetting alone (grey).
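A toy sketch of the alternation schedule described above, where the decay rate per RL step and the entropy bump from SFT are invented for illustration; only the alternation pattern mirrors the paper.

```python
def train_with_entropulse(n_cycles: int, rl_steps: int) -> list[float]:
    """Returns the (toy) entropy trajectory over alternating RL/SFT phases."""
    entropy, history = 1.0, []
    for _ in range(n_cycles):
        # RL phase: the policy sharpens and entropy decays toward collapse.
        for _ in range(rl_steps):
            entropy *= 0.9
            history.append(entropy)
        # Entropulse phase: SFT on accumulated successful rollouts
        # restores entropy, re-enabling exploration in the next RL phase.
        entropy = min(1.0, entropy + 0.5)
        history.append(entropy)
    return history

h = train_with_entropulse(n_cycles=3, rl_steps=10)
print(round(min(h), 3), round(h[-1], 3))  # low point during RL vs. level after SFT
```

The resulting sawtooth shape is the qualitative behavior shown in the training curves: entropy repeatedly dips during RL and recovers after each SFT interlude instead of decaying monotonically to zero.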

Experimental Results

State-of-the-Art Performance on OSWorld

AutoGLM-OS-9B, trained with ComputerRL, achieves a 48.1% success rate on OSWorld, outperforming all prior open and proprietary models, including OpenAI CUA o3 (42.9%), UI-TARS-1.5 (42.5%), and Claude Sonnet 4 (30.7%). The RL phase provides a 64% performance gain over BC alone. The API-GUI paradigm enables the agent to complete tasks with at most one-third the number of steps required by GUI-only baselines, highlighting the efficiency benefits of programmatic control (Figure 4).

Figure 4: Success rates of agents on OSWorld, with ComputerRL-based agents achieving the highest performance.

Qualitative Task Execution

The agent demonstrates robust performance across diverse desktop tasks, including multi-application workflows, system monitoring, document formatting, and image manipulation. Case studies illustrate successful execution of complex, long-horizon tasks requiring coordination across multiple applications and modalities (Figure 5).

Figure 5: AutoGLM-OS executing representative user tasks involving image processing, system monitoring, spreadsheet calculation, and document formatting.

Error Analysis

Failure cases are categorized into visual perception errors, multi-application coordination failures, operational illusions, and other errors. The most frequent errors involve multi-application coordination (34.4%) and vision (25.8%), indicating areas for future improvement in perception and cross-application reasoning.

Ablation and Analysis

Ablation studies confirm the critical contributions of both the API-GUI paradigm and the multi-stage training pipeline. The API-GUI approach yields a 134% improvement over GUI-only baselines, with the largest gains in office and professional domains. The Entropulse phase is essential for maintaining exploration and enabling further RL improvements after initial convergence.

Implications and Future Directions

Practical Impact

ComputerRL establishes a scalable, extensible foundation for training generalist computer use agents capable of robust, efficient operation in real-world desktop environments. The hybrid action space and distributed infrastructure enable rapid adaptation to new applications and workflows, while the training methodology ensures sustained policy improvement.

Theoretical Implications

The results demonstrate that end-to-end RL, when combined with entropy-preserving strategies and hybrid action spaces, can overcome the limitations of BC and GUI-only RL in complex, human-centric environments. The explicit, verifiable reward design and step-level GRPO formulation provide a template for RL in other structured, multi-step domains.

Future Research

Key directions include enhancing multimodal perception for improved visual grounding, developing hierarchical planning for long-horizon autonomy, and establishing robust safety and alignment protocols for agents with broad system access. Expanding the diversity and scale of training data, as well as integrating real-world user feedback, will be critical for achieving universal, adaptive desktop agents.

Conclusion

ComputerRL advances the state of the art in autonomous desktop agents by integrating a hybrid API-GUI action space, scalable distributed RL infrastructure, and entropy-preserving training. The framework achieves strong empirical results on OSWorld, demonstrating both efficiency and generalization. This work provides a robust foundation for future research in persistent, generalist computer use agents and intelligent human-computer interaction.


Explain it Like I'm 14

ComputerRL: Teaching AI to Use a Computer Like a Skilled Assistant

1. What is this paper about?

This paper is about building an AI “computer helper” that can use a desktop computer on its own—opening apps, clicking buttons, typing, editing files, and more—to complete real tasks. The authors introduce a system called ComputerRL that trains these helpers to work better and faster by practicing on thousands of virtual computers at the same time.

2. What questions did the researchers ask?

The team focused on three big questions:

  • How can an AI control a computer efficiently when the screen and mouse are designed for humans, not machines?
  • How can we train the AI at scale so it gets better across many different apps and tasks?
  • How can we keep the AI improving during long training runs, instead of getting stuck doing the same thing over and over?

3. How did they do it?

The researchers built a complete training setup with three key ideas. Here’s the gist, using everyday comparisons:

  • API-GUI: two ways to control the computer
    • GUI actions are like moving a mouse and typing on a keyboard.
    • APIs are like special remote-control commands that apps understand directly.
    • The paper combines both: the AI uses the GUI when needed (like a human) and uses APIs when they’re faster and safer (like shortcuts). They even use AI to help auto-build these APIs for apps by analyzing examples and generating code and tests.
  • A giant practice room made of virtual computers
    • They run thousands of “pretend” Ubuntu desktops (virtual machines) in parallel—like setting up a massive gym full of practice stations.
    • These desktops are organized with efficient tools (Docker, gRPC) so training is stable and fast.
    • A web dashboard lets them monitor what’s happening and keep everything running smoothly.
  • How the AI learns: first copy, then practice with feedback
    • Behavior Cloning (BC): First, the AI learns by copying good examples—like a student watching and imitating experts. They collect many successful task recordings using several strong LLMs, filter for success, and fine-tune the student model.
    • Reinforcement Learning (RL): Next, the AI practices and gets feedback—like a coach giving points for good plays. They use clear, automatic “checkers” to decide if a task was done correctly (verifiable rewards).
    • Step-level GRPO: A training method that assigns credit step by step inside each task. Imagine scoring each move in a game, not just the final win or loss. This helps the AI learn which exact actions helped.
  • Entropulse: keeping the AI curious so it doesn’t get stuck
    • “Entropy” here means how much the AI explores different options. After long RL training, the AI can get too predictable and stop exploring—that’s bad for learning.
    • Entropulse fixes this by switching between RL and short bursts of supervised fine-tuning (SFT) using the AI’s own successful past attempts. It’s like taking a break to review the best plays, which refreshes the AI’s curiosity and helps it try new strategies again.
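To make "entropy" concrete, here is a tiny worked example (an illustration, not from the paper): entropy is high when the AI spreads its choices across many actions and close to zero when it almost always picks the same one.

```python
import math

def entropy(probs: list[float]) -> float:
    """Shannon entropy of an action distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

curious = [0.25, 0.25, 0.25, 0.25]   # tries all four actions equally
stuck   = [0.97, 0.01, 0.01, 0.01]   # almost always the same action
print(round(entropy(curious), 3))    # 1.386 — lots of exploration
print(round(entropy(stuck), 3))      # 0.168 — close to "entropy collapse"
```

When training drives the numbers toward the second distribution, the AI stops discovering new strategies, and that is exactly what Entropulse is designed to reverse.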

4. What did they find, and why is it important?

Main results:

  • New state of the art on OSWorld, a tough benchmark for desktop tasks
    • Their 9B-parameter model (AutoGLM-OS-9B, based on GLM-4-9B-0414) reaches 48.1% success, beating strong systems like OpenAI’s CUA o3 (42.9%).
    • A 14B-parameter version (with Qwen2.5-14B) also performs very well (~45.8%).
  • Big improvements from the API-GUI combo
    • Using both APIs and GUI actions beats GUI-only by a large margin, especially in complex office and professional tasks.
    • The agent often needs only about one-third as many steps to finish a task compared to other approaches—so it’s more efficient.
  • Scaling up training works
    • Their large, stable virtual desktop cluster lets them train faster and more reliably.
    • Entropulse keeps the AI exploring longer, boosting final performance compared to training without it.

Why this matters:

  • It shows we can train general-purpose computer agents that actually work across many apps and workflows.
  • The approach makes desktop automation more practical, accurate, and efficient.

5. What’s the impact and what comes next?

This research pushes AI closer to being a trustworthy digital assistant that can:

  • Help with real office work (documents, spreadsheets, images).
  • Coordinate across multiple apps smoothly.
  • Learn new tools over time.

The authors also point to future needs:

  • Robustness: handling new apps, pop-ups, and unusual cases reliably.
  • Long-horizon autonomy: managing long, multi-step projects from start to finish.
  • Safety and alignment: adding strong permission systems and checks so the agent acts safely when it can access files or sensitive data.

In short, ComputerRL shows a practical path to training AI that can use computers like skilled assistants—faster, safer, and smarter—by mixing better ways to act (API + GUI), bigger and more stable practice setups (many virtual desktops), and smarter training (Entropulse to keep learning going).
