LiteCUA: Computer as MCP Server for Computer-Use Agent on AIOS

Published 24 May 2025 in cs.AI, cs.HC, and cs.OS | (2505.18829v1)

Abstract: We present AIOS 1.0, a novel platform designed to advance computer-use agent (CUA) capabilities through environmental contextualization. While existing approaches primarily focus on building more powerful agent frameworks or enhancing agent models, we identify a fundamental limitation: the semantic disconnect between how LLMs understand the world and how computer interfaces are structured. AIOS 1.0 addresses this challenge by transforming computers into contextual environments that LLMs can natively comprehend, implementing a Model Context Protocol (MCP) server architecture to abstract computer states and actions. This approach effectively decouples interface complexity from decision complexity, enabling agents to reason more effectively about computing environments. To demonstrate our platform's effectiveness, we introduce LiteCUA, a lightweight computer-use agent built on AIOS 1.0 that achieves a 14.66% success rate on the OSWorld benchmark, outperforming several specialized agent frameworks despite its simple architecture. Our results suggest that contextualizing computer environments for LLMs represents a promising direction for developing more capable computer-use agents and advancing toward AI that can interact with digital systems. The source code of LiteCUA is available at https://github.com/agiresearch/LiteCUA, and it is also integrated into the AIOS main branch as part of AIOS at https://github.com/agiresearch/AIOS.


Summary

The paper introduces AIOS 1.0’s innovative MCP server that transforms computer states into semantic contexts for LLM-based agents.
It employs a minimalist orchestrator-worker architecture with a perceive-reason-act cycle to optimize task management in computer-use agents.
Evaluation shows LiteCUA achieving a 14.66% success rate on the OSWorld benchmark, outperforming several specialized agent frameworks despite its simple architecture.


Introduction

The paper introduces AIOS 1.0, a platform designed to help LLM-based computer-use agents (CUAs) interact with computers by contextualizing computing environments. This approach aims to bridge the semantic disconnect between how LLMs understand the world and how computer interfaces are structured. AIOS 1.0 transforms computers into contextual environments that LLMs can comprehend, thus decoupling interface complexity from decision complexity.

Figure 1: Illustration of the differences between how a human operates a computer and how a computer-use agent (CUA) operates a computer, where the CUA requires a specific contextualized environment to understand the original computer environment and take actions.

Architecture and System Design

AIOS 1.0 builds on previous AIOS systems by incorporating a Model Context Protocol (MCP) server architecture. It abstracts computer states and actions into semantic representations that LLMs can effectively utilize. The architecture consists of multiple layers:

Application Layer: Provides APIs for agents to interact with components such as terminals and browsers, creating a unified interaction surface.
Kernel Layer: Enhances the AIOS kernel with tools like a VM Controller and MCP Server, facilitating a sandboxed environment for agent interaction.

This architecture transforms computers into semantic landscapes that align with LLM reasoning, facilitating long-term strategic planning.
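
To make the MCP server idea concrete, here is a minimal sketch of how an agent-side client might request an action from such a server. MCP messages are JSON-RPC 2.0, and tools are invoked through the `tools/call` method; the tool name `click` and its `x`/`y` arguments are assumptions chosen for illustration, not names taken from the AIOS codebase.

```python
import json
from itertools import count

# Monotonically increasing request ids, as JSON-RPC expects one id per request.
_request_ids = count(1)


def make_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 request for MCP's `tools/call` method."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)


# Hypothetical tool name and arguments; a real server advertises its own tools.
message = make_tool_call("click", {"x": 120, "y": 340})
print(message)
```

The agent only needs to know the tool's name and argument schema; how the click is delivered to the sandboxed VM stays behind the server boundary.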

Figure 2: The serving architecture of AIOS 1.0 for computer-use agents, which extends the AIOS 0.x versions.

Contextualizing Computers as MCP Servers

The core innovation of AIOS 1.0 is its ability to transform traditional computing environments into interactive contexts for LLMs:

Environment Perception Framework: A multi-modal sensing approach that enriches the computer's state with semantic information via screenshots and accessibility trees.
Action Space Semantics: Encapsulates essential operations like Click, Scroll, and Drag into higher-level commands compatible with the semantic understanding of LLMs.

By decoupling interface operations from cognitive reasoning, AIOS 1.0 facilitates more refined and efficient task management.
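
The two ideas above can be sketched together: an accessibility tree is flattened into text an LLM can read, and a semantic command ("click the Save button") is resolved into a concrete coordinate-level action. This is a simplified illustration under stated assumptions. The `UINode` structure, the `describe` and `click_on` helpers, and the sample window are all hypothetical and do not reflect the actual AIOS 1.0 data model.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class UINode:
    """Hypothetical accessibility-tree node: role, label, and screen bounds."""
    role: str
    name: str
    bounds: Tuple[int, int, int, int]  # (x, y, width, height)
    children: List["UINode"] = field(default_factory=list)

    def center(self) -> Tuple[int, int]:
        x, y, w, h = self.bounds
        return (x + w // 2, y + h // 2)


def walk(node: UINode, depth: int = 0) -> Iterator[Tuple[int, UINode]]:
    """Depth-first traversal yielding (depth, node) pairs."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


def describe(root: UINode) -> str:
    """Perception: flatten the tree into indented text for an LLM prompt."""
    return "\n".join(
        f"{'  ' * depth}[{n.role}] {n.name} @ {n.center()}" for depth, n in walk(root)
    )


@dataclass(frozen=True)
class Click:
    """Low-level action expressed in screen coordinates."""
    x: int
    y: int


def click_on(root: UINode, name: str) -> Optional[Click]:
    """Action semantics: resolve an element label to a concrete Click."""
    for _, node in walk(root):
        if node.name == name:
            return Click(*node.center())
    return None  # No element with that label is currently visible.


window = UINode("window", "Settings", (0, 0, 800, 600), [
    UINode("button", "Save", (700, 540, 80, 40)),
    UINode("checkbox", "Dark mode", (40, 100, 20, 20)),
])
print(describe(window))
print(click_on(window, "Save"))
```

In this sketch the model reasons over element labels, while coordinate arithmetic is handled by the environment layer, which is one way to read the decoupling of interface complexity from decision complexity.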

Figure 3: Pipeline for agents to take actions to interact with VM environments.

LiteCUA: Implementation on AIOS 1.0

LiteCUA, developed as an application on AIOS 1.0, leverages the contextual environment framework to simplify agent design and enhance performance on the OSWorld benchmark.

Orchestrator-Worker Architecture: An orchestrator plans and assigns tasks, while worker modules handle specialized duties such as perceiving, reasoning, and acting, separating task planning from execution.
Perceive-Reason-Act Cycle: A structured process where the agent perceives its environment, reasons through potential actions, and then acts on these decisions.

This minimalist architecture allows LiteCUA to achieve a 14.66% success rate on the OSWorld benchmark, outperforming several specialized agent frameworks despite its simplicity.
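
The control flow of an orchestrator driving perceive, reason, and act workers can be sketched on a toy environment. This is not LiteCUA's implementation: the environment is a single counter, the reasoner is a rule-based stand-in for an LLM call, and every function name here is an assumption made for illustration.

```python
from typing import Dict, List

State = Dict[str, int]


def perceive(env: State) -> str:
    """Perceiver worker: turn raw state into a textual observation."""
    return f"counter={env['counter']}"


def reason(observation: str, goal: int) -> str:
    """Reasoner worker: rule-based placeholder where an LLM would be queried."""
    current = int(observation.split("=")[1])
    if current == goal:
        return "done"
    return "increment" if current < goal else "decrement"


def act(env: State, action: str) -> None:
    """Actor worker: apply the chosen action to the environment."""
    env["counter"] += {"increment": 1, "decrement": -1}[action]


def orchestrate(env: State, goal: int, max_steps: int = 20) -> List[str]:
    """Orchestrator: run the perceive-reason-act cycle until done or out of steps."""
    trace: List[str] = []
    for _ in range(max_steps):
        action = reason(perceive(env), goal)
        if action == "done":
            break
        act(env, action)
        trace.append(action)
    return trace


env = {"counter": 0}
trace = orchestrate(env, goal=3)
print(trace, env)
```

The step cap matters in practice: a real agent can loop on a misread screen, so the orchestrator bounds the number of cycles rather than trusting the reasoner to terminate.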

Figure 4: Design of LiteCUA. We adopt a simple orchestrator-worker architecture, where the orchestrator is responsible for planning and assigning tasks and the workers handle specialized duties such as perceiving, reasoning, and acting.

Evaluation

In a controlled environment, LiteCUA showed promise, particularly on operating system (OS) tasks, which involve relatively straightforward interactions. It struggled with more complex applications that require intricate interface manipulation. This uneven performance underscores the importance of richer semantic processing in agent systems.

Current research on CUAs emphasizes harnessing multimodal data to interface with complex environments, an approach LiteCUA also follows. Further development of AIOS infrastructure is pivotal to moving from tool-specific systems toward more general-purpose ones.

Conclusion

AIOS 1.0 and LiteCUA set forth a compelling argument for contextualizing computational environments to optimize LLM-based agent performance. Future work will likely explore refining perception frameworks and expanding application domains to fully harness this contextualization strategy. Transforming computer interfaces into environments that LLMs can naturally navigate represents a pivotal shift in advancing general-purpose AI capabilities.