Can computer-use agents perform real professional work?

Determine whether current computer-use agents (graphical user interface agents that operate software via mouse and keyboard) can successfully execute real professional workflows: long-horizon, heterogeneous tasks that span diverse software configured with domain-specific data.

Background

The paper motivates Gym-Anything and CUA-World by noting that existing benchmarks mostly cover short-horizon tasks in a small set of consumer-grade applications, and so do not reflect the long, complex workflows of professional settings. This leaves it unclear whether present-day computer-use agents can handle realistic, economically relevant work.

By constructing thousands of long-horizon tasks with realistic data across 200 software applications, the authors aim to provide a testbed for studying this question empirically, while explicitly acknowledging that the question itself remains unresolved.

References

Yet whether these agents can handle real professional work remains an open question.

Gym-Anything: Turn any Software into an Agent Environment (2604.06126 - Aggarwal et al., 7 Apr 2026) in Introduction (Section 1)