Predictiveness of performance on free alternatives for commercial counterparts

Determine how well computer-use agent performance on the free, sandboxable alternatives included in CUA-World predicts performance on the corresponding proprietary, commercially licensed software that professionals use in the same software categories.

Background

Due to licensing and access constraints, many professionally used applications cannot be sandboxed. The benchmark substitutes the closest free, self-hostable alternatives to maintain economic relevance while enabling open evaluation.

The external validity of this substitution remains uncertain: if performance on free alternatives does not predict performance on commercial counterparts, benchmark results may not generalize to real deployments. The authors explicitly flag this as an open question.
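
One way to make "predicts" concrete, offered only as an illustrative sketch and not as a method from the paper, is a rank correlation over paired success rates: for each software category, one rate measured on the free alternative and one on its commercial counterpart. The Python below computes Spearman's rho using only the standard library. The function names and the numbers in the two lists are hypothetical placeholders, not measurements or reported results.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; raises ValueError if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical placeholder success rates, one pair per software category.
free_success = [0.10, 0.40, 0.25, 0.60]
commercial_success = [0.05, 0.35, 0.30, 0.50]

rho = spearman(free_success, commercial_success)
print(f"Spearman rho = {rho:.2f}")
```

A rho near 1 would mean that categories rank in the same order of difficulty on both kinds of software. It would not show that absolute success rates match, so a real study would likely also compare the rates themselves.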

References

While we specifically select the closest sandboxable alternative for software that cannot be freely sandboxed (e.g., due to licensing), a large fraction of professionally used software remains excluded, and the degree to which performance on free alternatives predicts performance on their commercial counterparts is an open question.

Gym-Anything: Turn any Software into an Agent Environment (2604.06126 - Aggarwal et al., 7 Apr 2026) in Limitations