Capabilities Needed for Deploying Agents in Extended Real-World Workflows

Identify the specific capabilities AI agent systems require to be deployed in real-world settings, whether completing tasks independently or effectively supporting humans across extended workflows.

Background

The paper distinguishes traditional single-step benchmarks from agentic, multi-step evaluations and argues that agent benchmarks better probe deployment-relevant capabilities.

Even so, the authors explicitly acknowledge that it remains an open question which capabilities agents need in order to operate autonomously, or in augmentation roles, across extended real-world workflows.

References

"Agent benchmarks therefore more directly probe the capabilities needed for deploying AI systems that can complete tasks independently or support humans across extended workflows—an important open question for real-world use."

How Well Does Agent Development Reflect Real-World Work?  (2603.01203 - Wang et al., 1 Mar 2026) in Appendix B, Agentic Benchmark Selection