Reliability of language agents in creating real economic value in professional settings
Determine whether contemporary language-model-based agents can reliably create economic value in real-world professional environments, rather than merely performing well on exam-style benchmarks, by consistently delivering expert-level work products that meet domain constraints and professional standards.
References
As traditional benchmarks reach saturation, it remains fundamentally unclear whether today’s agents can reliably create value in economically valued, professional environments.
— $OneMillion-Bench: How Far are Language Agents from Human Experts?
(2603.07980 - Yang et al., 9 Mar 2026) in Section 1, Introduction