Reliability of language agents in creating real economic value in professional settings

Determine whether contemporary language-model-based agents can reliably create economic value in real-world professional environments, beyond performance on exam-style benchmarks, by consistently delivering expert-level work products that meet domain constraints and professional standards.

Background

The paper argues that existing evaluation suites are largely exam-style or highly structured and that many are reaching saturation. This leaves open the question of whether the benchmark performance of agentic systems translates into dependable value creation in real professional workflows.

To address this gap, the authors propose OneMillion-Bench, which emphasizes economically consequential tasks requiring retrieval, evidence-grounded reasoning, and adherence to domain-specific constraints. The open question motivates the benchmark’s goal: measuring whether agent performance yields tangible, reliable professional outputs.

References

As traditional benchmarks reach saturation, it remains fundamentally unclear whether today’s agents can reliably create value in economically valued, professional environments.

$OneMillion-Bench: How Far are Language Agents from Human Experts? (2603.07980 - Yang et al., 9 Mar 2026) in Section 1, Introduction