Autonomy of LLMs for production-scale, specification-driven software construction

Determine whether large language models can autonomously build production-scale software systems from explicit specifications.

Background

The paper introduces SWE-AGI, a benchmark that evaluates whether LLM-based agents can construct complex software systems from authoritative specifications under a fixed scaffold. The benchmark directly targets the gap between demonstrated coding skill and full, end-to-end system implementation from requirements.

By building on MoonBit’s nascent ecosystem and a spec-first workflow, the benchmark minimizes retrieval-based shortcuts and emphasizes long-horizon reasoning and architectural consistency. The abstract opens by framing the field-level uncertainty over whether LLMs can meet this standard of autonomy.

References

Although LLMs have demonstrated impressive coding capabilities, their ability to autonomously build production-scale software from explicit specifications remains an open question.