
MEnvBench: Polyglot Environment Benchmark

Updated 3 February 2026
  • MEnvBench is a polyglot benchmark designed to validate automated software environment construction pipelines across ten languages.
  • It employs a dual-phase methodology combining environment reuse and a planning-execution-verification loop to ensure rigorous reproducibility and test accuracy.
  • The benchmark offers detailed Docker-based artifacts, structured evaluation metrics, and reproducible processes for head-to-head comparisons with established baselines.

MEnvBench is a rigorously constructed, execution-validated, polyglot benchmark for verifiable software engineering environment generation and validation. It was introduced as the core evaluation suite for the MEnvAgent framework, with the explicit objective of enabling large-scale, cross-language benchmarking of automated environment-construction pipelines, especially for LLM agents in software engineering contexts (Guo et al., 30 Jan 2026). MEnvBench consists of 1,000 real-world bug-fix tasks across ten programming languages, emphasizing strict reproducibility, balanced task diversity, and robust verification protocols.

1. Dataset Construction and Composition

MEnvBench comprises 1,000 tasks derived from issues and patches in 200 popular GitHub repositories, spanning Python, Java, Go, JavaScript, TypeScript, Rust, C, C++, PHP, and Ruby (100 tasks per language via 20 repositories × 5 historical instances each). Selection criteria include a minimum of 1,000 stars and 200 forks per repository, at least 60% language purity, and linkage between closed issues and pull requests that provide both a fix patch and new tests. An LLM-based quality filter (requiring a score ≥5/10) ensures high-salience, verifiable instances.
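The selection criteria above can be sketched as a pair of predicates. This is an illustrative reconstruction; the field names (`stars`, `language_purity`, `llm_quality_score`, etc.) are assumptions, not the dataset's actual schema.

```python
# Hypothetical sketch of MEnvBench's repository- and instance-level filters,
# mirroring the thresholds stated in the text. Field names are illustrative.

def eligible_repo(repo: dict) -> bool:
    """Repository-level gates: popularity and language purity."""
    return (
        repo["stars"] >= 1000
        and repo["forks"] >= 200
        and repo["language_purity"] >= 0.60  # fraction of code in the target language
    )

def eligible_instance(inst: dict) -> bool:
    """Instance-level gates: issue-PR linkage and the LLM quality score (>= 5/10)."""
    return (
        inst["issue_closed"]
        and inst["has_fix_patch"]
        and inst["adds_tests"]
        and inst["llm_quality_score"] >= 5
    )
```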

Tasks are stratified along two discrete axes: domain (ten LLM-classified domains, e.g., Machine Learning, Web Applications, Embedded Systems) and project scale (five logarithmic size bands from <10 MB to >500 MB). This design enforces balanced sampling across technology and complexity, mitigating dataset skew. Each task includes the buggy and fixed repository snapshots, a Docker-based environment specification (base image and scripted build process), and a test configuration that applies the fix and invokes relevant test suites. Artifacts are provided in concise JSON schema, versioned alongside source and test variants.
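A per-task manifest of the kind described might look roughly as follows. This is a sketch of the artifact shape only; the released dataset's exact field names, commit identifiers, and commands are placeholders here, not the actual schema.

```python
import json

# Illustrative shape of a per-task JSON manifest, mirroring the artifacts the
# text lists (snapshots, Docker environment spec, test configuration).
# All field names and values below are hypothetical.
task = {
    "repo": "example-org/example-project",   # hypothetical repository
    "language": "Python",
    "domain": "Machine Learning",            # one of ten LLM-classified domains
    "size_band": "10-50 MB",                 # one of five logarithmic size bands
    "buggy_commit": "<sha-of-buggy-snapshot>",
    "fixed_commit": "<sha-of-fixed-snapshot>",
    "environment": {
        "base_image": "python:3.11-slim",
        "build_script": ["pip install -e .", "pip install pytest"],
    },
    "test_config": {
        "apply_patch": "fix.patch",
        "test_command": "pytest tests/test_issue.py",
    },
}

manifest = json.dumps(task, indent=2)  # versioned alongside source and test variants
```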

2. Environment Construction Methodology

The environment construction pipeline comprises two synergistic phases:

A. Environment Reuse Mechanism:

From a maintained pool $\mathcal{S}_{\mathrm{pool}}$ of previously constructed environments, the most similar historical environment $S_{\mathrm{sim}}$ is selected using an adaptation cost metric $\mathcal{C}_{\mathrm{adapt}}(S, R)$. Exact repository matches are prioritized, falling back to the nearest newer snapshot if needed. If direct reuse fails verification, an EnvPatchAgent generates an incremental patch $\Delta\mathcal{P}$ such that the new environment $S_{\mathrm{new}} = \delta(S_{\mathrm{sim}}, \Delta\mathcal{P})$ passes the test protocol for $R_{\mathrm{fix}}$ but fails for $R$. Reuse achieves up to 39% success on Python tasks with ten historical instances.
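The selection step of the reuse mechanism can be sketched as follows, assuming a simple numeric adaptation-cost function passed in as `adapt_cost`; the real $\mathcal{C}_{\mathrm{adapt}}$ is internal to MEnvAgent and not specified here.

```python
# Minimal sketch of the environment-reuse selection step. Pool entries and the
# adapt_cost callable are illustrative assumptions, not MEnvAgent's actual API.

def select_similar(pool, repo, adapt_cost):
    """Pick the pooled environment minimizing C_adapt(S, R), preferring exact
    repository matches, then falling back to newer snapshots of any repo."""
    exact = [s for s in pool if s["repo"] == repo["name"]]
    candidates = exact or [s for s in pool if s["timestamp"] >= repo["timestamp"]]
    if not candidates:
        return None  # nothing reusable; fall through to full construction
    return min(candidates, key=lambda s: adapt_cost(s, repo))
```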

B. Planning-Execution-Verification (PEV) Loop:

  • Planning: A Repository Analysis Agent extracts project metadata, build files, and project type. The Environment Setup Agent selects a base Docker image $B$ and generates an installation script $\mathcal{P}$. The Test Configuration Agent aligns test commands with installed binaries to yield $T$.
  • Execution: The Environment Execution Agent builds and launches $B$, executes $\mathcal{P}$, monitors process outputs, and adapts commands responsively.
  • Verification: The Verification Agent runs the test configuration $T$ against both the buggy snapshot $R$ and the fixed snapshot $R_{\mathrm{fix}}$ in the built environment $S$. Strict validation requires test failures on $R$ and passes on $R_{\mathrm{fix}}$. Error diagnoses feed back to the planning stage for correction.
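The loop over these three stages can be sketched schematically. Here `plan`, `execute`, and `verify` stand in for the respective agents; their signatures and the `max_rounds` cap are assumptions for illustration, not MEnvAgent's actual interface.

```python
# Schematic of the planning-execution-verification (PEV) loop described above.
# The three callables are stand-ins for the agents; internals are unspecified.

def pev_loop(task, plan, execute, verify, max_rounds=5):
    """Iterate until the built environment fails on the buggy snapshot R and
    passes on the fixed snapshot R_fix, feeding diagnoses back into planning."""
    feedback = None
    for _ in range(max_rounds):
        base_image, build_script, test_config = plan(task, feedback)
        env = execute(base_image, build_script)
        buggy_fails, fixed_passes, diagnosis = verify(env, task, test_config)
        if buggy_fails and fixed_passes:   # strict validity achieved
            return env
        feedback = diagnosis               # error report drives the next planning round
    return None                            # gave up after max_rounds
```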

This architecture enables robust, iterative refinement with strict enforcement of verifiability properties.

3. Evaluation Metrics and Protocol

Three principal metrics quantify agent and pipeline efficacy on MEnvBench:

  • Pass Rate (PASS):

$$\mathrm{PASS} = \frac{\#\{\,\text{tasks with } \varepsilon(R_{\mathrm{fix}}, S, T) = 0\,\}}{1000} \times 100\%$$

Measures the fraction of tasks where the fixed version passes all tests.

  • Fail-to-Pass Rate (F2P):

$$\mathrm{F2P} = \frac{|\{\text{tasks: } \varepsilon(R, S, T) = 1 \ \land\ \varepsilon(R_{\mathrm{fix}}, S, T) = 0\}|}{|\{\text{tasks: } \varepsilon(R, S, T) = 1\}|}$$

Captures strict validity: only tasks that fail on the buggy version and pass on the fixed version in the same environment count.

  • Average Time Cost (TIME):

Mean wall-clock seconds per environment-construction and validation; efficiency improvements are reported as relative reduction versus a baseline.
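The PASS and F2P definitions above can be computed directly from per-task outcomes. In this sketch, each result records `eps_buggy` $= \varepsilon(R, S, T)$ and `eps_fixed` $= \varepsilon(R_{\mathrm{fix}}, S, T)$, where $\varepsilon = 0$ means the test configuration passes; the record shape is an assumption.

```python
# Computing PASS and F2P from per-task outcome records, per the formulas above.
# eps_* fields follow the convention eps == 0 <=> tests pass (illustrative names).

def pass_rate(results, total=1000):
    """PASS: percentage of all benchmark tasks where the fixed snapshot passes."""
    return 100.0 * sum(1 for r in results if r["eps_fixed"] == 0) / total

def f2p_rate(results):
    """F2P: among tasks failing on the buggy snapshot, the fraction that pass on the fix."""
    failing_on_buggy = [r for r in results if r["eps_buggy"] == 1]
    if not failing_on_buggy:
        return 0.0
    strict = [r for r in failing_on_buggy if r["eps_fixed"] == 0]
    return len(strict) / len(failing_on_buggy)
```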

Table: Summary of Average Results on MEnvBench (all languages)

Method         F2P (%)       PASS (%)        TIME (s)
SWE-Factory    29.8          38.0            6266
MEnvAgent      38.4 (+8.6)   49.0 (+11.0)    3574 (−43%)

F2P and time reductions hold across all ten languages, with maximal time savings of 74% in TypeScript and substantial F2P increases (up to +22 percentage points) in Python.

4. Data Products and Artifacts

For every task in MEnvBench, the following artifacts are provided:

  • Docker-based environment definitions, specifying base image and build scripts.
  • Reproducible build process scripts ($\mathcal{P}$), as ordered shell command lists.
  • Test configuration scripts ($T$) for bug/fix discrimination.
  • JSON manifest with repository metadata, build/test scripts, and pointers to buggy/fix versions.

Artifacts are whitespace-delimited or JSON-formatted and directly consumable by environment management tooling. Full datasets (1,000 tasks and all supporting scripts) are released under a permissive open-source license, together with practical automation for Kubernetes-based orchestration and dependency caching.

5. Results, Comparative Baselines, and Reproducibility

MEnvBench underpins the evaluation of MEnvAgent—comprising both the Environment Reuse Mechanism and the PEV pipeline—against three established baselines: Repo2Run, SWE-Bench-Live, and SWE-Factory. Tests are conducted using both open-source (Kimi-K2) and proprietary (Gemini-3-Flash) LLM backbones. MEnvAgent yields an 8.6 percentage point F2P increase and 43% mean time reduction over baselines. Per-language data indicate improvements in both strict validity and efficiency across diverse language ecosystems.

All code, Docker manifests, evaluation metrics, and the entire benchmark are published for reproducibility (https://github.com/ernie-research/MEnvAgent), supporting direct pipeline reuse and extension.

6. Significance and Impact

MEnvBench establishes a high-precision, multi-language, execution-validated reference point for the study of autonomous environment construction and verifiable software engineering at scale. By constraining benchmark design to real-world issues, tightly filtered patches, and reproducibility-driven Docker environments, it enables head-to-head evaluation of agentic and heuristic pipelines with strict pass/fail semantics. The inclusion of rigorous F2P and efficiency metrics supports empirical claims of both validity and practical deployability. The benchmark’s artifacts and protocols have catalyzed the release of MEnvData-SWE, the largest open polyglot verifiable environment dataset to date, further advancing reproducibility and comparability in the field (Guo et al., 30 Jan 2026).

7. Limitations and Scope

MEnvBench’s composition—ten languages, 1,000 tasks—prioritizes breadth and diversity. However, the selection logic enforces a fixed 20-repository, five-snapshot quota per language, which may introduce sampling artifacts with respect to underrepresented ecosystems or edge-case project types. Strict verifiability is gated by patch/test linkage quality and the effectiveness of LLM-based filtering (score ≥5/10). Computational reproducibility presupposes a Docker-capable Linux host, and time/cost statistics pertain to this reference architecture. Environment reuse yields highest efficiency in languages/repositories with deep version histories or similar build systems; its generality to niche stacks is plausible but unproven within this benchmark.

MEnvBench is positioned as a foundational resource for the experimental study of automated environment construction, continuous validation, and scalable agentic software engineering across language boundaries.
