MEnvBench: Polyglot Environment Benchmark
- MEnvBench is a polyglot benchmark designed to validate automated software environment construction pipelines across ten languages.
- It employs a dual-phase methodology combining environment reuse and a planning-execution-verification loop to ensure rigorous reproducibility and test accuracy.
- The benchmark offers detailed Docker-based artifacts, structured evaluation metrics, and reproducible processes for head-to-head comparisons with established baselines.
MEnvBench is a rigorously constructed, execution-validated, polyglot benchmark for verifiable software engineering environment generation and validation. It was introduced as the core evaluation suite for the MEnvAgent framework, with the explicit objective of enabling large-scale, cross-language benchmarking of automated environment-construction pipelines, especially for LLM agents in software engineering contexts (Guo et al., 30 Jan 2026). MEnvBench consists of 1,000 real-world bug-fix tasks across ten programming languages, emphasizing strict reproducibility, balanced task diversity, and robust verification protocols.
1. Dataset Construction and Composition
MEnvBench comprises 1,000 tasks derived from issues and patches in 200 popular GitHub repositories, spanning Python, Java, Go, JavaScript, TypeScript, Rust, C, C++, PHP, and Ruby (100 tasks per language via 20 repositories × 5 historical instances each). Selection criteria include a minimum of 1,000 stars and 200 forks per repository, at least 60% language purity, and linkage between closed issues and pull requests that provide both a fix patch and new tests. An LLM-based quality filter (requiring a score ≥5/10) ensures high-salience, verifiable instances.
Tasks are stratified along two discrete axes: domain (ten LLM-classified domains, e.g., Machine Learning, Web Applications, Embedded Systems) and project scale (five logarithmic size bands from <10 MB to >500 MB). This design enforces balanced sampling across technology and complexity, mitigating dataset skew. Each task includes the buggy and fixed repository snapshots, a Docker-based environment specification (base image and scripted build process), and a test configuration that applies the fix and invokes relevant test suites. Artifacts are provided in concise JSON schema, versioned alongside source and test variants.
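The per-task JSON artifacts described above can be pictured as a small manifest plus a validation pass; the field names below are illustrative stand-ins, not the benchmark's actual schema.

```python
import json

# Hypothetical manifest for a single MEnvBench task. The exact field
# names and values are illustrative, not the benchmark's real schema.
manifest = {
    "task_id": "python__example-repo-0001",
    "language": "Python",
    "domain": "Machine Learning",
    "size_band": "<10 MB",
    "buggy_commit": "abc123",
    "fixed_commit": "def456",
    "base_image": "python:3.11-slim",
    "build_script": ["pip install -e .", "pip install pytest"],
    "test_command": "pytest tests/",
}

def validate_manifest(m: dict) -> bool:
    """Check that a task manifest carries every field the pipeline needs."""
    required = {"task_id", "language", "buggy_commit", "fixed_commit",
                "base_image", "build_script", "test_command"}
    return required.issubset(m)

print(validate_manifest(manifest))
print(json.dumps(manifest, indent=2))
```

A consumer would reject any task entry failing this check before attempting an environment build.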
2. Environment Construction Methodology
The environment construction pipeline comprises two synergistic phases:
A. Environment Reuse Mechanism:
From a maintained pool of previously constructed environments, the most similar historical environment is selected using an adaptation cost metric. Exact repository matches are prioritized, falling back to the nearest newer snapshot if needed. If direct reuse fails verification, an EnvPatchAgent generates an incremental patch such that the adapted environment passes the test protocol on the fixed snapshot while still failing on the buggy one. Reuse achieves up to 39% success on Python tasks with ten historical instances.
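The reuse selection step can be sketched as a minimum-cost search over the environment pool. The adaptation-cost function below is a simple illustrative heuristic (exact-repository matches beat cross-repository reuse; fewer missing dependencies mean a cheaper patch), not the paper's actual metric, and the data structures are assumptions.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class PooledEnv:
    """A previously constructed environment in the reuse pool (hypothetical)."""
    repo: str
    snapshot: date
    deps: frozenset  # names of dependencies already installed

def adaptation_cost(env: PooledEnv, repo: str, needed: frozenset) -> float:
    """Lower is better: prioritize exact repo matches, then fewer gaps."""
    repo_penalty = 0.0 if env.repo == repo else 10.0
    missing = len(needed - env.deps)
    return repo_penalty + missing

def select_env(pool, repo, needed):
    """Pick the cheapest-to-adapt historical environment, or None if empty."""
    return min(pool, key=lambda e: adaptation_cost(e, repo, needed)) if pool else None

pool = [
    PooledEnv("acme/lib", date(2024, 1, 1), frozenset({"numpy"})),
    PooledEnv("acme/lib", date(2024, 6, 1), frozenset({"numpy", "pytest"})),
    PooledEnv("other/tool", date(2024, 6, 1), frozenset({"numpy", "pytest"})),
]
best = select_env(pool, "acme/lib", frozenset({"numpy", "pytest"}))
print(best.snapshot)  # the newer exact-repo match with no missing deps wins
```

If the selected environment then fails verification, the incremental-patch path (EnvPatchAgent) would take over rather than rebuilding from scratch.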
B. Planning-Execution-Verification (PEV) Loop:
- Planning: A Repository Analysis Agent extracts project metadata, build files, and project type. The Environment Setup Agent selects a base Docker image and generates an installation script. The Test Configuration Agent aligns test commands with installed binaries to yield an executable test configuration.
- Execution: The Environment Execution Agent builds and launches the environment, runs the installation script, monitors process outputs, and adapts commands responsively.
- Verification: The Verification Agent runs the test configuration against both the buggy and fixed repository snapshots inside the built environment. Strict validation requires that tests fail on the buggy snapshot and pass on the fixed one. Error diagnoses feed back to the planning stage for correction.
This architecture enables robust, iterative refinement with strict enforcement of verifiability properties.
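The verification step above reduces to a strict fail-to-pass check over the two snapshots. A minimal sketch, where `run_tests` is a stand-in for actually executing the test command inside the built container:

```python
# Strict fail-to-pass validation: an environment is accepted only if the
# test configuration fails on the buggy snapshot AND passes on the fixed
# one. run_tests is a hypothetical callable standing in for container
# execution of the task's test command.

def verify(run_tests, buggy_snapshot: str, fixed_snapshot: str) -> bool:
    fails_on_buggy = not run_tests(buggy_snapshot)
    passes_on_fixed = run_tests(fixed_snapshot)
    return fails_on_buggy and passes_on_fixed

# Toy stand-in: only the fixed snapshot passes its tests.
results = {"buggy": False, "fixed": True}
print(verify(lambda snap: results[snap], "buggy", "fixed"))  # True
```

Note that an environment in which tests pass on both snapshots is rejected: such tests cannot discriminate the bug from the fix, which is exactly what the F2P metric later penalizes.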
3. Evaluation Metrics and Protocol
Three principal metrics quantify agent and pipeline efficacy on MEnvBench:
- Pass Rate (PASS):
Measures the fraction of tasks where the fixed version passes all tests.
- Fail-to-Pass Rate (F2P):
Captures strict validity: the fraction of tasks whose tests fail on the buggy version and pass on the fixed version within the same environment.
- Average Time Cost (TIME):
Mean wall-clock seconds per environment-construction and validation; efficiency improvements are reported as relative reduction versus a baseline.
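Given per-task outcome records, the three metrics can be computed as straightforward averages; the record field names below are illustrative assumptions, not the benchmark's schema.

```python
# Computing MEnvBench's three headline metrics from per-task outcomes.
# Each record notes whether tests failed on the buggy snapshot, passed
# on the fixed snapshot, and the wall-clock seconds spent on
# environment construction and validation. Field names are hypothetical.

def summarize(tasks):
    n = len(tasks)
    pass_rate = sum(t["fixed_pass"] for t in tasks) / n           # PASS
    f2p_rate = sum(t["fixed_pass"] and t["buggy_fail"]
                   for t in tasks) / n                            # F2P
    avg_time = sum(t["seconds"] for t in tasks) / n               # TIME
    return pass_rate, f2p_rate, avg_time

tasks = [
    {"buggy_fail": True,  "fixed_pass": True,  "seconds": 3000},  # strict F2P
    {"buggy_fail": False, "fixed_pass": True,  "seconds": 4000},  # PASS only
    {"buggy_fail": True,  "fixed_pass": False, "seconds": 5000},  # neither
]
p, f2p, t = summarize(tasks)
print(p, f2p, t)  # PASS = 2/3, F2P = 1/3, TIME = 4000.0
```

By construction F2P ≤ PASS, since every fail-to-pass task also passes on the fixed version; the gap between the two (here 1/3) measures tasks whose tests do not actually discriminate the bug.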
Table: Summary of Average Results on MEnvBench (all languages)
| Method | F2P (%) | PASS (%) | TIME (s) |
|---|---|---|---|
| SWE-Factory | 29.8 | 38.0 | 6266 |
| MEnvAgent | 38.4 (+8.6) | 49.0 (+11.0) | 3574 (-43%) |
F2P and time reductions hold across all ten languages, with maximal time savings of 74% in TypeScript and substantial F2P increases (up to +22 percentage points) in Python.
4. Data Products and Artifacts
For every task in MEnvBench, the following artifacts are provided:
- Docker-based environment definitions, specifying base image and build scripts.
- Reproducible build-process scripts, expressed as ordered shell command lists.
- Test configuration scripts for bug/fix discrimination.
- JSON manifest with repository metadata, build/test scripts, and pointers to buggy/fix versions.
Artifacts are whitespace-delimited or JSON-formatted and directly consumable by environment management tooling. Full datasets (1,000 tasks and all supporting scripts) are released under a permissive open-source license, together with practical automation for Kubernetes-based orchestration and dependency caching.
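One way such tooling might consume the artifacts is to materialize a task's base image and ordered build commands into a Dockerfile; the manifest fields here are illustrative assumptions, not the benchmark's exact schema.

```python
# Sketch: turn a task's Docker-based environment definition (base image
# plus ordered shell build commands) into a Dockerfile for tooling to
# build. Field values are hypothetical examples.

def to_dockerfile(base_image: str, build_script: list[str]) -> str:
    """Render a base image and ordered build commands as a Dockerfile."""
    lines = [f"FROM {base_image}"]
    lines += [f"RUN {cmd}" for cmd in build_script]
    return "\n".join(lines) + "\n"

dockerfile = to_dockerfile(
    "python:3.11-slim",
    ["pip install -e .", "pip install pytest"],
)
print(dockerfile)
```

The rendered file could then be passed to `docker build` (or a Kubernetes-based orchestrator, as the release's automation does) to reproduce the task environment.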
5. Results, Comparative Baselines, and Reproducibility
MEnvBench underpins the evaluation of MEnvAgent—comprising both the Environment Reuse Mechanism and the PEV pipeline—against three established baselines: Repo2Run, SWE-Bench-Live, and SWE-Factory. Tests are conducted using both open-source (Kimi-K2) and proprietary (Gemini-3-Flash) LLM backbones. MEnvAgent yields an 8.6 percentage point F2P increase and a 43% mean time reduction over SWE-Factory. Per-language data indicate improvements in both strict validity and efficiency across diverse language ecosystems.
All code, Docker manifests, evaluation metrics, and the entire benchmark are published for reproducibility (https://github.com/ernie-research/MEnvAgent), supporting direct pipeline reuse and extension.
6. Significance and Impact
MEnvBench establishes a high-precision, multi-language, execution-validated reference point for the study of autonomous environment construction and verifiable software engineering at scale. By constraining benchmark design to real-world issues, tightly filtered patches, and reproducibility-driven Docker environments, it enables head-to-head evaluation of agentic and heuristic pipelines with strict pass/fail semantics. The inclusion of rigorous F2P and efficiency metrics supports empirical claims of both validity and practical deployability. The benchmark’s artifacts and protocols have catalyzed the release of MEnvData-SWE, the largest open polyglot verifiable environment dataset to date, further advancing reproducibility and comparability in the field (Guo et al., 30 Jan 2026).
7. Limitations and Scope
MEnvBench’s composition—ten languages, 1,000 tasks—prioritizes breadth and diversity. However, the selection logic enforces a fixed 20-repository, five-snapshot quota per language, which may introduce sampling artifacts with respect to underrepresented ecosystems or edge-case project types. Strict verifiability is gated by patch/test linkage quality and the effectiveness of LLM-based filtering (score ≥5/10). Computational reproducibility presupposes a Docker-capable Linux host, and time/cost statistics pertain to this reference architecture. Environment reuse yields highest efficiency in languages/repositories with deep version histories or similar build systems; its generality to niche stacks is plausible but unproven within this benchmark.
MEnvBench is positioned as a foundational resource for the experimental study of automated environment construction, continuous validation, and scalable agentic software engineering across language boundaries.