MLE-bench Evaluation Suite
- MLE-bench Evaluation Suite is a benchmarking framework that systematically assesses the machine learning engineering capabilities of autonomous agents using a curated set of Kaggle competitions.
- It enforces reproducible protocols with defined task complexities, tiered medal-based metrics, and strict hardware and runtime restrictions to ensure fair comparisons.
- The framework integrates operator scaffolding, detailed evaluation practices, and advanced search policies such as Monte Carlo Tree Search and evolutionary search to drive automated ML engineering progress.
MLE-bench Evaluation Suite establishes a rigorous and large-scale benchmarking framework to systematically assess the ML engineering capabilities of autonomous agents, particularly those powered by LLMs. Drawing from real-world Kaggle competitions, MLE-bench evaluates agents across diverse ML tasks, imposing reproducible protocols, hardware constraints, and human-relevant metrics. Its design integrates competitive leaderboard-based success criteria, robust operator/scaffold abstractions for agent code generation, and detailed evaluation practices, positioning it as a critical testbed for progress in automated ML engineering (Chan et al., 2024).
1. Benchmark Construction and Task Design
MLE-bench comprises a curated corpus of 75 completed Kaggle competitions spanning a spectrum of ML engineering challenges, with an additional 7 held out for development purposes (Chan et al., 2024). The suite stratifies tasks by:
- Problem Category: Image classification, NLP, time-series forecasting, tabular regression, segmentation, signal processing, and multimodal learning.
- Complexity Level:
- Low (22/75): Solvable in <2 hours by an expert (excluding model training).
- Medium (38/75): 2–10 hours.
- High (15/75): >10 hours.
Each competition is framed as a supervised learning problem. Formally, for task $T_i$ the agent is given an input space $\mathcal{X}_i$, an output space $\mathcal{Y}_i$, and a dataset split $D_i = (D_i^{\mathrm{train}}, D_i^{\mathrm{test}})$. The task's performance metric $m_i$ is chosen to match the original Kaggle problem setting and normalized to $[0, 1]$ where possible (Toledo et al., 3 Jul 2025).
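As a toy illustration of per-competition metric normalization (this is not the benchmark's grading code, and the leaderboard extremes are hypothetical), a bounded metric can pass through unchanged while an unbounded error metric is rescaled:

```python
# Toy illustration of per-competition metric normalization (not the benchmark's
# grading code): bounded metrics pass through, unbounded error metrics are
# rescaled against hypothetical leaderboard extremes.
def normalize_score(raw: float, metric: str, lb_worst: float, lb_best: float) -> float:
    """Map a raw competition metric onto [0, 1], higher = better."""
    if metric in {"auc", "accuracy"}:                 # already bounded in [0, 1]
        return raw
    span = lb_worst - lb_best                         # e.g. RMSE: lower raw is better
    return max(0.0, min(1.0, (lb_worst - raw) / span)) if span > 0 else 0.0

print(normalize_score(0.35, "rmse", lb_worst=1.2, lb_best=0.1))   # ≈ 0.77
```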
Agents interact with these tasks through code execution and experiment automation: reading raw files in various formats, preprocessing data, designing models, tuning hyperparameters, and handling long-running scripts with robust error management.
2. Evaluation Metrics and Success Criteria
MLE-bench evaluation adopts a medal-based system mirroring Kaggle leaderboards. For each competition, bronze, silver, and gold thresholds $(\tau_{\mathrm{bronze}}, \tau_{\mathrm{silver}}, \tau_{\mathrm{gold}})$ are established from private-leaderboard ranks using a tiered lookup scheme (Chan et al., 2024). The core performance indicator is the "any-medal" rate: a run $r$ on task $i$ medals if

$$\mathrm{medal}_i(r) = \mathbb{1}\!\left[\, m_i(r) \ge \tau_{\mathrm{bronze},i} \,\right],$$

where $m_i(r)$ denotes the metric value on $D_i^{\mathrm{test}}$ (Toledo et al., 3 Jul 2025).

Multiple independent runs per task (seeds) are performed to estimate the probability of medalling. The task-level success rate over $S$ seeds is

$$\hat{p}_i = \frac{1}{S}\sum_{s=1}^{S} \mathrm{medal}_i(r_{i,s}),$$

and the aggregate success rate over $N$ tasks is

$$\bar{p} = \frac{1}{N}\sum_{i=1}^{N} \hat{p}_i.$$

To quantify agent robustness, the pass@k metric is employed:

$$\mathrm{pass@}k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}},$$

where $c$ is the count of medalling runs out of $n$ total runs (Chan et al., 2024). Confidence intervals are computed via stratified bootstrapping over tasks and seeds (Toledo et al., 3 Jul 2025).
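The medal statistics above admit a compact implementation. The sketch below uses hypothetical run data and the standard unbiased pass@k estimator; the helper names are illustrative:

```python
# Minimal sketch of the medal statistics above, on hypothetical run data.
from math import comb
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k runs drawn from n (c of which medalled) medals."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def aggregate_medal_rate(medals_per_task: list[list[bool]]) -> float:
    """Mean over tasks of the per-task (per-seed) medal frequency."""
    return float(np.mean([np.mean(runs) for runs in medals_per_task]))

runs = [[True, False, False, False], [False] * 4, [True, True, False, False]]
print(aggregate_medal_rate(runs))   # (0.25 + 0.0 + 0.5) / 3 = 0.25
print(pass_at_k(n=4, c=1, k=2))     # 1 - C(3,2)/C(4,2) = 0.5
```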
3. Agent Scaffolding, Search Policies, and Operator Design
MLE-bench evaluates agent architectures that automate iterative ML solution development by formalizing them as search policies over the solution space. Each candidate solution ("artifact") is represented as a node in a directed search graph $G = (V, E)$, with edges corresponding to operator-induced transformations (Toledo et al., 3 Jul 2025). The framework is parametrized by the following components (a minimal code sketch follows the list):
- $f$: Fitness function (validation-set performance, 5-fold cross-validation),
- $\pi_{\mathrm{sel}}$: Node selection policy,
- $\mathcal{O}$: Operator set (e.g., Draft, Debug, Improve, Memory, Crossover),
- $\pi_{\mathrm{op}}$: Operator selection policy,
- $h$: Termination criterion (time/node budget).
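A minimal sketch of this parametrization follows; the class names, operator signatures, and the flat node list are illustrative assumptions rather than the actual agent codebase:

```python
# Minimal sketch of the (f, π_sel, O, π_op, h) parametrization described above.
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class Node:
    code: str                        # candidate solution artifact
    fitness: float = 0.0             # f: e.g. 5-fold CV score on validation data
    parent: "Node | None" = None

@dataclass
class SearchPolicy:
    fitness_fn: Callable[[Node], float]            # f
    select_node: Callable[[list[Node]], Node]      # π_sel: which node to expand
    operators: dict[str, Callable[[Node], Node]]   # O: Draft / Debug / Improve / ...
    select_op: Callable[[Node], str]               # π_op: which operator to apply
    budget_seconds: float                          # h: time budget

    def run(self, root: Node) -> Node:
        nodes, start = [root], time.time()
        while time.time() - start < self.budget_seconds:
            parent = self.select_node(nodes)               # pick a node to extend
            child = self.operators[self.select_op(parent)](parent)
            child.fitness = self.fitness_fn(child)
            child.parent = parent
            nodes.append(child)
        return max(nodes, key=lambda n: n.fitness)         # best artifact found
```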
Three principal search policies are instantiated:
- Greedy (AIDE): Always selects the highest-fitness node, applies Draft until initial drafts exist, then Improve, falling back to Debug.
- Monte Carlo Tree Search (MCTS): UCT-guided selection,
$$\mathrm{UCT}(v) = \bar{f}(v) + c\,\sqrt{\frac{\ln N(\mathrm{parent}(v))}{N(v)}},$$
with leaf nodes evaluated by $f$, and value estimation and backpropagation following standard MCTS conventions (see the selection sketch after this list).
- Evolutionary Search: Fitness-proportional parent selection, with reproduction via Improve or Crossover, and debugging applied as needed; offspring replace lowest-fitness individuals.
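The UCT rule referenced in the MCTS policy can be illustrated as follows; the exploration constant, visit counters, and dict-based node representation are assumptions, not the benchmark's exact code:

```python
# Illustrative UCT scoring matching the formula above.
import math

def uct_score(mean_fitness: float, visits: int, parent_visits: int, c: float = 1.41) -> float:
    """Exploitation (mean fitness) plus UCB exploration bonus."""
    if visits == 0:
        return float("inf")   # always try unvisited children first
    return mean_fitness + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children: list[dict], parent_visits: int) -> dict:
    """Pick the child maximizing the UCT score."""
    return max(children, key=lambda ch: uct_score(ch["mean_fitness"], ch["visits"], parent_visits))

children = [{"mean_fitness": 0.62, "visits": 5}, {"mean_fitness": 0.55, "visits": 1}]
print(select_child(children, parent_visits=6))   # the rarely visited child wins here
```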
Operator sets are critical: O_AIDE (baseline) and O_AIRA (enhanced) are compared. Notable O_AIRA features include dynamic prompt complexity cues, scoped memory, and "think tokens" for structured reasoning (Toledo et al., 3 Jul 2025).
4. Experimental Protocol and Infrastructure
MLE-bench enforces strict reproducibility and compute constraints:
- Dataset access: Only train and validation data are available during search. The test set is used solely for final evaluation.
- Search execution: Agent code is run in isolated Apptainer (OCI) containers with access to a full ML-stack superimage. Each run is limited to 24 hours of wall time per task, and individual code snippets are capped at 4 hours of runtime (a launch sketch follows this list).
- Hardware per sandbox: 1×H200 GPU, 24 CPUs, 100 GB RAM, 1 TB local storage (Toledo et al., 3 Jul 2025).
- LLM access: Self-hosted or rate-limited API services.
- Result reporting: Main analyses are based on ≥10 seeds per task (20 preferred) with stratified bootstrap CIs documented.
- Frequent checkpointing and infrastructure logs enable fault tolerance (mean time to failure ≈1000 h).
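As referenced in the search-execution bullet above, a single run could be launched and time-capped roughly as follows; the container image name, entry script, and argument names are illustrative, not the benchmark's actual launch configuration:

```python
# Hedged sketch of launching one sandboxed evaluation run with a wall-time cap.
import subprocess

WALL_CLOCK_LIMIT_S = 24 * 3600   # 24 h wall time per task
# (Per-snippet 4 h limits would be enforced inside the agent's own execution loop.)

def run_agent(task_id: str, seed: int) -> int:
    cmd = [
        "apptainer", "exec", "--nv",      # --nv exposes the GPU to the container
        "ml-superimage.sif",              # hypothetical ML-stack superimage
        "python", "run_agent.py",         # hypothetical agent entry point
        "--task", task_id, "--seed", str(seed),
    ]
    try:
        return subprocess.run(cmd, timeout=WALL_CLOCK_LIMIT_S).returncode
    except subprocess.TimeoutExpired:
        return -1                         # treat wall-clock overruns as failed runs
```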
The benchmark supports multiple agent scaffolds (AIDE, MLAB, OpenHands) and LLMs (DeepSeek R1, OpenAI o3, OpenAI o1-preview, GPT-4o, Claude-3.5, Llama-3.1-405B). Agents receive a standardized ~700-token system prompt with competition meta-data and all required resource paths (Chan et al., 2024).
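A hypothetical assembly of such a system prompt is sketched below; the field names and template text are placeholders rather than the benchmark's actual ~700-token prompt:

```python
# Hypothetical assembly of a standardized system prompt from competition metadata.
def build_system_prompt(meta: dict) -> str:
    return (
        f"You are competing in the Kaggle competition '{meta['name']}'.\n"
        f"Task description: {meta['description']}\n"
        f"Evaluation metric: {meta['metric']}\n"
        f"Data directory: {meta['data_dir']}\n"
        f"Write predictions to: {meta['submission_path']}\n"
        "You may execute code, train models, and iterate until the time budget expires."
    )

prompt = build_system_prompt({
    "name": "example-competition",
    "description": "Predict the target column from tabular features.",
    "metric": "AUC",
    "data_dir": "/home/data",
    "submission_path": "/home/submission/submission.csv",
})
```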
5. Main Results, Analyses, and Insights
Performance on MLE-bench is summarized by the "any-medal" rate. Notable results include:
- AIDE + o1-preview: 16.9% ± 1.1 points (pass@1)
- AIDE + GPT-4o: 8.7% ± 0.5 points (pass@1)
- Performance can double with pass@k: o1-preview reaches ~34% for pass@6.
- Scaling runtime to 100 hours confers incremental gains but rapidly plateaus (Chan et al., 2024).
Key observations:
- Operator Bottleneck: With baseline operators (O_AIDE), search policy choice is largely ineffectual—operator expressivity is the limiting factor. Enhanced operators (O_AIRA) with strong search strategies yield substantial performance improvements, raising the medalling rate from 39.6% to 47.7% on MLE-bench lite (Toledo et al., 3 Jul 2025).
- Generalization Gap: Agents frequently overfit to validation metrics. Oracle selection on the test metric exposes a 9–13% gap; selecting the top few nodes by validation score and reporting the best of them recovers most of this gap.
- Variance: Large numbers of seeds are essential to avoid misleading rankings; fewer than 5 seeds per task can yield unstable results (see the stratified-bootstrap sketch after this list).
- Time-Dependence: Agent rankings evolve over the 24-hour window; non-greedy policies converge and occasionally surpass greedy strategies only after 10–19 hours (Toledo et al., 3 Jul 2025).
- Hardware Scaling: No clear advantage is observed for additional GPU resources, as CPU-only settings achieve comparable medal rates.
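The stratified bootstrap referenced in the variance bullet (and in the result-reporting protocol above) can be sketched as follows; the resampling scheme shown is a plausible reading of the protocol, not the exact evaluation harness:

```python
# Stratified bootstrap over tasks and seeds for the aggregate medal rate.
import numpy as np

def stratified_bootstrap_ci(medals_per_task, n_boot=10_000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    tasks = [np.asarray(runs, dtype=float) for runs in medals_per_task]
    stats = []
    for _ in range(n_boot):
        # Resample tasks with replacement, then resample seeds within each task.
        sampled = [tasks[i] for i in rng.integers(0, len(tasks), size=len(tasks))]
        per_task = [rng.choice(runs, size=len(runs), replace=True).mean() for runs in sampled]
        stats.append(float(np.mean(per_task)))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Hypothetical medal outcomes: 3 tasks, 4 seeds each.
print(stratified_bootstrap_ci([[1, 0, 0, 0], [0, 0, 0, 0], [1, 1, 0, 0]], n_boot=2000))
```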
Contamination and plagiarism analyses indicate near-zero correlation between LLM familiarity with competition pages and agent performance, and no substantial code overlap with top public notebooks (Chan et al., 2024).
6. Open Source Artifacts and Reproducibility Guidelines
The entire MLE-bench suite, including datasets, grading scripts, agent scaffolding code, and evaluation harnesses, is open-sourced (https://github.com/openai/mle-bench) (Chan et al., 2024). The repository includes:
- /competitions/: Competition data loaders and grading code.
- /agents/: Reference agents and scaffolds.
- /scripts/: Utilities for split preparation, orchestration, and leaderboard snapshotting.
- /eval/: Scoring harness, medal thresholds, and pass@k calculation.
- Complete logs, seed repetitions, and code to detect rule violations/plagiarism.
To ensure reproducibility, all random seeds, container images, hardware specs, and LLM versions are catalogued in the infra/CONFIG.md file, with the medal-threshold logic accessible in eval/medals.py.
7. Contextualization and Extensions
Situating MLE-bench within the broader ML benchmarking landscape, PMLB provides an instructive comparison (Olson et al., 2017). PMLB emphasizes systematic dataset curation, meta-feature profiling (instance/feature counts, class imbalance, etc.), and standardized cross-validated evaluation pipelines. MLE-bench mirrors and extends these principles while elevating the focus to autonomous ML engineering on real-world, heterogeneous tasks. Its meta-feature-aware dataset selection, version-controlled and fully transparent infrastructure, and integration of pass@k and anytime metrics collectively advance best practices for benchmarking complex agent behavior.
This suggests that future directions for MLE-bench may include simulating missing data, expanding to regression and structured-data tasks, building synthetic benchmarks that target undercovered regions of the meta-feature space, and community-driven extension of the benchmark corpus, as outlined in best practices distilled from both MLE-bench and PMLB experience (Olson et al., 2017; Chan et al., 2024).