
Terminal-Bench 2.0: AI Agent Benchmark

Updated 21 January 2026
  • Terminal-Bench 2.0 is a benchmark that evaluates AI agents on high-skill, long-horizon command-line tasks using realistic simulated environments.
  • It comprises 89 diverse tasks across 10 technical domains, each with Dockerized setups, detailed instructions, and pytest-based verification for reproducibility.
  • The benchmark provides quantitative metrics such as resolution rates and cost analysis, driving advances in long-context memory, toolchain interoperation, and creative reasoning.

Terminal-Bench 2.0 is a rigorously curated benchmark designed to evaluate AI agents on high-skill, long-horizon tasks encountered in professional command-line environments. Comprising 89 distinct tasks across ten technical domains, it establishes a fair, reproducible, and outcome-driven standard for assessing agentic LLMs and associated agent frameworks on the complex workflows that underpin real-world computing. Each task encapsulates a unique environment, a natural-language instruction, a comprehensive pytest-based verification suite, and a human-oracle solution, enabling deterministic and empirically grounded evaluation of agent performance (Merrill et al., 17 Jan 2026).

1. Core Design Principles

Terminal-Bench 2.0 operationalizes four foundational principles:

  • Realism: Each task replicates actual professional workflows (e.g., scientific computing, debugging, security engineering) within containers preconfigured with the precise files, dependencies, and system state a human expert would require.
  • Difficulty and Diversity: The benchmark’s 89 tasks span a broad spectrum of complexity: 8% are expected to take a junior engineer under an hour, 72% up to one workday, and 4% more than a week. Tasks cover ten high-level domains, with no single category predominating, so agents must generalize across diverse technical landscapes.
  • Outcome-Driven Specification: Completion is solely determined by passing deterministic pytest tests that evaluate the final container state (files, outputs, performance metrics, cryptographic properties), without regard for intermediate agent actions or console transcript. Containers are specified via Dockerfiles, bundled instructions, and accompanying oracle solution scripts; agents are unconstrained in strategy.
  • Rigorous Verification: Inclusion of each task mandates multiple automated and manual checks: canary test, static analysis of specifications, LLM-assisted audits, code review, smoke testing via “dummy agents,” and adversarial audits. On average, >3 person-hours per task are invested in confirming: specification specificity (tests pass iff state is correct), solvability (oracle passes tests), and anti-triviality (no circumventing shortcuts) (Merrill et al., 17 Jan 2026).
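The outcome-driven principle means a task's tests read like ordinary pytest assertions over final container state. A minimal sketch of what such a suite might look like (the artifact path and contents here are invented for illustration, not taken from any real task):

```python
import tempfile
from pathlib import Path

# Stand-in for the final container state an agent would leave behind;
# the artifact name and contents are hypothetical.
workdir = Path(tempfile.mkdtemp())
output = workdir / "summary.csv"
output.write_text("id,score\n1,0.93\n")

def test_output_exists():
    # Verification inspects only what exists after the agent finishes.
    assert output.exists()

def test_output_contents():
    # Deterministic check on the artifact's final contents.
    lines = output.read_text().strip().splitlines()
    assert lines[0] == "id,score"
    assert len(lines) > 1
```

Because nothing in such a suite references the agent's transcript, any strategy that produces the correct final state passes.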

2. Task Taxonomy and Environments

Task authors assign problems to ten categories:

Category               Tasks   Example Problems
Software Engineering   30      Compile legacy code, debug schemes
Scientific Computing   12      DNA primer generation, simulation model optimization
Security               9       Cryptanalysis (FEAL), binary reverse engineering
System Administration  8       Linux-module bootstrapping, file recovery
File Operations        7       Filesystem manipulation, git secret recovery
Mathematics            6       Metacircular evaluators, regular expression synthesis
Data Science           6       Data wrangling, statistical analysis
Machine Learning       5       Embedding model evaluation, ML pipeline management
Games                  3       Chess move enumeration, MuJoCo optimization
Personal Assistant     3       Automated scheduling, productivity scripting

Environments are instantiated as Docker containers with all requisite resources and dependencies pinned, enabling uniformity and isolation across trials. The use of legacy compilers, complex toolchains, third-party libraries, and adversarial challenge construction (such as regex generation or filter bypasses) ensures broad coverage of command-line expertise. Each task directory includes a task.yaml specification, a Dockerfile for environment reproduction, a tests/ folder housing deterministic pytest scripts, and solution.sh, a human-authored oracle solution (Merrill et al., 17 Jan 2026).

3. Evaluation Harness and Agent Integration

Terminal-Bench 2.0 tasks are registered in the Harbor repository and executed inside Daytona containers using the open-source Harbor harness. The evaluation integrates six agent frameworks:

  • Proprietary: Claude Code, Codex CLI, Gemini CLI
  • Open-source: OpenHands, Mini-SWE-Agent, Terminus 2 (“minimal headless” agent)

Agents interact with sixteen LLM backends, including proprietary frontier models (GPT-5.2, Claude Opus 4.5, Gemini 3 Pro) and open-weight baselines (Qwen 3, Minimax M2, Kimi K2 Thinking). Each agent-model pairing is run on every task at least five times, yielding a corpus of 32,155 trials. Metrics recorded include:

  • Resolution Rate: Fraction of trials where all tests pass before agent timeout (typically 60 minutes, up to two hours for certain tasks)
  • 95% Confidence Interval (CI): Calculated using the Wilson score method
  • Resource Use: Runtime, episode count, API calls (for Terminus 2), token in/out statistics
  • Cost Analysis: API expenditure per trial, Pareto frontier for cost-success optimization
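The Wilson score interval behind the reported 95% CIs can be sketched as:

```python
import math

def wilson_ci(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a resolution rate (z = 1.96 gives ~95%)."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (center - margin, center + margin)
```

Unlike the normal approximation, this interval stays inside [0, 1] even at extreme rates, which matters for agent-model pairings that resolve almost no tasks or almost all of them.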

4. Quantitative Results and Difficulty Calibration

Empirical findings indicate that frontier agent/model combinations remain challenged by the benchmark:

  • GPT-5.2+Codex CLI achieves the highest resolution rate at 62.9% ± 3.0%
  • Terminus 2+Claude Opus 4.5: 57.8% ± 2.5%
  • Terminus 2+Gemini 3 Pro: 56.9% ± 2.5%
  • Open-weight leaders lag (Terminus 2+Kimi K2 Thinking: 35.7% ± 2.8%)
  • Smaller models (e.g., GPT-5-Nano) rarely exceed 11%

Resolution rate correlates strongly with model capability and also depends on agent orchestration: for instance, Gemini 2.5 Pro’s pass rate improves by 17% with Terminus 2 scaffolding over OpenHands. Notably, Pareto trade-offs persist: maximum performance can require API costs of tens of dollars per run, while mid-tier open-weight models achieve 20–30% success for under $5.
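The cost-success Pareto frontier implied above is straightforward to extract from per-run records. A sketch with invented (cost, resolution-rate) points:

```python
def pareto_frontier(runs: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Keep the (cost, resolution_rate) points not dominated by any point
    that is cheaper-or-equal and at least as successful (illustrative helper)."""
    frontier = []
    best_rate = float("-inf")
    # Sort by ascending cost, breaking ties by descending rate, so the first
    # point seen at each cost level is the best achievable at that budget.
    for cost, rate in sorted(runs, key=lambda p: (p[0], -p[1])):
        if rate > best_rate:
            frontier.append((cost, rate))
            best_rate = rate
    return frontier
```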

Difficulty is classified empirically:

  • Easy: Resolution rate ≥ 2/3
  • Medium: 1/3 ≤ Resolution rate < 2/3
  • Hard: Resolution rate < 1/3
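These thresholds correspond to a direct mapping:

```python
def classify_difficulty(resolution_rate: float) -> str:
    """Map an empirical resolution rate to the benchmark's difficulty tiers."""
    if resolution_rate >= 2 / 3:
        return "Easy"
    if resolution_rate >= 1 / 3:
        return "Medium"
    return "Hard"
```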

Comparison with human expert predictions yields a Pearson correlation of $r = 0.436$ ($p < 0.001$) and high concordance: 93.3% of “human-Hard” tasks are empirically hard. However, tasks labeled “Medium” by humans often prove empirically hard for agents, especially when creative or adversarial reasoning is required (Merrill et al., 17 Jan 2026).

5. Failure Taxonomy and Error Analysis

Failures are categorized at two granularities:

Trajectory-Level Failures (Multi-Agent System Taxonomy adapted for single-agent analysis):

  • Execution Errors (~50%): Disobeying explicit requirements (“must”/“shall”), redundant step repetition, wrongly extending or prematurely terminating workflows
  • Coherence Errors (~25%): Context loss (e.g., forgetting file edits), reasoning-action mismatch, premature completion before objectives satisfied
  • Verification Errors (~25%): Inadequate validation (incorrect/no verification), reliance on superficial/passable checks without guaranteeing requirement fulfillment

Annotation with Docent and human inspection of failed trials reveal that these trends are consistent across top-performing frontier models; open-weight models exhibit more heterogeneous error signatures.

Command-Level Failures: Each agent command yields an exit code and output. LLM-as-judge mechanisms (agreement > 92%) classify 3,800 sampled errors into a 100-leaf taxonomy. Dominant issues include:

  • “Command not found / missing executable” (~24%)
  • “Executable error at runtime” (~9.6%)

Model-specific failure rates vary: Grok 4 observes 9.2% command failures, while GPT-OSS-120B exhibits 26.7% (Merrill et al., 17 Jan 2026).
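A first-pass triage of command-level failures can key off standard shell exit-code conventions (127 for a missing command, 126 for a file found but not executable, values above 128 for signal termination). The finer-grained 100-leaf labels in the benchmark come from the LLM judge, not from a rule like this sketch:

```python
def triage_exit_code(code: int) -> str:
    """Coarse failure bucket from a POSIX shell exit code (illustrative)."""
    if code == 0:
        return "success"
    if code == 127:
        return "command not found / missing executable"
    if code == 126:
        return "found but not executable"
    if code > 128:
        return f"killed by signal {code - 128}"
    return "executable error at runtime"
```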

6. Open Challenges and Prospective Directions

Terminal-Bench 2.0 surfaces several unsolved challenges:

  • Complex System Builds: Multi-stage compilation tasks (e.g., build-pov-ray, fix-ocaml-gc) highlight persistent difficulties in toolchain management, patching, and linkage resolution.
  • Long-Horizon Coherence: Agents frequently lose context over extended workflows, mishandling file changes or output dependencies.
  • Robust Verification: Despite available test suites, agents may bypass or weakly engage validation procedures.
  • Adversarial and Creative Reasoning: Tasks requiring filter bypasses, regular expression synthesis for chess move generation, or cryptanalysis demand abstraction beyond typical pattern recognition.

Potential technical remedies proposed include:

  • Enhanced “chain-of-thought” planning for managing filesystem state and long-sequence memory
  • Built-in verification loops (execute-verify-revise paradigms)
  • Specialized plugins for domain-specific tools (e.g., make, Git, package managers)
  • Automated environment introspection to minimize invocation and path errors

Algorithmically, the benchmark relies on Wilson-method confidence intervals, resolution-rate thresholds for difficulty classification, and standard mathematical operators ($\arg\max$, $\arg\min$) within task instructions.

A plausible implication is that progress in agentic LLMs measured against Terminal-Bench 2.0 will necessitate advances in long-context memory, toolchain interoperation, and creative reasoning modules to realize robust, autonomous technical co-workers. The reproducibility and specificity of the benchmark, coupled with rigorous coverage across authentic workflows, position it as a foundational framework for future AI agent evaluation (Merrill et al., 17 Jan 2026).

References

  • Merrill et al., 17 Jan 2026.
