Terminal-Bench 2.0: AI Agent Benchmark
- Terminal-Bench 2.0 is a benchmark that evaluates AI agents on high-skill, long-horizon command-line tasks using realistic simulated environments.
- It comprises 89 diverse tasks across 10 technical domains, each with Dockerized setups, detailed instructions, and pytest-based verification for reproducibility.
- The benchmark provides quantitative metrics such as resolution rates and cost analysis, driving advances in long-context memory, toolchain interoperation, and creative reasoning.
Terminal-Bench 2.0 is a rigorously curated benchmark designed to evaluate the capabilities of AI agents on high-skill, long-horizon tasks encountered in professional command-line environments. Comprising 89 distinct tasks across ten technical domains, it establishes a fair, reproducible, and outcome-driven standard for assessing agentic LLMs and associated agent frameworks on the complex workflows that underpin real-world computing. Each task encapsulates a unique environment, a natural-language instruction, a comprehensive pytest-based verification suite, and a human-oracle solution, enabling deterministic and empirically grounded evaluation of agent performance (Merrill et al., 17 Jan 2026).
1. Core Design Principles
Terminal-Bench 2.0 operationalizes four foundational principles:
- Realism: Each task replicates actual professional workflows (e.g., scientific computing, debugging, security engineering) within containers preconfigured with the precise files, dependencies, and system state a human expert would require.
- Difficulty and Diversity: The benchmark’s 89 tasks reflect a broad spectrum of complexity. 8% are expected to be completed by junior engineers in under an hour, 72% within one workday, and 4% over a week. Tasks span ten high-level domains, with no single category predominating, ensuring agents must generalize across diverse technical landscapes.
- Outcome-Driven Specification: Completion is solely determined by passing deterministic pytest tests that evaluate the final container state (files, outputs, performance metrics, cryptographic properties), without regard for intermediate agent actions or console transcript. Containers are specified via Dockerfiles, bundled instructions, and accompanying oracle solution scripts; agents are unconstrained in strategy.
- Rigorous Verification: Each task's inclusion requires multiple automated and manual checks: a canary test, static analysis of specifications, LLM-assisted audits, code review, smoke testing via “dummy agents,” and adversarial audits. On average, more than three person-hours per task are invested in confirming specification specificity (tests pass iff the final state is correct), solvability (the oracle passes the tests), and anti-triviality (no circumventing shortcuts) (Merrill et al., 17 Jan 2026).
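To make the outcome-driven principle concrete, here is a minimal sketch of a state-only check in the pytest style the benchmark describes. The file name and expected contents are hypothetical illustrations, not drawn from any actual task:

```python
import hashlib
import pathlib
import tempfile

def verify_final_state(workdir: pathlib.Path) -> bool:
    """Pass iff the required output file exists with the expected content digest.
    Only the final filesystem state is inspected, never the agent's transcript."""
    out = workdir / "result.txt"
    if not out.exists():
        return False
    expected = hashlib.sha256(b"42\n").hexdigest()
    return hashlib.sha256(out.read_bytes()).hexdigest() == expected

# Any strategy that produces the correct final state passes the check.
with tempfile.TemporaryDirectory() as d:
    work = pathlib.Path(d)
    assert not verify_final_state(work)         # unsolved environment fails
    (work / "result.txt").write_bytes(b"42\n")
    assert verify_final_state(work)             # correct final state passes
```

Because the check inspects only end state, an agent that reaches the goal via an unconventional route is scored identically to one following the oracle script.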
2. Task Taxonomy and Environments
Task authors assign problems to ten categories:
| Category | Number of Tasks | Example Problems |
|---|---|---|
| Software Engineering | 30 | Compile legacy code, debug schemes |
| Scientific Computing | 12 | DNA primer generation, simulation model optimization |
| Security | 9 | Cryptanalysis (FEAL), binary reverse engineering |
| System Administration | 8 | Linux-module bootstrapping, file recovery |
| File Operations | 7 | Filesystem manipulation, git secret recovery |
| Mathematics | 6 | Metacircular evaluators, regular expression synthesis |
| Data Science | 6 | Data wrangling, statistical analysis |
| Machine Learning | 5 | Embedding model evaluation, ML pipeline management |
| Games | 3 | Chess move enumeration, MuJoCo optimization |
| Personal Assistant | 3 | Automated scheduling, productivity scripting |
Environments are instantiated as Docker containers with all requisite resources and dependencies pinned, enabling uniformity and isolation across trials. The use of legacy compilers, complex toolchains, third-party libraries, and adversarial challenge construction (such as regex generation or filter bypasses) ensures broad coverage of command-line expertise. Each task directory includes a `task.yaml` specification, a `Dockerfile` for environment reproduction, a `tests/` folder housing deterministic pytest scripts, and `solution.sh`, a human-authored oracle solution (Merrill et al., 17 Jan 2026).
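The per-task layout just described can be validated mechanically. A hedged sketch, where the required-file list simply mirrors this section's description rather than any official schema:

```python
import pathlib

# Required entries per the task layout described above (illustrative list).
REQUIRED_FILES = ("task.yaml", "Dockerfile", "solution.sh")

def missing_task_files(task_dir: pathlib.Path) -> list[str]:
    """Return the required entries absent from a task directory,
    including the tests/ folder of deterministic pytest scripts."""
    missing = [name for name in REQUIRED_FILES
               if not (task_dir / name).is_file()]
    if not (task_dir / "tests").is_dir():
        missing.append("tests/")
    return missing
```

A check of this shape could run as one of the automated inclusion gates before a task enters the benchmark.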
3. Evaluation Harness and Agent Integration
Terminal-Bench 2.0 tasks are registered in the Harbor repository and executed inside Daytona containers using the open-source Harbor harness. The evaluation integrates six agent frameworks:
- Proprietary: Claude Code, Codex CLI, Gemini CLI
- Open-source: OpenHands, Mini-SWE-Agent, Terminus 2 (“minimal headless” agent)
Agents interact with sixteen LLM backends, including proprietary frontier models (GPT-5.2, Claude Opus 4.5, Gemini 3 Pro) and open-weight baselines (Qwen 3, Minimax M2, Kimi K2 Thinking). Each agent-model pairing is run on every task at least five times, yielding a corpus of 32,155 trials. Metrics recorded include:
- Resolution Rate: Fraction of trials where all tests pass before agent timeout (typically 60 minutes, up to two hours for certain tasks)
- 95% Confidence Interval (CI): Calculated using the Wilson score method
- Resource Use: Runtime, episode count, API calls (for Terminus 2), token in/out statistics
- Cost Analysis: API expenditure per trial, Pareto frontier for cost-success optimization
4. Quantitative Results and Difficulty Calibration
Empirical findings indicate that frontier agent/model combinations remain challenged by the benchmark:
- GPT-5.2+Codex CLI achieves the highest resolution rate at 62.9% ± 3.0%
- Terminus 2+Claude Opus 4.5: 57.8% ± 2.5%
- Terminus 2+Gemini 3 Pro: 56.9% ± 2.5%
- Open-weight leaders lag (Terminus 2+Kimi K2 Thinking: 35.7% ± 2.8%)
- Smaller models (e.g., GPT-5-Nano) rarely exceed 11%
Resolution rate correlates strongly with model capability and also depends on agent orchestration; for instance, Gemini 2.5 Pro’s pass rate improves by 17% with Terminus 2 scaffolding over OpenHands. Notably, Pareto trade-offs persist: maximum performance can require API costs of tens of dollars per run, while mid-tier open-weight models achieve 20–30% success for under $5.
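The cost-success Pareto frontier referenced above is a standard dominance filter over (cost, resolution-rate) pairs. A sketch over hypothetical run data (the numbers below are invented for illustration):

```python
def pareto_frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Keep (cost, success) points not dominated by a cheaper-and-better point.
    Lower cost is better; higher success rate is better."""
    frontier = []
    best_success = float("-inf")
    for cost, success in sorted(points):   # scan in ascending cost order
        if success > best_success:         # strictly beats every cheaper point
            frontier.append((cost, success))
            best_success = success
    return frontier

# Hypothetical (API cost in $, resolution rate) pairs:
runs = [(4.0, 0.25), (12.0, 0.55), (9.0, 0.40), (15.0, 0.50), (30.0, 0.63)]
print(pareto_frontier(runs))  # [(4.0, 0.25), (9.0, 0.4), (12.0, 0.55), (30.0, 0.63)]
```

The dominated point (15.0, 0.50) drops out because a cheaper configuration achieves a higher success rate, which is exactly the comparison practitioners face when choosing a model-agent pairing under budget.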
Difficulty is classified empirically:
- Easy: Resolution rate ≥ 2/3
- Medium: 1/3 ≤ Resolution rate < 2/3
- Hard: Resolution rate < 1/3
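These thresholds translate directly into a classifier; a minimal sketch:

```python
def difficulty(resolution_rate: float) -> str:
    """Map an empirical resolution rate onto the Easy/Medium/Hard bands
    defined by the benchmark's 1/3 and 2/3 thresholds."""
    if resolution_rate >= 2 / 3:
        return "Easy"
    if resolution_rate >= 1 / 3:
        return "Medium"
    return "Hard"
```

Because the bands depend only on the aggregate resolution rate, a task's empirical difficulty can shift between releases as agents improve, without any change to the task itself.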
Comparison with human expert predictions yields a Pearson correlation of $r = 0.436$ ($p < 0.001$) and demonstrates high concordance (93.3% of “human-Hard” tasks are empirically hard), but tasks labeled “Medium” by humans often prove empirically hard for agents, especially when creative or adversarial reasoning is required (Merrill et al., 17 Jan 2026).
5. Failure Taxonomy and Error Analysis
Failures are categorized at two granularities:
Trajectory-Level Failures (Multi-Agent System Taxonomy adapted for single-agent analysis):
- Execution Errors (~50%): Disobeying explicit requirements (“must”/“shall”), redundant step repetition, wrongly extending or prematurely terminating workflows
- Coherence Errors (~25%): Context loss (e.g., forgetting file edits), reasoning-action mismatch, premature completion before objectives satisfied
- Verification Errors (~25%): Inadequate validation (incorrect/no verification), reliance on superficial/passable checks without guaranteeing requirement fulfillment
Annotation with Docent and human inspection of failed trials reveal that these trends are consistent across top-performing frontier models; open-weight models show more heterogeneous error signatures.
Command-Level Failures: Each agent command yields an exit code and output. LLM-as-judge mechanisms (agreement > 92%) classify 3,800 sampled errors into a 100-leaf taxonomy. Dominant issues include:
- “Command not found / missing executable” (~24%)
- “Executable error at runtime” (~9.6%)
Model-specific failure rates vary: Grok 4 shows a 9.2% command-failure rate, while GPT-OSS-120B exhibits 26.7% (Merrill et al., 17 Jan 2026).
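Coarse command-level triage can key off standard POSIX-shell exit conventions before any LLM-as-judge pass; a hedged sketch in which the labels simply mirror the dominant categories above (the full 100-leaf taxonomy requires output inspection, not just exit codes):

```python
def triage_exit(code: int) -> str:
    """Rough failure bucket from POSIX-shell exit conventions:
    127 = command not found, 126 = found but not executable,
    any other nonzero code = failure during execution, 0 = success."""
    if code == 0:
        return "success"
    if code == 127:
        return "command not found / missing executable"
    if code == 126:
        return "found but not executable"
    return "executable error at runtime"
```

Cheap exit-code triage like this narrows the sample that needs the more expensive LLM-judged classification into fine-grained leaves.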
6. Open Challenges and Prospective Directions
Terminal-Bench 2.0 surfaces several unsolved challenges:
- Complex System Builds: Multi-stage compilation tasks (e.g., build-pov-ray, fix-ocaml-gc) highlight persistent difficulties in toolchain management, patching, and linkage resolution.
- Long-Horizon Coherence: Agents frequently lose context over extended workflows, mishandling file changes or output dependencies.
- Robust Verification: Despite available test suites, agents may bypass or weakly engage validation procedures.
- Adversarial and Creative Reasoning: Tasks requiring filter bypasses, regular expression synthesis for chess move generation, or cryptanalysis demand abstraction beyond typical pattern recognition.
Potential technical remedies proposed include:
- Enhanced “chain-of-thought” planning for managing filesystem state and long-sequence memory
- Built-in verification loops (execute-verify-revise paradigms)
- Specialized plugins for domain-specific tools (e.g., make, Git, package managers)
- Automated environment introspection to minimize invocation and path errors
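The execute-verify-revise remedy can be sketched as a simple control loop. Here `run_step` and `verify` are hypothetical callables standing in for the agent's action policy and the task's outcome checks; they are not part of any benchmark API:

```python
from typing import Callable

def execute_verify_revise(
    run_step: Callable[[int], None],
    verify: Callable[[], bool],
    max_rounds: int = 3,
) -> bool:
    """Act, check the resulting state, and revise until verification
    passes or the round budget is exhausted."""
    for attempt in range(max_rounds):
        run_step(attempt)      # take (or revise) an action
        if verify():           # outcome-driven check, as in the benchmark
            return True
    return False
```

The key design point is that `verify` inspects state rather than trusting the agent's own claim of completion, which directly targets the verification-error class identified in the failure taxonomy.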
Algorithmically, the benchmark relies on Wilson-method confidence intervals, resolution-rate thresholds for difficulty classification, and standard LLM operators within task instructions.
A plausible implication is that progress in agentic LLMs measured against Terminal-Bench 2.0 will necessitate advances in long-context memory, toolchain interoperation, and creative reasoning modules to realize robust, autonomous technical co-workers. The reproducibility and specificity of the benchmark, coupled with rigorous coverage across authentic workflows, position it as a foundational framework for future AI agent evaluation (Merrill et al., 17 Jan 2026).