AgentBench: Evaluating LLM Agent Capabilities
- AgentBench is a multi-dimensional benchmark designed to assess LLM agents' reasoning, planning, and decision-making in interactive, text-based scenarios.
- It utilizes eight distinct environments in code, games, and web tasks, employing metrics like success rate, overall reward, and F1 score for comprehensive evaluation.
- The benchmark identifies failure modes such as task limit exceedance and invalid actions, providing actionable insights for improving agent-oriented LLM development.
AgentBench is a multi-dimensional benchmark designed to evaluate the reasoning, decision-making, planning, and action capabilities of LLMs when deployed as autonomous agents in interactive, text-based environments. Developed to address the inadequacies of single-turn, static NLP benchmarks, AgentBench measures agentic competence across diverse scenarios—requiring perception, multi-turn interaction, tool use, and adaptive decision-making. The suite comprises eight distinct environments spanning code, games, and web tasks, with rigorously defined protocols and metrics to assess both open-source and commercial LLMs under uniform conditions. AgentBench exposes disparities in current LLM capabilities, delineates failure modes, and supports actionable recommendations for agent-oriented LLM development, establishing a standardized platform for rigorous research into LLM-based agents (Liu et al., 2023).
1. Benchmark Scope and Environments
AgentBench evaluates LLMs-as-agents in eight environments partitioned into three categories:
A. Code-Grounded Tasks:
- Operating System (OS): Agents answer questions or perform live Ubuntu shell operations. Actions per turn are either bash snippets or answer submission, with up to 8 turns. Outputs are validated by scripted pipelines.
- Database (DB): Translation of natural instructions into SQL queries on MySQL tables. Up to 5 SQL statements/answers per interaction, evaluated via matching ground-truth outcomes.
- Knowledge Graph (KG): Multi-API KBQA over Freebase requiring sequences of tool calls (≥5, e.g., get_relations, intersection). Up to 15 steps, with environment feedback returned as S-expression results; metrics include final-answer F1 and exact match.
B. Game-Grounded Tasks:
- Digital Card Game (DCG): A text-based “Aquawar” game pitting the agent against a scripted baseline. Up to 30 moves, measured by win rate and combined reward (win/damage, weighted 0.7/0.3).
- Lateral Thinking Puzzles (LTP): Agents solve puzzles by posing yes/no/irrelevant questions to uncover “truth bullets” (key facts), with up to 25 rounds. The game-progress metric measures the fraction of key points recovered.
- House-Holding (HH): Simulation of prototypical ALFWorld tasks (e.g., “put a pan on the dining table”), with agents executing text commands in household scenarios for up to 35 turns.
C. Web-Grounded Tasks:
- Web Shopping (WS): Agents fulfill natural-language shopping requests on a simulated e-commerce site. Five turns maximum—reward is a normalized match of desired/found attributes.
- Web Browsing (WB): Multi-site web agent tasks (Mind2Web corpus), with actions as HTML element selection and operations (Click/Type/Select); step-level correctness is quantified as step success rate.
This diversity is intended to preclude success through modality-specific tricks or single-domain specialization (Liu et al., 2023).
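All eight environments share the same multi-turn interaction pattern: the agent observes, acts, and receives feedback until the task resolves or the per-task turn budget is exhausted. A minimal sketch of that loop, using toy stand-ins (`EchoEnv`, `FixedAgent`) rather than the official AgentBench interfaces:

```python
def run_episode(agent, env, max_turns):
    """Run one task until success/failure or the turn limit is hit."""
    observation = env.reset()              # initial task description
    for _ in range(max_turns):
        action = agent.act(observation)
        observation, reward, done = env.step(action)
        if done:                           # task solved or explicitly failed
            return reward, "finished"
    return 0.0, "task_limit_exceeded"      # the TLE failure mode

class EchoEnv:
    """Toy environment: succeeds once the agent submits the answer."""
    def reset(self):
        return "question: what is 2 + 2?"
    def step(self, action):
        if action == "answer 4":
            return "correct", 1.0, True
        return "try again", 0.0, False

class FixedAgent:
    """Toy agent that replays a scripted sequence of actions."""
    def __init__(self, actions):
        self.actions = iter(actions)
    def act(self, observation):
        return next(self.actions)

reward, reason = run_episode(FixedAgent(["ls", "answer 4"]), EchoEnv(), max_turns=8)
```

The per-environment turn budgets quoted above (8 for OS, 25 for LTP, 35 for HH, etc.) correspond to `max_turns` in this sketch.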
2. Evaluation Protocols and Metrics
AgentBench incorporates multiple, precisely defined metrics tailored to each environment:
| Environment(s) | Primary Metric | Formula/Definition |
|---|---|---|
| OS, DB, HH | Success Rate (SR) | Fraction of task instances completed correctly |
| KG | F1 Score | F1 overlap between predicted and ground-truth answer sets |
| DCG | Overall Reward | Weighted combination of win and damage components (0.7/0.3) |
| LTP | Game Progress | Fraction of key facts (“truth bullets”) recovered |
| WS | WebShop Reward | Normalized attribute/option overlap with a type-match factor |
| WB | Step Success Rate | Fraction of steps with correct element selection and operation |
To compute a single overall score, AgentBench first normalizes each environment so that the mean score across evaluated models equals 1 (dividing each raw score by the environment’s mean), then averages the normalized scores across environments. This reciprocal-mean weighting prevents environments with larger raw score scales from dominating the aggregate (Liu et al., 2023).
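The normalization step can be sketched in a few lines; the raw scores below are hypothetical, chosen only to show how differently scaled environments are brought onto a common footing:

```python
import numpy as np

# Hypothetical raw scores: rows = models, columns = environments.
# Column scales differ wildly, so naive averaging would let the
# high-scale environments dominate the overall score.
raw = np.array([
    [42.0, 0.32, 61.0],   # model A
    [18.0, 0.11, 34.0],   # model B
    [ 6.0, 0.05, 12.0],   # model C
])

# Normalize each environment so its mean across models equals 1.
env_means = raw.mean(axis=0)
normalized = raw / env_means

# Overall score = average of the normalized per-environment scores.
overall = normalized.mean(axis=1)
```

Because every column of `normalized` averages to 1, an overall score above 1 means a model beats the cross-model mean on average, mirroring how scores like GPT-4’s 4.01 should be read.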
3. Key Experimental Findings
Extensive evaluation on 27 LLMs (commercial API and open-source models) revealed:
- Commercial LLMs outperform: GPT-4 (ver. 0613) achieved an overall score of 4.01, highest in 6/8 tasks (e.g., 78% SR on HH, 74.5% DCG win rate). Claude-2 and GPT-3.5-turbo followed at 2.49 and 2.32, respectively.
- Large gap for OSS models: Top open-source model CodeLlama-34B scored only 0.96—less than half GPT-3.5-turbo’s 2.32. Most open-source models clustered between 0.2–0.8, with especially poor performance (<10% SR) on KG, DCG, HH.
- Domain and task dependency: OSS models may match closed-source LLMs on static academic benchmarks, yet fall short on multi-step, interactive scenarios due to persistent weaknesses in chain-of-thought, extended decision sequences, and robust action selection.
This performance delta (mean 2.15 vs. 0.51 overall) highlights the continued advantage of proprietary models in practical agentic deployments (Liu et al., 2023).
4. Diagnosed Failure Modes
AgentBench systematically categorizes agent failures (termination reasons):
- Task Limit Exceeded (TLE): Predominant failure mode—agent reaches the maximum turn limit (e.g., 67.9% in KG, 82.5% in LTP), indicating poor long-term planning, frequent looping, or lack of backtracking.
- Invalid Format (IF): Notable in DB/DCG tasks (e.g., 53% DB IF), usually due to misformatted answers (e.g., missing SQL backticks, incorrect “Action:” labels), revealing instruction-following brittleness.
- Invalid Action (IA): Common where discrete action spaces exist (HH/WB), with agents often producing out-of-domain or mis-indexed actions.
- Context Limit Exceeded (CLE): Occurs infrequently; reflects the model’s context window overflowing during extended interactions.
- Instruction Omission and Repetition: Even best models occasionally omit required action markers or re-execute failing strategies (e.g., repeatedly opening the same cabinet in HH).
These granular diagnostics enable targeted architectural and prompt engineering improvements (Liu et al., 2023).
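One way to operationalize such diagnostics is to record a termination reason per episode and aggregate into rates; the enum labels below mirror the categories above but are a hypothetical sketch, not the official toolkit’s API:

```python
from collections import Counter
from enum import Enum

class Termination(Enum):
    """Per-episode termination reasons, mirroring AgentBench's categories."""
    COMPLETED = "completed"
    TASK_LIMIT_EXCEEDED = "TLE"
    INVALID_FORMAT = "IF"
    INVALID_ACTION = "IA"
    CONTEXT_LIMIT_EXCEEDED = "CLE"

def failure_profile(outcomes):
    """Aggregate episode termination reasons into a rate per category."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {t.value: counts.get(t, 0) / total for t in Termination}

# Toy run: 2 TLE, 1 success, 1 invalid format out of 4 episodes.
example = [Termination.TASK_LIMIT_EXCEEDED] * 2 + \
          [Termination.COMPLETED, Termination.INVALID_FORMAT]
profile = failure_profile(example)
```

Comparing such profiles across models (e.g., 67.9% TLE in KG) points to which remedy, longer-horizon planning versus stricter output formatting, a given model needs most.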
5. Recommendations for LLM-Agent Development
Quantitative and qualitative analyses of AgentBench runs have resulted in specific, empirically grounded recommendations:
- Code-Targeted Pretraining: LLMs pretrained on code (e.g., CodeLlama) show strong gains in structured planning and procedural action environments.
- High-Quality Multi-Turn Alignment: Models fine-tuned with multi-turn, high-quality dialog data (e.g., Vicuna-13B with ShareGPT) outperform those trained solely on generic instruction data.
- Long-Horizon Reasoning Techniques: High TLE rates underscore the need for structured chain-of-thought, reflection, search, and memory-augmented reasoning approaches.
- Rigorous Prompt/Format Engineering: Enforcing explicit output formats at each step can sharply reduce IF/IA errors; automated recovery or re-prompting mechanisms should be explored.
These recommendations are intended to guide the next generation of LLM agents toward higher reliability and capability in agentic settings (Liu et al., 2023).
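The format-enforcement recommendation can be sketched as a validate-and-re-prompt wrapper; `call_model`, the `Action: name(args)` pattern, and the retry budget here are illustrative assumptions, not AgentBench specifics:

```python
import re

# Expected output format: a single line "Action: name(args)".
ACTION_RE = re.compile(r"^Action:\s*(\w+)\((.*)\)\s*$", re.MULTILINE)

def get_valid_action(call_model, prompt, max_retries=2):
    """Return (name, args) from the first well-formed reply, else None.

    `call_model` is a placeholder for whatever LLM API is in use.
    """
    for _ in range(max_retries + 1):
        reply = call_model(prompt)
        match = ACTION_RE.search(reply)
        if match:
            return match.group(1), match.group(2)
        # Re-prompt with an explicit reminder of the required format,
        # recovering from would-be Invalid Format (IF) errors.
        prompt += ("\nYour last reply was malformed. "
                   "Respond with exactly one line: Action: name(args)")
    return None

# Toy model: first reply malformed, second well-formed.
replies = iter(["Let me click the button.", "Action: click(button_3)"])
result = get_valid_action(lambda prompt: next(replies), "Select the button.")
```

A wrapper like this turns many IF terminations into successful turns at the cost of extra model calls, which is why the text suggests exploring automated recovery rather than mandating it.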
6. Released Resources and Research Impact
AgentBench is fully open-sourced, providing:
- Eight dockerized environment simulators covering all task modalities.
- Official train/dev/test splits, answer scripts, and reproducible checking pipelines.
- An integrated evaluation toolkit with a modular, HTTP-based client-server architecture using max-flow task scheduling for scalable, controlled experiments.
- A continuously updated leaderboard (https://llmbench.ai/agent) and code repository (https://github.com/THUDM/AgentBench), supporting broad research participation.
AgentBench establishes the first comprehensive, open, and evolving foundation for agentic LLM benchmarking, catalyzing method development, reproducible experimentation, and comparative analysis across academic and industrial groups (Liu et al., 2023).