T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step
Abstract: Large language models (LLMs) have achieved remarkable performance on various NLP tasks and can be augmented with tools for broader applications. Yet how to evaluate and analyze the tool-utilization capability of LLMs remains under-explored. In contrast to previous works that evaluate models holistically, we comprehensively decompose tool utilization into multiple sub-processes: instruction following, planning, reasoning, retrieval, understanding, and review. Based on this decomposition, we introduce T-Eval, a benchmark that evaluates tool-utilization capability step by step. T-Eval disentangles tool-utilization evaluation into sub-domains aligned with model capabilities, enabling an inner understanding of both the holistic and the isolated competencies of LLMs. We conduct extensive experiments on T-Eval with in-depth analysis of various LLMs. T-Eval not only exhibits consistency with outcome-oriented evaluation but also offers a finer-grained analysis of LLM capabilities, providing a new perspective on evaluating the tool-utilization ability of LLMs. The benchmark is available at https://github.com/open-compass/T-Eval.
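The decomposed, per-capability evaluation described in the abstract can be sketched as a minimal harness. This is an illustrative assumption of how step-by-step scoring might look, not the actual T-Eval implementation; the function names, the dataset schema, and the exact-match scorer are all hypothetical stand-ins for T-Eval's capability-specific metrics.

```python
# Minimal sketch of step-by-step tool-utilization evaluation.
# All names, the dataset schema, and the scoring rule are illustrative
# assumptions, not the actual T-Eval implementation.

# The six sub-processes the paper decomposes tool utilization into.
CAPABILITIES = ["instruct", "plan", "reason", "retrieve", "understand", "review"]

def score_response(capability: str, response: str, gold: str) -> float:
    """Placeholder scorer: exact match for every sub-process.
    (T-Eval uses capability-specific metrics; this stands in for them.)"""
    return 1.0 if response.strip() == gold.strip() else 0.0

def evaluate(model, dataset: list) -> dict:
    """Average score per capability, plus an overall mean across capabilities."""
    totals = {c: [] for c in CAPABILITIES}
    for item in dataset:
        cap = item["capability"]
        response = model(item["prompt"])  # model is any prompt -> text callable
        totals[cap].append(score_response(cap, response, item["gold"]))
    per_cap = {c: sum(v) / len(v) for c, v in totals.items() if v}
    per_cap["overall"] = sum(per_cap.values()) / len(per_cap)
    return per_cap

# Usage with a trivial echo "model" that returns the prompt verbatim:
if __name__ == "__main__":
    data = [
        {"capability": "plan", "prompt": "p", "gold": "p"},
        {"capability": "reason", "prompt": "q", "gold": "x"},
    ]
    print(evaluate(lambda p: p, data))  # per-capability scores + overall mean
```

Reporting one score per sub-process, rather than a single end-to-end score, is what lets this style of evaluation localize which capability a model fails on.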