
StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models

Published 12 Mar 2024 in cs.CL (arXiv:2403.07714v5)

Abstract: LLMs have witnessed remarkable advancements in recent years, prompting the exploration of tool learning, which integrates LLMs with external tools to address diverse real-world challenges. Assessing the capability of LLMs to utilise tools necessitates large-scale and stable benchmarks. However, previous works relied on either hand-crafted online tools with limited scale, or large-scale real online APIs suffering from instability of API status. To address this problem, we introduce StableToolBench, a benchmark evolving from ToolBench, proposing a virtual API server and stable evaluation system. The virtual API server contains a caching system and API simulators which are complementary to alleviate the change in API status. Meanwhile, the stable evaluation system designs solvable pass and win rates using GPT-4 as the automatic evaluator to eliminate the randomness during evaluation. Experimental results demonstrate the stability of StableToolBench, and further discuss the effectiveness of API simulators, the caching system, and the evaluator system.


Summary

  • The paper introduces StableToolBench, a benchmark that uses a virtual API server and caching system to ensure reproducible evaluations of tool learning in LLMs.
  • It employs a GPT-4 powered API simulator to emulate real API behaviors, maintaining consistent performance despite external fluctuations.
  • Its evaluation metrics, Solvable Pass Rate (SoPR) and Solvable Win Rate (SoWR), markedly improve the consistency and realism of LLM performance assessment.

StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of LLMs

Introduction

In the paper "StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of LLMs," the researchers present a newly designed benchmark called StableToolBench. This benchmark aims to address the limitations of previous tool benchmarks that either relied on limited-scope offline tools or large-scale, but unstable, real-world online APIs. StableToolBench introduces a novel virtual API server along with a stable evaluation system designed to ensure consistent and reliable performance assessments for LLMs involved in tool learning tasks.

Benchmark Design and Implementation

StableToolBench consists of a virtual API server that integrates a caching system and an API simulator to mitigate the dependency on real-world API stability.

Virtual API Server

  1. Caching System: The caching system captures the responses of API calls to maintain consistent behavior over time, reducing fluctuations due to API changes or outages.
  2. API Simulator: For APIs not covered by the cache, the simulator uses LLMs, specifically GPT-4, to emulate real API behaviors. This approach ensures the system can provide consistent responses regardless of external API status (Figure 1).

    Figure 1: The process of calling APIs in our proposed virtual API server.

Through this dual mechanism, StableToolBench offers a more stable and reproducible environment for evaluating tool learning in LLMs.
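The cache-first, simulate-on-failure dispatch described above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the class and method names, the cache-key scheme, and the `call_real_api` / `simulate_with_llm` callables are all assumptions standing in for the benchmark's actual components.

```python
import hashlib
import json


class VirtualAPIServer:
    """Minimal sketch of a virtual API server: serve from cache when
    possible, fall back to the real API, and simulate when it is down."""

    def __init__(self, call_real_api, simulate_with_llm):
        # Both callables take (api_name, arguments) and return a response.
        self.cache = {}
        self.call_real_api = call_real_api
        self.simulate_with_llm = simulate_with_llm

    def _key(self, api_name, arguments):
        # Deterministic key over the call signature; one plausible choice.
        payload = json.dumps({"api": api_name, "args": arguments},
                             sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def call(self, api_name, arguments):
        key = self._key(api_name, arguments)
        if key in self.cache:              # 1) cached response wins
            return self.cache[key]
        try:
            response = self.call_real_api(api_name, arguments)
        except Exception:                  # 2) real API unavailable: simulate
            response = self.simulate_with_llm(api_name, arguments)
        self.cache[key] = response         # 3) cache so later runs are stable
        return response
```

Because every response is cached after its first retrieval, repeated benchmark runs see identical API behavior even if the underlying service later changes or goes offline.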

Evaluation System

The paper introduces a stable evaluation metric system composed of Solvable Pass Rate (SoPR) and Solvable Win Rate (SoWR). The evaluation process involves:

  1. Determining Task Solvability: Using multiple advanced LLMs to ascertain whether a task is inherently solvable. Restricting evaluation to tasks identified as solvable avoids the variability introduced by ambiguities in the problem context (Figure 2).

    Figure 2: The process of our SoPR evaluation.

  2. Automated Evaluation with Strong LLMs: Replacing traditional evaluators like GPT-3.5 with GPT-4 ensures more accurate assessments, mitigating randomness and improving correlation with human judgment.
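Combining the two steps above, a solvable pass rate can be sketched as: filter tasks by a majority vote of solvability judges, then compute the pass rate over the surviving tasks only. The field names and the simple-majority threshold here are assumptions for illustration; in the paper, several LLMs judge solvability and GPT-4 judges whether each attempt passes.

```python
def solvable_pass_rate(tasks):
    """Illustrative SoPR-style metric.

    tasks: list of dicts, each with
      'solvability_votes': list of bools, one per judge LLM
      'passed': bool verdict from the automatic (GPT-4) evaluator
    """
    # Keep only tasks a strict majority of judges deem solvable.
    solvable = [
        t for t in tasks
        if sum(t["solvability_votes"]) > len(t["solvability_votes"]) / 2
    ]
    if not solvable:
        return 0.0
    # Pass rate is computed over solvable tasks only, so unsolvable or
    # ambiguous tasks cannot inject noise into the score.
    return sum(t["passed"] for t in solvable) / len(solvable)
```

A win rate (SoWR) would follow the same pattern, with `passed` replaced by a pairwise win/lose verdict against a reference model's answer on each solvable task.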

Performance Analysis

The experimental results exhibit significant improvements in stabilizing model performances across various scenarios. The virtual API server maintains consistent performance metrics despite fluctuations in real API availability. Furthermore, the caching system, combined with advanced LLM-backed simulation, significantly enhances benchmark stability, as evidenced by the reproducibility of results across different configuration settings (Figure 3).

Figure 3: Performance change when manually making APIs down with our virtual online API system. The results are averaged over all six groups.

Turing Test and Diversity Assessments

A critical part of assessing the validity of the API simulator is its ability to emulate real APIs convincingly. A "Turing Test" demonstrated that human evaluators found it difficult to distinguish between real and simulated API responses, affirming the simulator's robustness and realism (Figure 4).

Figure 4: Results of the "Turing Test" for the real and simulated APIs. Results are win-lose-tie percentages.

Additionally, the assessment of API response diversity confirmed that the simulator maintains a rich variety of responses, akin to real API systems (Figure 5).

Figure 5: Visualisation of the embeddings of responses from real and simulated APIs.

Conclusion

StableToolBench represents a significant step forward in creating reliable and scalable benchmarks for tool learning with LLMs. By resolving issues of reproducibility and stability prevalent in previous benchmarks, it provides a more realistic and dependable framework for evaluating the capabilities of LLMs in using external tools. Future work may explore integrating open-source models to further increase accessibility and reduce dependency on closed-source solutions.
