NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls
Abstract: The resurgence of autonomous agents built with LLMs to solve complex real-world tasks has brought increased focus on a fundamental LLM capability: tool or function calling. At the core of these agents, an LLM must plan, execute, and respond using external tools, APIs, and custom functions. Research on tool calling has gathered momentum, but evaluation benchmarks and datasets representing the complexity of such tasks have lagged behind. In this work, we focus on one such complexity, nested sequencing, with the goal of extending existing benchmarks and evaluations. Specifically, we present NESTFUL, a benchmark for evaluating LLMs on nested sequences of API calls, i.e., sequences where the output of one API call is passed as input to a subsequent call. NESTFUL contains over 1800 nested sequences in which every function call is executable. Experimental results across a variety of models show that the best-performing model (GPT-4o) achieves a full-sequence match accuracy of only 28% and a win rate of 60%, leaving substantial room for improvement in the nested-sequencing aspect of function calling. Our analysis of these results suggests future research directions for the community, in addition to providing a benchmark to track progress. We have released the NESTFUL dataset under the Apache 2.0 license at https://github.com/IBM/NESTFUL.
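To make the notion of a nested sequence concrete, the following is a minimal sketch in Python. The API names (`get_coordinates`, `get_temperature`) and their return values are hypothetical stand-ins, not functions from the NESTFUL dataset; the point is only the data dependency, where the output of the first call is consumed as an input of the second.

```python
def get_coordinates(city: str) -> tuple[float, float]:
    # Hypothetical geocoding API: returns (lat, lon) for a city name.
    return {"Paris": (48.86, 2.35)}[city]

def get_temperature(lat: float, lon: float) -> float:
    # Hypothetical weather API: returns the temperature at coordinates.
    return 21.0

# Nested sequence: the second call cannot be issued until the first
# call's output is available, so a model must plan the calls in order
# and route the intermediate result correctly.
var1 = get_coordinates("Paris")               # call 1
temp = get_temperature(var1[0], var1[1])      # call 2 consumes call 1's output
```

A model evaluated on such a sequence must both select the right APIs and bind the first call's output to the correct parameters of the second, which is exactly the failure mode the benchmark isolates.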