On the Tool Manipulation Capability of Open-source Large Language Models

Published 25 May 2023 in cs.CL, cs.AI, and cs.LG | (2305.16504v1)

Abstract: Recent studies on software tool manipulation with LLMs mostly rely on closed model APIs. The industrial adoption of these models is substantially constrained due to the security and robustness risks in exposing information to closed LLM API services. In this paper, we ask can we enhance open-source LLMs to be competitive to leading closed LLM APIs in tool manipulation, with practical amount of human supervision. By analyzing common tool manipulation failures, we first demonstrate that open-source LLMs may require training with usage examples, in-context demonstration and generation style regulation to resolve failures. These insights motivate us to revisit classical methods in LLM literature, and demonstrate that we can adapt them as model alignment with programmatic data generation, system prompts and in-context demonstration retrievers to enhance open-source LLMs for tool manipulation. To evaluate these techniques, we create the ToolBench, a tool manipulation benchmark consisting of diverse software tools for real-world tasks. We demonstrate that our techniques can boost leading open-source LLMs by up to 90% success rate, showing capabilities competitive to OpenAI GPT-4 in 4 out of 8 ToolBench tasks. We show that such enhancement typically requires about one developer day to curate data for each tool, rendering a recipe with practical amount of human supervision.

Summary

The paper shows that programmatic data generation, system prompts, and in-context demonstration retrievers can raise the tool manipulation success rate of leading open-source LLMs by up to 90%.
It details an action generation framework that uses API documentation and in-context demonstrations to improve the accuracy of tool invocation.
Empirical results on the ToolBench benchmark indicate that, with about one developer day of data curation per tool, the enhanced open-source models are competitive with OpenAI GPT-4 on 4 of 8 tasks.

Introduction

The ability of LLMs to manipulate software tools has attracted significant attention. However, most work in this area relies on closed LLM APIs, which pose security and robustness risks in industrial settings because information must be exposed to an external service. This paper asks whether open-source LLMs, enhanced with programmatic data generation, system prompts, and in-context demonstration retrievers, can match closed LLM APIs with a practical amount of human supervision.

Tool Manipulation Setup

The tool manipulation framework treats the LLM as an action generator: given a goal and access to the API documentation, the model emits API calls that are executed against the software. The study covers both single-step and multi-step scenarios. In the multi-step case, the LLM iteratively generates API calls based on environmental feedback until the goal is achieved.

Figure 1: Tool manipulation setup with API documentation access for LLMs to function as action generators.
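The multi-step setup described above can be sketched as a simple loop. This is only an illustration under stated assumptions, not the paper's implementation: `run_episode`, `build_prompt`, the prompt layout, and the `generate_action` and `execute` callables are all hypothetical names standing in for the model and the tool environment.

```python
from typing import Callable, List, Tuple

# (action, observation) pairs collected so far in the episode.
History = List[Tuple[str, str]]


def build_prompt(goal: str, api_docs: str, history: History) -> str:
    """Assemble a prompt from the API docs, the goal, and past feedback.

    The layout is an assumption made for this sketch.
    """
    lines = ["API documentation:", api_docs, "", f"Goal: {goal}"]
    for action, observation in history:
        lines.append(f"Action: {action}")
        lines.append(f"Observation: {observation}")
    lines.append("Action:")
    return "\n".join(lines)


def run_episode(
    goal: str,
    api_docs: str,
    generate_action: Callable[[str], str],        # hypothetical LLM wrapper
    execute: Callable[[str], Tuple[str, bool]],   # hypothetical tool environment
    max_steps: int = 5,
) -> History:
    """Generate API calls until the environment reports success or steps run out."""
    history: History = []
    for _ in range(max_steps):
        prompt = build_prompt(goal, api_docs, history)
        action = generate_action(prompt)
        observation, done = execute(action)
        history.append((action, observation))
        if done:
            break
    return history
```

A single-step task is the special case `max_steps=1`. Passing the model and environment as callables keeps the loop independent of any particular LLM or tool.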

Challenges in Open-source LLMs

Open-source LLMs encounter three main failure modes in tool manipulation: selecting the wrong API, populating its arguments incorrectly, and producing output that is not executable code. While closed LLMs like GPT-4 appear to have internalized knowledge of many APIs, open-source models struggle to select and invoke APIs correctly without explicit usage examples. Even when the API selection is correct, they often fail to generate correct argument values.

Enhancement Techniques

To mitigate these issues, the paper revisits classical techniques from the LLM literature and adapts them to tool manipulation:

Model Alignment with Programmatic Data Generation: This involves instruction tuning with synthetic data created from templates, enabling models to learn API usage effectively.
In-context Demonstration Retrievers: Drawing from retrieval-augmented generation, this module selects contextually similar demonstration examples to inform the model during inference.
System Prompts: Enhanced system prompts give the model clear guidelines that restrict it to generating only executable API calls.
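As a rough illustration of template-based data generation, the sketch below fills a goal template and a matching API-call template with the same slot values to produce (goal, action) training pairs. The `get_weather` API, the templates, and the function name `generate_examples` are hypothetical. Enumerating every combination of values is a simplification chosen for this example and may differ from how the paper samples values.

```python
import itertools
from typing import Dict, List


def generate_examples(
    goal_template: str,
    call_template: str,
    slot_values: Dict[str, List[str]],
) -> List[Dict[str, str]]:
    """Fill both templates with every combination of slot values.

    Each returned dict pairs a natural-language goal with the API call
    that should satisfy it, ready to be used as an instruction-tuning example.
    """
    names = list(slot_values)
    examples = []
    for combo in itertools.product(*(slot_values[name] for name in names)):
        filled = dict(zip(names, combo))
        examples.append(
            {
                "goal": goal_template.format(**filled),
                "action": call_template.format(**filled),
            }
        )
    return examples
```

For instance, the templates `"What is the weather in {city} in {unit}?"` and `'get_weather(city="{city}", unit="{unit}")'` with two cities and two units yield four pairs. A developer writes a handful of such templates per tool and lets the program produce the rest.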

Figure 2: Use of all-shot loss for model alignment, highlighting the concatenation of examples and backpropagation through blue actions.

Evaluation and Results

ToolBench, a benchmark suite of diverse software tools for real-world tasks, serves as the testing ground for these enhancements. In the empirical evaluation, the techniques boost the success rate of leading open-source LLMs by up to 90%, making them competitive with GPT-4 in 4 of the 8 ToolBench tasks. The supervision required is roughly one developer day per tool for data curation, which underscores the practical viability of the approach.

Figure 3: Spearman's correlation coefficient comparisons for complexity score and error rate, showcasing improvements in tasks.
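For readers unfamiliar with the statistic named in Figure 3, Spearman's correlation coefficient is the Pearson correlation computed on ranks. The sketch below is a plain-Python version with average ranks for ties. The function names are illustrative, and how the paper defines its complexity score is not shown here.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (std_x * std_y)
```

Calling `spearman(complexity_scores, error_rates)` with one entry per task gives a value in [-1, 1]. A value near 1 would mean that tasks with higher complexity scores tend to have higher error rates.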

Conclusion

This research shows that open-source LLMs can reach tool manipulation capabilities comparable to those of closed LLM APIs. By addressing the key failure modes with programmatic alignment, system prompts, and retrieved demonstrations, these models achieve substantial improvements with a practical amount of human supervision. Future work could build on these insights by exploring richer integration of feedback mechanisms and hybrid learning across tasks.
