TOOLVERIFIER: Generalization to New Tools via Self-Verification
Abstract: Teaching LLMs to use tools is an important milestone towards building general assistants, but remains an open problem. While there has been significant progress on learning to use specific tools via fine-tuning, LLMs still struggle to robustly use new tools from only a few demonstrations. In this work we introduce a self-verification method which distinguishes between close candidates by self-asking contrastive questions during (1) tool selection and (2) parameter generation. To this end, we construct high-quality, synthetic, self-generated training data using Llama-2 70B, which we intend to release publicly. Extensive experiments on 4 tasks from the ToolBench benchmark, covering 17 unseen tools, demonstrate an average improvement of 22% over few-shot baselines, even in scenarios where the distinctions between candidate tools are finely nuanced.
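The two-stage procedure in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `llm` is a stub standing in for a real model call (the paper uses Llama-2 70B), and the word-overlap scoring is a hypothetical stand-in for the model's actual ranking. The structure it shows is the paper's idea: first rank candidate tools, then pose a contrastive question that pits the two closest candidates against each other before committing.

```python
def word_overlap(a: str, b: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def llm(prompt: str, choices: list[str]) -> str:
    """Stub LLM call: returns the choice most similar to the prompt.
    A real system would query an actual model here (e.g. Llama-2 70B)."""
    return max(choices, key=lambda c: word_overlap(prompt, c))

def select_tool(query: str, tools: dict[str, str]) -> str:
    """Tool selection with contrastive self-verification.

    Stage 1 ranks tools by an initial (stub) relevance score; stage 2
    self-verifies by asking a contrastive question that forces a choice
    between the top two candidates.
    """
    # Stage 1: initial ranking of tool names by description relevance.
    ranked = sorted(tools, key=lambda name: -word_overlap(query, tools[name]))
    top, runner_up = ranked[0], ranked[1]

    # Stage 2: contrastive self-verification between the two closest
    # candidates -- the model must explicitly prefer one description.
    contrastive = (
        f"Question: {query}\n"
        f"Which tool fits better: '{top}' ({tools[top]}) "
        f"or '{runner_up}' ({tools[runner_up]})?"
    )
    answer = llm(contrastive, [tools[top], tools[runner_up]])
    return top if answer == tools[top] else runner_up
```

The same pattern applies to parameter generation: after drafting parameter values, the model self-asks a verification question about each value before finalizing the call.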