Hammer: Robust Function-Calling for On-Device Language Models via Function Masking
Abstract: LLMs have demonstrated impressive value as autonomous agents when equipped with external tools and API calls. Nonetheless, effectively harnessing their potential for complex tasks depends crucially on improvements in their function-calling capabilities. This paper identifies a critical gap in existing function-calling models: performance varies significantly across benchmarks, often because the models are misled by specific naming conventions. To address this issue, we introduce Hammer, a novel family of foundation models specifically engineered for on-device function calling. Hammer employs an augmented dataset that enhances the models' sensitivity to irrelevant functions and incorporates function masking techniques to minimize the influence of misleading names. Our empirical evaluations show that Hammer not only outperforms larger models but also generalizes robustly across diverse benchmarks, achieving state-of-the-art results. Our open-source contributions include a specialized dataset for irrelevance detection, a tuning framework for enhanced generalization, and the Hammer models, establishing a new standard for function-calling performance.
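The function-masking idea described above can be illustrated with a minimal sketch. This is an assumption about the mechanism based on the abstract, not the paper's actual implementation: function names in the tool schema are replaced with uninformative placeholders (so the model must rely on descriptions and parameters rather than suggestive names), and a predicted call is mapped back to the real name afterward. The helper names `mask_functions` and `unmask_call` are hypothetical.

```python
def mask_functions(tools):
    """Replace function names with generic placeholders so the model
    relies on descriptions/parameters rather than possibly misleading names."""
    mapping = {}
    masked = []
    for i, tool in enumerate(tools):
        placeholder = f"func_{i}"
        mapping[placeholder] = tool["name"]
        masked_tool = dict(tool)           # shallow copy; keep original intact
        masked_tool["name"] = placeholder
        masked.append(masked_tool)
    return masked, mapping

def unmask_call(call, mapping):
    """Map a model-predicted call on a masked schema back to the real name."""
    restored = dict(call)
    restored["name"] = mapping.get(call["name"], call["name"])
    return restored

# Hypothetical tool schemas for illustration.
tools = [
    {"name": "get_weather", "description": "Return current weather for a city.",
     "parameters": {"city": "string"}},
    {"name": "send_email", "description": "Send an email to a recipient.",
     "parameters": {"to": "string", "body": "string"}},
]

masked, mapping = mask_functions(tools)
# Suppose the model, shown only the masked schema, predicts this call:
call = {"name": "func_0", "arguments": {"city": "Paris"}}
print(unmask_call(call, mapping)["name"])  # get_weather
```

Under this reading, masking applies at both tuning and inference time, so benchmark-specific naming conventions cannot be memorized or exploited.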