
Berkeley Function Calling Leaderboard v4

Updated 10 February 2026
  • BFCLv4 is an academic benchmark that measures large language models’ proficiency in selecting, formatting, and executing API calls from natural language queries.
  • It employs a multi-domain, zero-shot evaluation protocol with strict JSON schema adherence and metrics like AST, execution, and relevance accuracy.
  • The leaderboard compares closed and open models, providing actionable insights through detailed metrics and robustness evaluations against adversarial inputs.

The Berkeley Function Calling Leaderboard v4 (BFCLv4) is an academic benchmark for evaluating the function-calling capabilities of LLMs. It systematically quantifies models’ proficiency in selecting, formatting, and executing calls to external APIs or functions based on natural language user requests, across a diverse range of domains and levels of task complexity. BFCLv4 is recognized as a de facto standard for empirical evaluation in function-calling agent research, offering rigorous, multi-faceted metrics and a high degree of transparency and reproducibility.

1. Benchmark Design and Evaluation Protocol

BFCLv4 is designed as a multi-domain, multi-language testbed encompassing both synthetic and real-world function-calling scenarios. Each benchmark instance presents the model with a natural language query, a schema of available functions/APIs (including argument names, types, and return types), and, in some cases, multi-turn dialogue histories (Liu et al., 2024, Abdelaziz et al., 2024, Zhang et al., 2024).

The task for the model is to produce either:

  • A valid function call in canonical JSON or code format, or
  • "no_call" if no API in the provided schema is suitable.

The required output JSON format is:

{
  "name": "<function_name>",
  "args": { "param1": value1, ... }
}
with strict type-matching (e.g., quoted strings, unquoted numerics, literal booleans).
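A minimal sketch of such a strict type check, assuming a simplified schema representation (the function name, parameter names, and types below are illustrative, not part of the benchmark):

```python
import json

# Simplified schema: parameter names mapped to required Python types.
# "get_weather" and its parameters are hypothetical examples.
SCHEMA = {
    "name": "get_weather",
    "params": {"city": str, "days": int, "metric": bool},
}

def validate_call(raw: str, schema: dict) -> bool:
    """Parse a JSON function call and enforce strict type matching."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if call.get("name") != schema["name"]:
        return False
    args = call.get("args", {})
    # Every supplied argument must exist in the schema with the exact type:
    # quoted strings -> str, unquoted numerics -> int, literals -> bool.
    return all(
        k in schema["params"] and type(v) is schema["params"][k]
        for k, v in args.items()
    )

print(validate_call('{"name": "get_weather", "args": {"city": "Berkeley", "days": 3}}', SCHEMA))  # True
print(validate_call('{"name": "get_weather", "args": {"days": "3"}}', SCHEMA))  # quoted numeric -> False
```

Note that `type(v) is ...` rather than `isinstance` is deliberate: it rejects a boolean passed where an integer is required, mirroring the benchmark's strict type-matching rule.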

BFCLv4 aggregates multiple task categories:

  • Simple: single-API, single-call queries
  • Multiple: multiple candidate APIs, single call
  • Parallel: repeated calls to a single API
  • Parallel Multiple: multiple APIs and multiple calls in a single response
  • Relevance Detection: queries for which none of the presented APIs are relevant (Liu et al., 2024, Abdelaziz et al., 2024)

Zero-shot evaluation is the standard regime: no in-context learning or prompt exemplars.
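To make the categories concrete, here is a hedged sketch of what a Parallel Multiple response might look like as a JSON array of call objects (the function names and arguments are hypothetical, not drawn from the benchmark):

```python
import json

# Hypothetical "Parallel Multiple" response: two distinct APIs,
# three calls total, emitted in a single model turn.
raw = """
[
  {"name": "get_weather", "args": {"city": "Berkeley"}},
  {"name": "get_weather", "args": {"city": "Palo Alto"}},
  {"name": "convert_units", "args": {"value": 72, "from_unit": "F", "to_unit": "C"}}
]
"""
calls = json.loads(raw)
print(len(calls), sorted({c["name"] for c in calls}))  # 3 ['convert_units', 'get_weather']
```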

2. Core Metrics and Scoring Formulation

BFCLv4 employs orthogonal, automated evaluation metrics with category-specific reporting (Liu et al., 2024, Abdelaziz et al., 2024, Zhang et al., 2024):

  • AST Accuracy (ASTAcc): Fraction of predictions whose parsed abstract syntax tree (function name + argument names/types) exactly matches the gold reference.
  • Execution Accuracy (ExecAcc): Fraction of calls that, when invoked in a sandbox, yield the ground-truth output or the correct HTTP response.
  • Irrelevance Detection (IrrelAcc): Fraction of “no_call” cases where the model correctly abstains from outputting a function call.
  • Relevance Detection (RelAcc): Fraction of queries requiring calls for which the model emits at least one correct function call.
  • Overall Score: Default in v4 is the (unweighted or category-weighted) average of the four principal metrics (Zhang et al., 2024):

\mathrm{Overall} = \frac{1}{4}\left(\mathrm{ASTAcc} + \mathrm{ExecAcc} + \mathrm{IrrelAcc} + \mathrm{RelAcc}\right) \times 100

  • Relevance F1 (Relevance Detection split):

\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

Per-category (Simple, Multiple, Parallel, Parallel Multiple) and global overall metrics are published, ensuring fine-grained diagnostics and aggregate comparability.
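The aggregate scoring above can be sketched in a few lines; the example input is the GPT-4-0125 row from the results table in Section 4, rescaled to [0, 1]:

```python
def overall_score(ast_acc: float, exec_acc: float,
                  irrel_acc: float, rel_acc: float) -> float:
    """Unweighted mean of the four principal metrics, scaled to 0-100."""
    return (ast_acc + exec_acc + irrel_acc + rel_acc) / 4 * 100

def relevance_f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall on the relevance split."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(overall_score(0.8550, 0.8925, 0.6135, 0.9756))  # ~83.42, matching the table
```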

3. Dataset Construction and Verification Methodologies

Dataset curation is a multi-stage, semi-automated process that combines diverse data sources with rigorous verification:

  • Synthetic Data Synthesis: Leading pipelines such as ToolACE and APIGen utilize hierarchical API evolution, multi-agent dialog simulation, and combinatorial sampling to generate diverse function schemas and user queries across 21–30 high-level categories and 3,000+ endpoints (Liu et al., 2024, Liu et al., 2024).
  • Verification Stages:
    • Format Checking: Structural adherence to JSON schema, argument validity, and parseability.
    • Execution Checking: Candidate calls are executed in sandboxed environments; runtime errors lead to rejection.
    • Semantic Verification: LLM-judges validate call correctness and query alignment, often via majority-vote or decomposed expert prompting.
  • Quality bottlenecks are quantitatively monitored (e.g., pass rates of ~60% after all filters in ToolACE, final precision ~95% post-APIGen pipeline) (Liu et al., 2024, Liu et al., 2024).
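The three verification stages can be sketched as a short pipeline. Only the format check below is functional; the execution and semantic stages are stubs standing in for a sandbox runner and an LLM judge, not the actual ToolACE/APIGen code:

```python
import json

def format_check(raw: str):
    """Stage 1: structural adherence -- parseable JSON with a string name
    and a dict of arguments. Returns the parsed call, or None on failure."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if isinstance(call, dict) and isinstance(call.get("name"), str) \
            and isinstance(call.get("args"), dict):
        return call
    return None

def execution_check(call: dict) -> bool:
    """Stage 2 (stub): invoke the call in a sandbox; runtime errors
    would lead to rejection."""
    return True  # placeholder verdict

def semantic_check(call: dict, query: str) -> bool:
    """Stage 3 (stub): LLM-judge vote on call correctness and
    query alignment."""
    return True  # placeholder verdict

def verify(raw: str, query: str) -> bool:
    """Run the stages in order; a candidate must pass all three."""
    call = format_check(raw)
    return call is not None and execution_check(call) and semantic_check(call, query)

print(verify('{"name": "get_time", "args": {"tz": "PST"}}', "what time is it?"))  # True
print(verify('{"name": 42}', "what time is it?"))  # False: name is not a string
```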

Notably, adversarial and edge-case data generation is actively used to prevent models from overfitting to "easy" schema patterns, with adversarial refinement loops embedded in data pipelines (Zhang et al., 2024).

4. Model Training Paradigms and Leaderboard Performance

The leaderboard catalogs both closed-source and open-source models, evaluated strictly on held-out test sets. Key methodologies represented include:

Representative Results

A selection of models and their metrics is summarized below (all results taken directly from the source papers):

| Model | ASTAcc | ExecAcc | IrrelAcc | RelAcc | Overall |
|---|---|---|---|---|---|
| GPT-4-0125 | 85.50 | 89.25 | 61.35 | 97.56 | 83.42 |
| Meta-Llama-3-70B | 80.15 | 88.04 | 50.47 | 92.68 | 77.84 |
| xLAM-7B-fc-r | 72.77 | 85.68 | 79.76 | 80.49 | 79.17 |
| ADC (LLM+Adv Datasets) | 70.46 | 87.50 | 75.67 | 82.89 | 79.13 |
| ToolACE-8B | ≈92 | ≈90 | ≈89 | -- | ≈91.5 |
| Qwen2.5-Coder-7B-Instruct (FunRL) | 90.40 | 81.64 | -- | -- | 86.02 |
| Granite-20B-FunctionCalling | 84.11 | 86.50 | 87.08 | -- | 84.71 |
| RC-GRPO (Qwen-2.5-7B-Instruct) | -- | -- | -- | -- | 85.0 |

In BFCLv4 rankings, ToolACE’s 8B model consistently surpasses GPT-4 and Claude-3.5 in overall accuracy, and Granite-20B leads all openly licensed models. Recent reinforcement learning approaches (FunRL, RC-GRPO) on code-pretrained or reward-conditioned models now match or exceed proprietary systems (Liu et al., 2024, Hao et al., 7 Aug 2025, Zhong et al., 3 Feb 2026).

5. Robustness, Generalization, and Failure Modes

Recent investigations highlight the sensitivity of function-calling agents to perturbations:

  • Naturalistic Paraphrases: Rephrasings of the input query often cause exact-match AST accuracy to drop by 13–19 percentage points.
  • Toolkit Expansion: Introducing semantically related distractor functions yields an average drop of 1–8 points; most errors are due to wrong function selection or incorrect parameter mapping (Rabinovich et al., 1 Apr 2025).

Robustness metrics include:

R_{\mathrm{nat}} = \frac{\mathrm{AST}_{\mathrm{nat}}}{\mathrm{AST}_0}, \qquad R_{\mathrm{stab}} = \frac{\mathrm{AST}_{\mathrm{exp}}}{\mathrm{AST}_0}

These analyses underscore the need for model architectures and training regimes that explicitly target semantic slot-value matching and adversarial query coverage.
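Computing these ratios is straightforward; the AST accuracy values below are illustrative, not leaderboard numbers:

```python
def robustness_ratios(ast_base: float, ast_nat: float, ast_exp: float):
    """R_nat: fraction of base AST accuracy retained under naturalistic
    paraphrase; R_stab: fraction retained under toolkit expansion."""
    return ast_nat / ast_base, ast_exp / ast_base

# Illustrative values: a model at 85% base AST accuracy that drops to
# 70% under paraphrase and 80% with distractor tools added.
r_nat, r_stab = robustness_ratios(ast_base=0.85, ast_nat=0.70, ast_exp=0.80)
print(f"R_nat={r_nat:.2f}  R_stab={r_stab:.2f}")  # R_nat=0.82  R_stab=0.94
```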

6. Advances in Data Synthesis and Multi-turn Function Calling

Emerging methodologies for overcoming data and supervision bottlenecks include:

  • Structured Data Synthesis: Frameworks such as FunReason-MT leverage environment–API interaction graphs, advanced abstraction, and guided iterative reasoning to generate high-quality, logically dependent multi-turn trajectories (Xu et al., 28 Oct 2025).
  • Multi-agent and Reward-conditioned RL: RC-GRPO combines reward-conditioned trajectory policy pretraining and group-relative RL, yielding state-of-the-art multi-turn tool calling results and demonstrating the necessity of explicit reward diversity to prevent reward collapse (Zhong et al., 3 Feb 2026).

In multi-turn (3–10 step) environments, trajectory-level binary rewards aggregate state accuracy and action correctness, with overall accuracy on realistic agentic tool-use tasks now improved significantly over early SFT and standard RL baselines.
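A minimal sketch of such a trajectory-level binary reward, assuming an all-steps-correct aggregation rule (an illustration only, not the exact RC-GRPO formulation):

```python
def trajectory_reward(steps) -> float:
    """steps: list of (state_correct, action_correct) booleans, one per turn.
    The whole trajectory earns reward 1.0 only if every step's resulting
    state and chosen action are both correct; any failure zeroes the reward."""
    return 1.0 if steps and all(s and a for s, a in steps) else 0.0

print(trajectory_reward([(True, True), (True, True), (True, True)]))  # 1.0
print(trajectory_reward([(True, True), (False, True)]))               # 0.0
```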

7. Implications, Best Practices, and Limitations

BFCLv4 provides an empirically robust platform for evaluating both closed and open models. Rigorous data validation, task diversity, and zero-shot protocols foster results with substantial generalizability. Leading papers recommend:

  • Adopting AST-parsing, execution validation, and LLM-judge based semantic checking for both data curation and evaluation
  • Integrating adversarial perturbation and tool expansion both in training and robustness testing (Zhang et al., 2024, Rabinovich et al., 1 Apr 2025)
  • Designing training pipelines with process supervision, reward diversity, and staged format/semantic objectives

Current limitations include the benchmark’s primary focus on JSON-style single/multi-turn calls, with active research aimed at incorporating richer toolchains, more diverse API paradigms, and improved real-world deployment relevance (Liu et al., 2024, Rabinovich et al., 1 Apr 2025).
