Salesforce XLAM Function Calling Dataset
- Salesforce XLAM Function Calling Dataset is a comprehensive collection of 60,000 examples generated through a multi-stage APIGen pipeline, ensuring high syntactic, execution, and semantic accuracy.
- It covers 21 diverse domains and 3,673 APIs, offering robust data for training and evaluating language models in function-calling tasks.
- The dataset's hierarchical quality checks enable precise benchmarking and fine-tuning of models, improving both function execution and relevance detection.
The Salesforce XLAM Function Calling Dataset is a large-scale, rigorously verified collection of function-calling examples, targeting the training and evaluation of language agent models in function-calling domains. Generated using the APIGen pipeline, the dataset comprises 60,000 high-quality entries, each representing user queries paired with one or more function/API calls and their structured arguments, across 21 domains and 3,673 distinct APIs. Its multi-stage automated validation ensures not only syntactic and executable correctness but also semantic alignment with user intent, enabling development and benchmarking of advanced function-calling LLM agents (Liu et al., 2024).
1. Automated Data Generation and Verification Pipeline
The dataset is synthesized via the APIGen pipeline, a three-stage process ensuring hierarchical filtering of generated examples:
Stage 1: Format Checker
- Input: LLM-generated JSON string.
- Checks enforce strict JSON parseability, presence of required top-level fields (“query”, “answers”), and validation that each function call refers to a sampled API with correctly structured arguments.
- Failure modes: malformed JSON, missing fields, or hallucinated functions/parameters.
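A minimal sketch of such a format check (the function name and the tool-schema layout are assumptions for illustration, not the pipeline's actual code):

```python
import json

def check_format(raw: str, known_tools: dict) -> bool:
    """Illustrative Stage 1 check: JSON parseability, required top-level
    fields, and that every call references a known API with known parameters."""
    try:
        entry = json.loads(raw)
    except json.JSONDecodeError:
        return False  # malformed JSON
    if "query" not in entry or "answers" not in entry:
        return False  # missing required top-level fields
    for call in entry["answers"]:
        tool = known_tools.get(call.get("name"))
        if tool is None:
            return False  # hallucinated function
        if not set(call.get("arguments", {})) <= set(tool["parameters"]):
            return False  # hallucinated parameter
    return True
```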
Stage 2: Execution Checker
- Input: Formally correct function call entries.
- Python functions are imported and invoked in a controlled subprocess; REST APIs are accessed via HTTP requests using the specified arguments.
- Pass criteria require type safety, argument completeness, absence of runtime errors or timeouts, and valid results.
- Failure modes: runtime exception, invalid parameter structure, or network errors.
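The Python side of this stage can be sketched as follows; `execute_call` and its subprocess layout are illustrative assumptions, not APIGen's implementation:

```python
import json
import subprocess
import sys

def execute_call(module: str, func: str, arguments: dict, timeout: float = 10.0):
    """Illustrative Stage 2 check: invoke the function in a fresh
    subprocess so crashes and hangs cannot take down the pipeline.
    Returns the JSON-decoded result on success, None on failure."""
    code = (
        f"import json, {module}\n"
        f"args = json.loads({json.dumps(json.dumps(arguments))})\n"
        f"print(json.dumps({module}.{func}(**args)))"
    )
    try:
        proc = subprocess.run([sys.executable, "-c", code],
                              capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None  # fail: timeout
    if proc.returncode != 0:
        return None  # fail: runtime exception
    return json.loads(proc.stdout)  # pass: valid, serializable result
```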
Stage 3: Semantic Checker
- Inputs: (query, function calls, execution results).
- A separate LLM evaluates whether the function choices, argument values, and call multiplicity fully satisfy the user’s query.
- Output: a structured JSON verdict containing a binary "pass"/"fail" decision.
- Failure is triggered by partial or irrelevant outputs, semantic drift, or accidental correctness.
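A hedged sketch of this judging step, with the LLM client abstracted as a plain `prompt -> str` callable; the template wording is invented for illustration and is not the paper's actual prompt:

```python
import json

JUDGE_TEMPLATE = """You are verifying a function-calling example.
User query: {query}
Function calls: {calls}
Execution results: {results}
Do the calls and results fully satisfy the query?
Reply with JSON: {{"thought": "...", "decision": "pass" or "fail"}}"""

def semantic_check(query, calls, results, llm) -> bool:
    """Illustrative Stage 3 check: a separate LLM judges whether the
    calls and results align with the query. `llm` is any callable
    mapping a prompt string to a response string."""
    prompt = JUDGE_TEMPLATE.format(
        query=query,
        calls=json.dumps(calls),
        results=json.dumps(results),
    )
    try:
        verdict = json.loads(llm(prompt))
    except json.JSONDecodeError:
        return False  # an unparseable judgment counts as a failure
    return verdict.get("decision") == "pass"
```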
To formalize filtering effectiveness, the stage-wise pass rates are defined over the number of entries surviving each stage (N_gen generated, N_fmt passing the format check, N_exec passing execution, N_sem passing the semantic check):
- format pass rate = N_fmt / N_gen
- execution pass rate = N_exec / N_fmt
- semantic pass rate = N_sem / N_exec
- overall yield = N_sem / N_gen = (format rate) × (execution rate) × (semantic rate)
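Because each stage only sees entries that survived the previous one, the stage rates compose multiplicatively into the overall yield; a few lines of Python make this concrete (the counts are placeholder arguments, not reported figures):

```python
def pipeline_metrics(n_generated, n_format, n_exec, n_semantic):
    """Stage-wise pass rates for the hierarchical filter. Each stage
    only sees survivors of the previous one, so the overall yield is
    the product of the three stage rates."""
    return {
        "format_pass_rate": n_format / n_generated,
        "execution_pass_rate": n_exec / n_format,
        "semantic_pass_rate": n_semantic / n_exec,
        "overall_yield": n_semantic / n_generated,
    }
```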
2. Dataset Structure and Content
The released dataset consists of 60,000 entries, each structured as a JSON object with schema:
"query": The user’s natural language task."tools": An array of API descriptors (name, description, parameter schema)."answers": An array of function-call records specifying API name and arguments.
APIs are derived from both REST (3,539) and Python (134) endpoints, spanning 21 consolidated categories; examples include finance, weather, social media, education, mathematics, sports, technology, travel, and health. The category distribution has a moderate head, with the largest categories each accounting for roughly 10–12% of entries, followed by a long tail of specialized APIs.
Example of a simple call:
```json
{
  "query": "What is the weather in Palo Alto?",
  "tools": [
    {
      "name": "weather_api.get_current_weather",
      "description": "Retrieves the current weather.",
      "parameters": {
        "location": { "type": "string", "required": true },
        "units": { "type": "string", "required": false }
      }
    }
  ],
  "answers": [
    {
      "name": "weather_api.get_current_weather",
      "arguments": { "location": "Palo Alto", "units": "Celsius" }
    }
  ]
}
```
Example of a parallel/multi-call:
```json
{
  "query": "Sum multiples of 3 & 5 up to 1000, and product of first five primes.",
  "tools": [
    { "name": "math_toolkit.sum_of_multiples", ... },
    { "name": "math_toolkit.product_of_primes", ... }
  ],
  "answers": [
    { "name": "math_toolkit.sum_of_multiples", "arguments": { "lower_limit": 1, "upper_limit": 1000, "multiples": [3, 5] } },
    { "name": "math_toolkit.product_of_primes", "arguments": { "count": 5 } }
  ]
}
```
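Replaying such an entry only requires resolving each recorded name against a registry of callables. The two `math_toolkit` implementations below are hypothetical stand-ins written for this sketch, not the dataset's actual backing functions:

```python
def sum_of_multiples(lower_limit, upper_limit, multiples):
    """Sum of integers in [lower_limit, upper_limit] divisible by any listed multiple."""
    return sum(n for n in range(lower_limit, upper_limit + 1)
               if any(n % m == 0 for m in multiples))

def product_of_primes(count):
    """Product of the first `count` primes, via trial division."""
    primes, n = [], 2
    while len(primes) < count:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    result = 1
    for p in primes:
        result *= p
    return result

# Registry-based dispatcher that replays the "answers" array in order,
# the way an agent runtime would execute the recorded calls.
REGISTRY = {
    "math_toolkit.sum_of_multiples": sum_of_multiples,
    "math_toolkit.product_of_primes": product_of_primes,
}

def replay(answers):
    return [REGISTRY[call["name"]](**call["arguments"]) for call in answers]
```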
3. Quality Filtering and Validation Metrics
The APIGen pipeline achieves high yields of correct data, with pass rates varying substantially based on LLM generation source. Quantitative breakdown from a 40,000-sample generation:
- DeepSeek-V2-Chat (236B): 33,659 verified (84.15% yield)
- Mixtral-8x22B-Inst: 26,384 verified (65.96% yield)
- Mixtral-8x7B-Inst: 15,385 verified (38.46% yield)
- DeepSeek-Coder-33B-Inst: 13,769 verified (34.42% yield)
A human spot check of 600 sampled examples estimates 95.3% correctness. The three-stage verification systematically filters format, execution, and semantic errors, reducing accidental correctness and confirming agent-relevant function–argument alignment.
On the evaluation side (post-training), the two principal metrics adopted from the BFCL are:
- AST Accuracy: proportion of cases where the predicted function-call abstract syntax tree exactly matches the canonical ground-truth AST.
- Executable Accuracy: proportion of predicted calls that execute without error and return valid results.
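AST matching can be illustrated with Python's `ast` module; this simplified checker is written for this article (not BFCL's actual implementation) and treats two call strings as equivalent when the function name, positional arguments, and keyword arguments all agree, regardless of keyword order:

```python
import ast

def ast_match(predicted: str, ground_truth: str) -> bool:
    """Illustrative AST comparison in the spirit of BFCL's AST accuracy."""
    try:
        p = ast.parse(predicted, mode="eval").body
        g = ast.parse(ground_truth, mode="eval").body
    except SyntaxError:
        return False
    if not (isinstance(p, ast.Call) and isinstance(g, ast.Call)):
        return False
    # Same function name.
    if ast.unparse(p.func) != ast.unparse(g.func):
        return False
    # Same positional arguments, in order.
    if [ast.unparse(a) for a in p.args] != [ast.unparse(a) for a in g.args]:
        return False
    # Same keyword arguments, order-insensitive.
    p_kw = {k.arg: ast.unparse(k.value) for k in p.keywords}
    g_kw = {k.arg: ast.unparse(k.value) for k in g.keywords}
    return p_kw == g_kw
```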
4. Model Benchmarking and Comparative Performance
Function-calling models fine-tuned on this dataset are evaluated on the 2,000-case Berkeley Function-Calling Leaderboard (BFCL), comparing accuracy across several query/call styles (“Simple”, “Multiple”, “Parallel”, “Parallel-Multiple”). As of June 2024, salient results include:
| Model | Overall Acc (%) | Simple AST (%) | Parallel AST (%) | Relevance (%) | Rank |
|---|---|---|---|---|---|
| xLAM-7B (FC) | 85.65 | 80.55 | 90.00 | 80.42 | 6 |
| xLAM-1B (FC) | 74.41 | 75.09 | 76.50 | 72.08 | 24 |
| GPT-4-Preview | 88.00 | — | — | — | 1 |
| Claude-3-Haiku | 74.29 | — | — | — | 25 |
| GPT-3.5-Turbo | 63.88 | — | — | — | 33 |
xLAM-7B comes within about 2.4 points of GPT-4's function-calling accuracy despite having less than 10% of its estimated parameter count; xLAM-1B edges out Claude-3-Haiku and outperforms GPT-3.5-Turbo by more than ten points. These results substantiate the benefit of the APIGen data generation pipeline and its validation stages (Liu et al., 2024).
5. Usage and Fine-Tuning Procedures
Dataset loading is standardized via the Hugging Face `datasets` library:
```python
from datasets import load_dataset

ds = load_dataset("Salesforce/xlam-function-calling-60k")
train = ds["train"]  # 60,000 entries
```
```python
import json

example = train[0]
print("Query:", example["query"])
print("Tools:", example["tools"])
print("Answer calls:", example["answers"])

# Decode "answers" if the release stores it as a JSON-encoded string.
answers = (json.loads(example["answers"])
           if isinstance(example["answers"], str) else example["answers"])
call = answers[0]
func_name = call["name"]
args = call["arguments"]

# Illustrative dispatch; `weather_api` stands in for a real client module.
if func_name == "weather_api.get_current_weather":
    result = weather_api.get_current_weather(**args)
```
For fine-tuning, guidelines from the APIGen paper specify:
- Base model: instruction-tuned LLM supporting function-calling and JSON output.
- Optimization: AdamW with a cosine learning-rate schedule (50 warmup steps).
- Per-device batch size: 6; gradient accumulation: 2; epochs: 4; max sequence length: 2048; bfloat16 precision.
- Inclusion of ~8,000 “relevance detection” entries to teach the model to refuse unrelated/irrelevant queries.
- Regular monitoring of stagewise pass rates on held-out APIGen data to mitigate overfitting.
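As a concrete illustration of the data preparation these guidelines imply, a single entry can be serialized into a (prompt, target) pair for supervised fine-tuning. The template below is an assumed layout written for this sketch; the actual chat format is model-specific:

```python
import json

def to_training_pair(entry):
    """Serialize one dataset entry into a (prompt, target) pair.
    Handles "tools"/"answers" stored either as lists or as
    JSON-encoded strings."""
    tools = (json.loads(entry["tools"])
             if isinstance(entry["tools"], str) else entry["tools"])
    answers = (json.loads(entry["answers"])
               if isinstance(entry["answers"], str) else entry["answers"])
    prompt = (
        "You may call the following tools:\n"
        + json.dumps(tools, indent=2)
        + "\n\nUser: " + entry["query"] + "\nCalls:"
    )
    target = json.dumps(answers)  # the model learns to emit the call array
    return prompt, target
```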
6. Significance and Research Directions
The Salesforce/xlam-function-calling-60k dataset sets a new standard for the rigor and breadth of function-calling benchmarks. Its hierarchical quality filters—format, executable, and semantic—produce data with high fidelity for agent-centric LLM research. The benchmarks show that function-calling performance of open models can approach or match commercial LLMs with orders-of-magnitude fewer parameters, which is notable for both academic research and practical application development.
The dataset’s public availability and comprehensive verification protocol encourage reproducibility and represent a reference for the development of function-calling agents that must generalize over a wide spectrum of APIs and user intents. The structured approach to “relevance detection” further supports research into model refusal behaviors and alignment (Liu et al., 2024).