DiaFORGE Pipeline: Robust API Invocation
- The DiaFORGE pipeline is a disambiguation-centric three-stage framework that improves LLM reliability in handling near-duplicate APIs and underspecified user inputs.
- It synthesizes persona-driven, multi-turn dialogues and applies supervised fine-tuning with explicit reasoning traces to guide accurate API selection.
- Dynamic evaluation via DiaBENCH reveals significant tool-calling accuracy gains over baseline models in complex enterprise scenarios.
DiaFORGE (Dialogue Framework for Organic Response Generation & Evaluation) is a disambiguation-centric, three-stage pipeline designed to enhance the reliability of LLMs in enterprise tool-calling scenarios characterized by near-duplicate APIs and frequent input underspecification. The framework specifically targets the limitations of off-policy, single-turn benchmarks, which fail to expose cascading errors arising from premature or incorrect API calls. By incorporating synthetic, persona-driven multi-turn dialogues, supervised fine-tuning with explicit reasoning traces, and stringent dynamic evaluation, DiaFORGE provides a comprehensive methodology for constructing and validating LLM-based agents capable of robust, goal-directed API invocation in complex enterprise settings (Hathidara et al., 4 Jul 2025).
1. Motivation and Architectural Overview
Modern enterprises expose large tool catalogues with thousands of APIs, many distinguished only by minor variants (e.g., CreateCustomer vs. CreateUser). User requests tend to be ambiguous or lack required arguments, challenging LLMs that must correctly select from highly similar tools and elicit missing information. Premature or incorrect API selection in these contexts can yield cascading failures, an error class not uncovered by static, single-turn off-policy benchmarks.
DiaFORGE addresses these challenges through a pipeline comprising:
- Synthetic Dialogue Generation (Stage I) using persona-driven, multi-turn conversations that force disambiguation and argument elicitation.
- Supervised Fine-Tuning (Stage II) of open-source LLMs, tightly coupling model outputs with reasoning traces.
- Dynamic Evaluation Suite (Stage III; DiaBENCH) deploying the model in an agentic loop to measure real-world robustness and goal completion under live conditions.
2. Synthetic Dialogue Corpus Generation
2.1 Identifying Near-Duplicate Tools and Underspecified Arguments
The process begins with an enterprise tool catalogue
$\mathcal{T} = \{\,\tau_i=(\mathrm{name}_i,\ \mathrm{desc}_i,\ \mathrm{params}_i)\,\}_{i=1}^{|\mathcal{T}|}$
Each tool $\tau$ is associated with a JSON-Schema parameter set and a set of required slots $\mathcal{R}(\tau)$. To construct realistic distractor sets, a frozen sentence encoder $\phi$ embeds tool descriptions. For a seed tool $\tau^\star$: $\mathcal{D}_k(\tau^\star) = \operatorname*{arg\,top\text{-}k}_{\tau\neq\tau^\star}\ \langle\,\phi(\tau^\star),\,\phi(\tau)\rangle$ The assistant is provided the candidate set $\mathcal{C}_k(\tau^\star)=\{\tau^\star\}\cup\mathcal{D}_k(\tau^\star)$.
Argument underspecification is enforced by incrementally revealing only a subset of the gold slot-value map, $\mathcal{V}^\star=\{(r_i,v_{r_i})\mid r_i\in\mathcal{R}(\tau^\star)\}$, so that the assistant must actively elicit the remaining required values.
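The distractor-retrieval step above can be sketched as follows. This is a minimal illustration assuming tool descriptions have already been embedded by the frozen encoder $\phi$; the function name and toy vectors are ours, not the paper's:

```python
import numpy as np

def top_k_distractors(seed_idx: int, embeddings: np.ndarray, k: int) -> list[int]:
    """Return indices of the k tools whose description embeddings score
    highest by inner product against the seed tool, excluding the seed."""
    scores = embeddings @ embeddings[seed_idx]  # <phi(tau*), phi(tau)> for all tau
    scores[seed_idx] = -np.inf                  # never retrieve the seed itself
    return list(np.argsort(scores)[::-1][:k])   # arg top-k by similarity

# Toy example: 4 "tool description" embeddings in R^3.
emb = np.array([[1.0, 0.0, 0.0],
                [0.9, 0.1, 0.0],    # near-duplicate of tool 0
                [0.0, 1.0, 0.0],    # unrelated
                [0.1, 0.0, 0.9]])   # mildly related
distractors = top_k_distractors(0, emb, k=2)
candidate_set = [0] + distractors   # C_k(tau*) = {tau*} ∪ D_k(tau*)
```

In practice the near-duplicate (tool 1) ranks first, which is exactly what makes the resulting candidate set a hard disambiguation problem for the assistant.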
2.2 Dialogue Synthesis via UTC-Gen
Dialogue synthesis alternates turns between a user-proxy agent $P_{\theta_u}$ (instantiated with a latent goal $g$ and persona $p$) and the assistant $P_{\theta_a}$. At each turn $t$:
- $u_t\sim P_{\theta_u}\bigl(\cdot\mid\tau^\star,p,g,\mathcal{D}_k,\mathcal{V}^\star,\mathbf{h}^u_t\bigr)$
- $a_t\sim P_{\theta_a}\bigl(\cdot\mid \mathcal{C}_k,\mathbf{h}^a_t\bigr)$
Interaction continues until the assistant issues a schema-conformant tool call to $\tau^\star$ using exactly $\mathcal{V}^\star$, or the turn budget $T_{\max}$ is exhausted. The process is formalized in the pseudocode:
```python
hu, ha = [], []                          # user-view and assistant-view histories
for t in range(1, Tmax + 1):
    ut = sample_user(τ*, p, g, Dk, V*, hu)   # user-proxy turn
    hu.append(ut); ha.append(ut)
    at = sample_assistant(Ck, ha)            # assistant turn
    ha.append(at)
    if is_valid_call(at, τ*, V*):            # schema-conformant call to τ*
        break
validate_and_accept(hu, ha)
```
2.3 Validation Cascade
To ensure data quality, each candidate dialogue is screened through:
- Format Validator: Enforces alternating turns and correct JSON stub structure.
- Tool-Call & Tool-Args Validator: Confirms correct API selection and full required argument coverage.
- Relevancy Validator & LLM Critique: Ensures persona coherence and proper dialogue structure.
Dialogues failing any validation gate are discarded.
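The gating logic above can be sketched as a short-circuiting chain of predicates. The internals of each validator here are illustrative stand-ins (the paper's actual checks include an LLM critique), but the discard-on-first-failure behavior matches the cascade:

```python
def format_ok(dialogue: list[dict]) -> bool:
    # Turns must strictly alternate between user and assistant roles.
    roles = [turn["role"] for turn in dialogue]
    return all(r != nxt for r, nxt in zip(roles, roles[1:]))

def tool_call_ok(dialogue: list[dict], gold_tool: str, gold_args: dict) -> bool:
    # The final assistant turn must call the gold tool and cover every
    # required argument slot.
    call = dialogue[-1].get("tool_call")
    return (call is not None and call["name"] == gold_tool
            and all(k in call["args"] for k in gold_args))

def run_cascade(dialogue, gold_tool, gold_args, extra_gates=()) -> bool:
    # A dialogue failing any gate is discarded (returns False).
    if not format_ok(dialogue):
        return False
    if not tool_call_ok(dialogue, gold_tool, gold_args):
        return False
    return all(gate(dialogue) for gate in extra_gates)

dialogue = [
    {"role": "user", "content": "I need a new customer set up."},
    {"role": "assistant", "content": "",
     "tool_call": {"name": "CreateCustomer",
                   "args": {"name": "ACME", "email": "ops@acme.example"}}},
]
accepted = run_cascade(dialogue, "CreateCustomer", {"name": None, "email": None})
```

Relevancy and persona-coherence checks would be passed in via `extra_gates`, keeping the cascade extensible without changing the earlier structural validators.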
3. Supervised Fine-Tuning with Reasoning Traces
3.1 Model Families and Parameterization
The corpus of validated multi-turn dialogues is used to fine-tune open models:
- Llama-3.2-3B
- Gemma-3-4B, 12B, 27B
- Llama-3.3-Nemotron-Super-49B
- Llama-3.3-70B
3.2 Reasoning Trace Integration
Assistant turns comprise a private reasoning segment plus a public response (with optional "tool_calls" stubs). Both parts are supervised during fine-tuning, so the model learns to produce a private chain-of-thought alongside each public-facing utterance.
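To make the dual-output target concrete, here is one illustrative supervised assistant turn. The `<think>` delimiter and the JSON layout of the public part are assumptions for the sketch, not the paper's exact serialization:

```python
import json

# Private reasoning: why this tool, and what is still missing.
reasoning = ("The user wants to register a new external customer, so "
             "CreateCustomer (not CreateUser) applies; the required "
             "'email' slot is still unfilled, so elicit it before calling.")

# Public response: no tool call yet, because a required argument is missing.
public = {
    "content": "Could you share the customer's email address?",
    "tool_calls": []
}

# The full supervised target for this assistant turn.
target = f"<think>{reasoning}</think>\n{json.dumps(public)}"
```

Supervising both segments jointly is what ties the eventual tool selection to an explicit, inspectable rationale.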
3.3 Training Setup
For each assistant turn in each dialogue, an input/output pair is constructed:
- Loss masking is applied so that gradients are incurred only on assistant-output tokens.
- Fine-tuning uses Low-Rank Adaptation (LoRA).
- Training proceeds for a single epoch, using 8-bit precision and batch size 1.
- Optimizer: AdamW with a cosine learning-rate schedule.
- Objective: next-token negative log-likelihood, $\mathcal{L}(\theta) = -\sum_{t}\log P_\theta(y_t\mid y_{<t})$, computed over the unmasked tokens.
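The loss-masking step can be sketched in a few lines, assuming the common SFT convention that masked positions carry the sentinel label -100 (an assumption; the paper does not name its training framework):

```python
IGNORE = -100  # conventional "no-loss" label id in common SFT tooling

def build_labels(token_ids: list[int], assistant_mask: list[bool]) -> list[int]:
    """Copy assistant-authored token ids into the label sequence and mask
    everything else (system/user/tool tokens) so it incurs no gradient."""
    return [tid if is_asst else IGNORE
            for tid, is_asst in zip(token_ids, assistant_mask)]

# Toy sequence: positions 0-2 are the prompt, 3-5 the assistant completion.
ids  = [11, 12, 13, 21, 22, 23]
mask = [False, False, False, True, True, True]
labels = build_labels(ids, mask)
# labels == [-100, -100, -100, 21, 22, 23]
```

With turn-sliced dialogues, this masking ensures the NLL objective above is summed only over the assistant's reasoning trace and public response, never over user or system tokens.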
4. Dynamic Evaluation with DiaBENCH
4.1 Live Agentic Testing Loop
Each fine-tuned model is tested by deploying it as the assistant in the UTC-Gen loop, with a frozen user-proxy agent $P_{\theta_u}$. At every dynamic turn $t$, the assistant observes the cumulative dialogue history $\hat{\mathbf{h}}^a_t$ and candidate set $\mathcal{C}_k$; the user-proxy samples three candidate responses, using GPT-4o as a voter to minimize hallucination, and the interaction progresses until completion or truncation.
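The sample-then-vote step of the user proxy can be sketched generically. Both callables are caller-supplied stand-ins; in the paper the sampler is the user-proxy LLM and the voter is GPT-4o, whereas the toy voter below just prefers the shortest (least embellished) candidate:

```python
def user_proxy_turn(sample_response, vote, n_candidates: int = 3) -> str:
    """Sample several candidate user-proxy responses, then let an external
    voter pick one index to keep; the rest are discarded."""
    candidates = [sample_response() for _ in range(n_candidates)]
    return candidates[vote(candidates)]

# Toy stand-ins: sample from a fixed pool, vote for the shortest reply.
pool = iter(["I need a customer record created, please.",
             "Make a customer! Also book flights and check the weather!!",
             "Create a customer record."])
reply = user_proxy_turn(
    lambda: next(pool),
    lambda cs: min(range(len(cs)), key=lambda i: len(cs[i])),
)
```

Keeping the voter external to the sampler is the design point: a hallucinated or off-persona candidate can be rejected without re-rolling the whole dialogue.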
4.2 Metrics and Benchmarking
DiaBENCH provides both static (off-policy) and dynamic (on-policy) metrics:
- Tool-Calling: Accuracy Rate (Acc), False-Positive Tool-Call Rate (FTR), Tool-Call Abstention Rate (TAR), tool-level and argument-level Precision/Recall (TCP/TCR, PKP/PKR).
- Conversational: Relevancy via LLM rubric (ConvRel), Type–Token Ratio (TTR), n-Gram Diversity (NGD).
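The three headline tool-calling rates reported below can be computed from per-episode outcomes as follows. The field names are illustrative assumptions; the paper's exact definitions (e.g., whether argument errors count toward FTR) may differ:

```python
def tool_call_metrics(episodes: list[dict]) -> dict:
    """Compute Acc / FTR / TAR over a set of evaluation episodes.
    Each episode records whether a tool was called and, if so, whether
    the call matched the gold tool."""
    n = len(episodes)
    correct   = sum(e["called"] and e["correct"] for e in episodes)
    false_pos = sum(e["called"] and not e["correct"] for e in episodes)
    abstained = sum(not e["called"] for e in episodes)
    return {"Acc": correct / n,    # correct tool call issued
            "FTR": false_pos / n,  # a wrong tool was called
            "TAR": abstained / n}  # no tool call was made at all

# Toy run: 8 correct calls, 1 wrong call, 1 abstention out of 10 episodes.
eps = ([{"called": True,  "correct": True}]  * 8
     + [{"called": True,  "correct": False}] * 1
     + [{"called": False, "correct": False}] * 1)
m = tool_call_metrics(eps)
```

Note the trade-off these rates expose: a model can buy a low FTR by abstaining often, which the table below shows for the prompted proprietary baselines (high TAR, low FTR).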
On 119 out-of-domain, human-validated DiaBENCH cases, performance results are as follows:
| Model | Acc | FTR | TAR |
|---|---|---|---|
| Llama-3.2-DiaFORGE-3B | 0.80 | 0.08 | 0.06 |
| Llama-3.3-Nemotron-DiaFORGE-49B | 0.89 | 0.06 | 0.03 |
| GPT-4o-20241120 (CAPO-prompted) | 0.62 | 0.02 | 0.36 |
| Claude-3.5-Sonnet (optimized) | 0.39 | 0.03 | 0.55 |
DiaFORGE models exhibit tool-calling accuracy gains of +27 pp over GPT-4o (0.89 vs 0.62) and +49 pp relative to Claude-3.5-Sonnet (0.89 vs 0.39), both with optimized prompting. No significance tests are reported beyond CAPO's internal paired t-tests for prompt tuning, but the observed differences substantially exceed typical evaluation noise.
5. Dataset Release and Implementation Blueprint
The DiaFORGE dataset is publicly available and includes approximately 5,000 production-grade API specifications, validated multi-turn dialogues with disambiguation focus, private reasoning traces, user-proxy personas, goals, value maps, and corresponding validation scripts. The release is hosted at https://huggingface.co/datasets/sap-ai-research/diaforge-utc-r-0725.
The reproducibility steps are:
- Seed the API catalogue and persona hub (§2.1).
- Run UTC-Gen for dialogue synthesis or augmentation.
- Apply turn-slicing and LoRA-based SFT on chosen models.
- Evaluate using the DiaBENCH evaluation harness and metric scripts.
6. Significance, Limitations, and Future Research
DiaFORGE’s pipeline—combining disambiguation-centric synthetic data and dual-output fine-tuning—demonstrates substantial accuracy and robustness benefits over generic or proprietary LLMs in realistic, goal-based enterprise tool-calling environments. Dynamic evaluation with DiaBENCH exposes practical, multi-turn failure modes that static benchmarks do not reveal, including error cascades and premature tool calls.
Open research questions include extending DiaFORGE to multi-tool workflows, integrating RLHF/RLAIF for self-correction, and automating or reducing reliance on human intervention during dynamic evaluation. As all code and data are openly shared, DiaFORGE provides a practical, reproducible methodology for constructing, training, and validating LLM agents that must reliably identify, select, and invoke APIs in the presence of semantic ambiguity and incomplete input (Hathidara et al., 4 Jul 2025).