
DiaFORGE Pipeline: Robust API Invocation

Updated 14 February 2026
  • The DiaFORGE pipeline is a disambiguation-centric three-stage framework that improves LLM reliability in handling near-duplicate APIs and underspecified user inputs.
  • It synthesizes persona-driven, multi-turn dialogues and applies supervised fine-tuning with explicit reasoning traces to guide accurate API selection.
  • Dynamic evaluation via DiaBENCH reveals significant tool-calling accuracy gains over baseline models in complex enterprise scenarios.

DiaFORGE (Dialogue Framework for Organic Response Generation & Evaluation) is a disambiguation-centric, three-stage pipeline designed to enhance the reliability of LLMs in enterprise tool-calling scenarios characterized by near-duplicate APIs and frequent input underspecification. The framework specifically targets the limitations of off-policy, single-turn benchmarks, which fail to expose cascading errors arising from premature or incorrect API calls. By incorporating synthetic, persona-driven multi-turn dialogues, supervised fine-tuning with explicit reasoning traces, and stringent dynamic evaluation, DiaFORGE provides a comprehensive methodology for constructing and validating LLM-based agents capable of robust, goal-directed API invocation in complex enterprise settings (Hathidara et al., 4 Jul 2025).

1. Motivation and Architectural Overview

Modern enterprises expose large tool catalogues with thousands of APIs, many distinguished only by minor variants (e.g., CreateCustomer vs. CreateUser). User requests tend to be ambiguous or lack required arguments, challenging LLMs that must correctly select from highly similar tools and elicit missing information. Premature or incorrect API selection in these contexts can yield cascading failures, an error class not uncovered by static, single-turn off-policy benchmarks.

DiaFORGE addresses these challenges through a pipeline comprising:

  • Synthetic Dialogue Generation (Stage I) using persona-driven, multi-turn conversations that force disambiguation and argument elicitation.
  • Supervised Fine-Tuning (Stage II) of open-source LLMs, tightly coupling model outputs with reasoning traces.
  • Dynamic Evaluation Suite (Stage III; DiaBENCH) deploying the model in an agentic loop to measure real-world robustness and goal completion under live conditions.

2. Synthetic Dialogue Corpus Generation

2.1 Identifying Near-Duplicate Tools and Underspecified Arguments

The process begins with an enterprise tool catalogue

$\mathcal{T} = \{\,\tau_i=(\mathrm{name}_i,\ \mathrm{desc}_i,\ \mathrm{params}_i)\,\}_{i=1}^{|\mathcal{T}|}$

Each tool $\tau_i$ is associated with a JSON-Schema parameter set and a set of required slots $R(\tau_i)$. To construct realistic distractor sets, a frozen sentence encoder $\phi$ embeds tool descriptions. For a seed tool $\tau^\star$:

$\mathcal{D}_k(\tau^\star) = \operatorname{arg\,top\text{-}k}_{\tau\neq\tau^\star}\,\langle\phi(\tau^\star),\,\phi(\tau)\rangle$

The assistant is provided the candidate set $\mathcal{C}_k(\tau^\star)=\{\tau^\star\}\cup\mathcal{D}_k(\tau^\star)$.

Argument underspecification is enforced by incrementally revealing a subset of the gold slot-value map, $\mathcal{V}^\star=\{(r_i,v_{r_i})\mid r_i\in R(\tau^\star)\}$, such that the assistant must actively elicit the remaining required values.
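The distractor-retrieval step can be sketched as follows. A unit-norm bag-of-words vector stands in for the frozen sentence encoder $\phi$ (an assumption; the summary does not name a specific encoder), and the tool names and descriptions are invented for illustration:

```python
import numpy as np
from collections import Counter

def embed(text, vocab):
    """Toy stand-in for the frozen encoder phi: a unit-norm
    bag-of-words vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    v = np.array([counts[w] for w in vocab], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

def top_k_distractors(seed, tools, k, vocab):
    """D_k(tau*): the k tools (excluding the seed) whose description
    embeddings have the highest inner product with the seed's."""
    seed_vec = embed(seed["desc"], vocab)
    scored = [
        (float(np.dot(seed_vec, embed(t["desc"], vocab))), t["name"])
        for t in tools if t["name"] != seed["name"]
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

tools = [
    {"name": "CreateCustomer", "desc": "create a new customer record"},
    {"name": "CreateUser",     "desc": "create a new user record"},
    {"name": "DeleteOrder",    "desc": "delete an existing order"},
]
vocab = sorted({w for t in tools for w in t["desc"].split()})
seed = tools[0]
distractors = top_k_distractors(seed, tools, k=1, vocab=vocab)
print(distractors)  # the near-duplicate CreateUser ranks first
```

The candidate set $\mathcal{C}_k(\tau^\star)$ handed to the assistant is then the seed tool plus these distractors.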

2.2 Dialogue Synthesis via UTC-Gen

Dialogue synthesis alternates turns between a user-proxy agent (instantiated with a latent goal $g$ and persona $p$) and the assistant. At each turn $t$:

  • $u_t\sim P_{\theta_u}\bigl(\cdot\mid\tau^\star,p,g,\mathcal{D}_k,\mathcal{V}^\star,h^u_t\bigr)$
  • $a_t\sim P_{\theta_a}\bigl(\cdot\mid \mathcal{C}_k,h^a_t\bigr)$

Interaction continues until the assistant issues a schema-conformant tool call to $\tau^\star$ using exactly $\mathcal{V}^\star$, or until $t=T_{\max}$. The process is formalized in the pseudocode:

hu, ha = [], []                                         # user-proxy and assistant histories
for t in range(1, T_max + 1):
    u_t = sample_user(tau_star, p, g, D_k, V_star, hu)  # user-proxy turn
    hu.append(u_t)
    ha.append(u_t)
    a_t = sample_assistant(C_k, ha)                     # assistant turn
    ha.append(a_t)
    if is_valid_call(a_t, tau_star, V_star):            # schema-conformant call to tau* with V*?
        break
validate_and_accept(hu, ha)                             # run the validation cascade

2.3 Validation Cascade

To ensure data quality, each candidate dialogue is screened through:

  • Format Validator: Enforces alternating turns and correct JSON stub structure.
  • Tool-Call & Tool-Args Validator: Confirms correct API selection and full required argument coverage.
  • Relevancy Validator & LLM Critique: Ensures persona coherence and proper dialogue structure.

Dialogues failing any validation gate are discarded.
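The gates above behave as a short-circuiting chain of predicate checks. A minimal sketch, assuming dialogues are lists of role-tagged dicts with JSON-string tool-call stubs (the relevancy/LLM-critique gate, which calls a judge model, is stubbed out):

```python
import json

def valid_format(dialogue):
    # Alternating user/assistant turns, and every tool_calls stub parses as JSON.
    roles_ok = all(t["role"] == ("user" if i % 2 == 0 else "assistant")
                   for i, t in enumerate(dialogue))
    stubs_ok = True
    for t in dialogue:
        for stub in t.get("tool_calls", []):
            try:
                json.loads(stub)
            except (TypeError, json.JSONDecodeError):
                stubs_ok = False
    return roles_ok and stubs_ok

def valid_tool_call(dialogue, gold_tool, gold_args):
    # Final assistant turn must call the gold tool with every required slot filled.
    final = dialogue[-1]
    calls = [json.loads(s) for s in final.get("tool_calls", [])]
    return any(c["name"] == gold_tool and
               all(k in c["arguments"] for k in gold_args) for c in calls)

def accept(dialogue, gold_tool, gold_args, critique=lambda d: True):
    # Short-circuit: a dialogue failing any gate is discarded.
    return (valid_format(dialogue)
            and valid_tool_call(dialogue, gold_tool, gold_args)
            and critique(dialogue))

dialogue = [
    {"role": "user", "content": "Add a new customer for me."},
    {"role": "assistant", "content": "Creating the customer now.",
     "tool_calls": ['{"name": "CreateCustomer", "arguments": {"name": "Ada"}}']},
]
accepted = accept(dialogue, "CreateCustomer", ["name"])
```

Here `critique` defaults to a pass-through; in the pipeline it would be the LLM-based relevancy and coherence judge.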

3. Supervised Fine-Tuning with Reasoning Traces

3.1 Model Families and Parameterization

The corpus of $N \approx 5{,}000$ validated multi-turn dialogues ($\approx 13{,}649$ assistant turns) is used to fine-tune open models:

  • Llama-3.2-3B
  • Gemma-3-4B, 12B, 27B
  • Llama-3.3-Nemotron-Super-49B
  • Llama-3.3-70B

3.2 Reasoning Trace Integration

Assistant turns pair a private reasoning trace with a public response (which may include "tool_calls" stubs). Both segments are supervised during fine-tuning, so the model learns to produce a private chain-of-thought followed by the public-facing utterance.
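A single supervised assistant turn can be pictured as a record with both segments; the field names and content below are illustrative, not the release's exact schema:

```python
# One supervised assistant turn: private reasoning plus the public response.
# Field names are illustrative; the released dataset may use different keys.
assistant_turn = {
    "reasoning": (
        "Both CreateCustomer and CreateUser match, but the user said "
        "'customer' and the required 'name' slot is still missing, so ask."
    ),
    "response": "Happy to help. What name should the new customer record use?",
    "tool_calls": [],  # stays empty until all required slots are elicited
}
```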

3.3 Training Setup

For each assistant turn $t$ in dialogue $d_i$, input/output pairs $(x_{i,t}, y_{i,t})$ are constructed:

$x_{i,t} = [\mathrm{SYS}]\;u^{(i)}_1\;a^{(i)}_1\;\dots\;u^{(i)}_t,\qquad y_{i,t} = a^{(i)}_t$

  • Loss masking is applied so that gradients are computed only on $y_{i,t}$ tokens.
  • Fine-tuning uses Low-Rank Adaptation (LoRA) with rank $r=16$, $\alpha=16$.
  • Training proceeds for a single epoch, using 8-bit precision, batch size 1.
  • Optimizer: AdamW with peak learning rate $1\times10^{-4}$, cosine schedule.
  • Objective: next-token negative log likelihood

$\mathcal{L}(\phi) = -\sum_{i=1}^{N}\sum_{t=1}^{T_i}\sum_{j\in \mathrm{gen\_tokens}} \log p_\phi\bigl(y_{i,t}^{\,j}\mid x_{i,t},\,y_{i,t}^{<j}\bigr)$
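Turn-slicing and loss masking can be sketched as follows. A toy whole-word tokenizer stands in for the model's real one, and the `-100` ignore-index follows the common Hugging Face convention (an assumption, since the summary does not name the training stack):

```python
IGNORE = -100  # conventional ignore-index: no gradient on these positions

def tok(text):
    """Toy whole-word tokenizer standing in for the model's real one."""
    return text.split()

def slice_dialogue(system, turns):
    """Yield one (input_tokens, labels) pair per assistant turn t:
    x = [SYS] u_1 a_1 ... u_t, y = a_t, with loss only on y's tokens."""
    examples = []
    for t, (role, text) in enumerate(turns):
        if role != "assistant":
            continue
        context = [system] + [txt for _, txt in turns[:t]]
        x = [w for seg in context for w in tok(seg)]
        y = tok(text)
        inputs = x + y
        labels = [IGNORE] * len(x) + list(y)  # mask every context token
        examples.append((inputs, labels))
    return examples

turns = [
    ("user", "add a customer"),
    ("assistant", "what name ?"),
    ("user", "Ada"),
    ("assistant", "calling CreateCustomer"),
]
ex = slice_dialogue("SYS", turns)
print(len(ex))  # one training example per assistant turn
```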

4. Dynamic Evaluation with DiaBENCH

4.1 Live Agentic Testing Loop

Each fine-tuned model $f_\phi$ is tested by deploying it as the assistant in the UTC-Gen loop, with a frozen user-proxy agent $P_{\theta_u}$. At every dynamic turn $t$, the assistant observes the cumulative dialogue history $\hat{h}^a_t$ and candidate set $\mathcal{C}_k$; the user-proxy samples three candidate responses, using GPT-4o as a voter to minimize hallucination, and the interaction progresses until completion or truncation.
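The best-of-three user-proxy sampling can be sketched with stubs; the voter below is a deterministic placeholder for the GPT-4o judge, and all sampler names are hypothetical:

```python
import random

def best_of_n_user_turn(sample_user, vote, n=3):
    """Sample n candidate user-proxy responses and keep the voter's pick,
    reducing the chance a single hallucinated turn derails the episode."""
    candidates = [sample_user() for _ in range(n)]
    return vote(candidates)

# Stubs standing in for the frozen user-proxy and the GPT-4o voter.
random.seed(0)
responses = [
    "Create the record for Ada.",
    "asdf qwerty",
    "Please make a customer named Ada.",
]
sample_user = lambda: random.choice(responses)
vote = lambda cands: max(cands, key=len)  # placeholder rubric: pick the longest

turn = best_of_n_user_turn(sample_user, vote)
```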

4.2 Metrics and Benchmarking

DiaBENCH provides both static (off-policy) and dynamic (on-policy) metrics:

  • Tool-Calling: Accuracy Rate (Acc), False-Positive Tool-Call Rate (FTR), Tool-Call Abstention Rate (TAR), tool-level and argument-level Precision/Recall (TCP/TCR, PKP/PKR).
  • Conversational: Relevancy via LLM rubric (ConvRel), Type–Token Ratio (TTR), $n$-Gram Diversity (NGD$_n$).
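The headline tool-calling rates can be computed from per-episode outcomes; the three-way outcome encoding below is illustrative, and the paper's exact metric definitions may differ in edge cases:

```python
def tool_calling_rates(outcomes):
    """Per-episode outcomes: 'correct' (right tool, right args),
    'wrong_call' (called the wrong tool, or called when it should not have),
    'abstain' (never issued the required call). Illustrative definitions."""
    n = len(outcomes)
    acc = outcomes.count("correct") / n      # Accuracy Rate (Acc)
    ftr = outcomes.count("wrong_call") / n   # False-Positive Tool-Call Rate (FTR)
    tar = outcomes.count("abstain") / n      # Tool-Call Abstention Rate (TAR)
    return acc, ftr, tar

acc, ftr, tar = tool_calling_rates(["correct"] * 8 + ["wrong_call"] + ["abstain"])
print(acc, ftr, tar)  # 0.8 0.1 0.1
```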

On 119 out-of-domain, human-validated DiaBENCH cases, performance results are as follows:

| Model | Acc | FTR | TAR |
|---|---|---|---|
| Llama-3.2-DiaFORGE-3B | 0.80 | 0.08 | 0.06 |
| Llama-3.3-Nemotron-DiaFORGE-49B | 0.89 | 0.06 | 0.03 |
| GPT-4o-20241120 (CAPO-prompted) | 0.62 | 0.02 | 0.36 |
| Claude-3.5-Sonnet (optimized) | 0.39 | 0.03 | 0.55 |

DiaFORGE models exhibit tool-calling accuracy gains of +27 pp over GPT-4o (e.g., 0.89 vs 0.62) and +49 pp relative to Claude-3.5-Sonnet (0.89 vs 0.39) with optimized prompting. No significance tests are reported beyond CAPO's internal paired $t$-tests for prompt tuning, but observed differences substantially exceed typical evaluation noise.

5. Dataset Release and Implementation Blueprint

The DiaFORGE dataset is publicly available and includes approximately 5,000 production-grade API specifications, validated multi-turn dialogues with disambiguation focus, private reasoning traces, user-proxy personas, goals, value maps, and corresponding validation scripts. The release is hosted at https://huggingface.co/datasets/sap-ai-research/diaforge-utc-r-0725.

The reproducibility steps are:

  1. Seed the API catalogue and persona hub (§2.1).
  2. Run UTC-Gen for dialogue synthesis or augmentation.
  3. Apply turn-slicing and LoRA-based SFT on chosen models.
  4. Evaluate using the DiaBENCH evaluation harness and metric scripts.

6. Significance, Limitations, and Future Research

DiaFORGE’s pipeline—combining disambiguation-centric synthetic data and dual-output fine-tuning—demonstrates substantial accuracy and robustness benefits over generic or proprietary LLMs in realistic, goal-based enterprise tool-calling environments. Dynamic evaluation with DiaBENCH exposes practical, multi-turn failure modes that static benchmarks do not reveal, including error cascades and premature tool calls.

Open research questions include extending DiaFORGE to multi-tool workflows, integrating RLHF/RLAIF for self-correction, and automating or reducing reliance on human intervention during dynamic evaluation. As all code and data are openly shared, DiaFORGE provides a practical, reproducible methodology for constructing, training, and validating LLM agents that must reliably identify, select, and invoke APIs in the presence of semantic ambiguity and incomplete input (Hathidara et al., 4 Jul 2025).
