
Multi-Turn Tool-Calling LLMs: Advances & Challenges

Updated 6 December 2025
  • Multi-turn tool-calling LLMs are agentic architectures that alternate between generating text and invoking external tools, enabling adaptive multi-step dialogues.
  • They tackle challenges like dynamic memory management, temporal context handling, and error mitigation, critical for real-world automation.
  • Benchmark studies show improvements in planning, tool selection, and execution accuracy, underscoring the scalability of these systems in diverse applications.

Multi-turn tool-calling LLMs are agentic architectures that alternately produce text and structured tool invocations over extended user–assistant dialogues, tightly coupling language understanding, tool-use decision-making, memory, and planning. These systems are critical for robust automation in complex workflows—spanning search, data processing, transactional business logic, and embodied tasks—where a single step rarely suffices, and tool API use must be adapted dynamically as information and requirements evolve across multiple turns.

1. Formal Characterization and Multi-Turn Tool-Calling Challenges

A multi-turn tool-calling LLM orchestrates a dialogue loop in which, at each turn n, given the dialogue history H_n (past messages and tool results), it determines whether and how to invoke external APIs via structured function calls. This is described as a learned mapping

d_n = f_{\theta}(H_n, \text{context}) \in \{0,1\}^{K}

where K is the number of candidate actions (including tool calls) and f_{\theta} is the LLM-based policy. For tool use to be effective, the model must manage:

  • Temporal context (i.e., elapsed time between tool outputs and user queries; see Section 5);
  • Context window limitations, requiring dynamic memory and summarization;
  • Dynamic tool catalog management (selection, addition, and removal);
  • Reasoning over dependencies, state, and long-range effects;
  • Resilience to errors, ambiguous user assertions, and adversarial conditions.
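The dialogue loop described above can be sketched in a few lines of Python. Here `call_llm` and `execute_tool` are hypothetical stand-ins for a model backend and a tool runtime, not any specific framework's API; the placeholder policy is illustrative only.

```python
# Minimal sketch of a multi-turn tool-calling loop (illustrative only).
# `call_llm` and `execute_tool` are hypothetical stand-ins, not a real API.

def call_llm(history):
    # Placeholder policy: if the last message is a user question mentioning
    # "weather", emit a structured tool call; otherwise answer in text.
    last = history[-1]
    if last["role"] == "user" and "weather" in last["content"]:
        return {"type": "tool_call", "name": "get_weather", "args": {"city": "Paris"}}
    return {"type": "text", "content": "Done."}

def execute_tool(name, args):
    # Placeholder tool runtime.
    return {"name": name, "result": f"sunny in {args.get('city', '?')}"}

def run_turn(history, user_msg, max_tool_calls=5):
    """One dialogue turn: alternate tool calls and text until the model answers."""
    history.append({"role": "user", "content": user_msg})
    for _ in range(max_tool_calls):
        action = call_llm(history)
        if action["type"] == "text":
            history.append({"role": "assistant", "content": action["content"]})
            return action["content"]
        out = execute_tool(action["name"], action["args"])
        history.append({"role": "tool", "content": str(out)})
    return None  # tool-call budget exhausted

history = []
answer = run_turn(history, "What is the weather?")
```

Real agent loops add exactly the ingredients listed above on top of this skeleton: history compression, catalog management, and error handling around each `execute_tool` call.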

Empirical studies reveal that stateless or single-turn approaches do not generalize: extending dialogue length and tool complexity dramatically amplifies error propagation, context forgetting, and the combinatorial action space (Kate et al., 30 Apr 2025, Wang et al., 19 May 2025, Xu et al., 28 Oct 2025).

2. Data Synthesis, Benchmarking, and Evaluation Suites

Benchmarking multi-turn tool-calling LLMs requires datasets reflecting compositionality, tool inter-dependencies, argument schema complexity, and temporal depth. State-of-the-art datasets and evaluation frameworks include:

  • T1 Dataset (Chakraborty et al., 22 May 2025): 13.5K dialogues across nine domains, each annotated for inter-tool dependency (G=(V,E) with 14 tools, 12 dependencies), caching, and dynamic replanning, with metrics for task success, planning accuracy, and tool call efficiency.
  • DialogTool (Wang et al., 19 May 2025): 900 multi-stage dialogues, evaluating tool creation, awareness, selection, execution, and role-consistent response. Execution occurs in a simulated mobile environment ("VirtualMobile") that enforces hierarchical App → API → arguments structure.
  • BFCL-v3/v4 (Yin et al., 10 Mar 2025, Xu et al., 28 Oct 2025, Waqas et al., 29 Nov 2025): Large-scale function-calling leaderboards with single-turn and multi-turn partitions, including metrics for compliance with human-preferred trajectories and safety indicators (sycophancy, assertion-conditioned compliance).
  • TicToc-v1 (Cheng et al., 27 Oct 2025): Explicitly tests temporal awareness across 34 scenarios with time-sensitive tool call requirements, quantifying alignment of agent decisions to human preferences as a function of elapsed time.

Table: Key Multi-Turn Tool-Calling Benchmarks

| Benchmark | Focus | Turns/Dialog | Unique Tools | Agentic Metrics |
|---|---|---|---|---|
| T1 | Planning, inter-tool deps | 8–11 | 14 | TSR, PA, TCE |
| DialogTool | Stateful, multi-stage API | 17.3 | 15/30 | Awareness, Exec, Role-consistency |
| BFCL-v3/v4 | Real-world API, robustness | 4+ | 1000+ | Top-1 Acc, Compliance, Sycophancy |
| TicToc-v1 | Temporal awareness | 3–7 | 34 scenarios | Pref. norm. alignment, timing error |

Compositional data construction, such as BUTTONInstruct (Chen et al., 2024) and Magnet (Yin et al., 10 Mar 2025), emphasizes multi-turn, multi-function task structure and explicitly generates sequences with dependencies, parallel and sequential subgoals, and (in Magnet) positive/negative preference signals for Direct Preference Optimization (DPO).

3. Model Architectures, Memory, and Scheduling for Multi-Turn Agent Loops

Modern multi-turn agents interleave language generation and tool API invocation, imposing strict architectural and systems-level constraints:

  • Memory Management & Caching: Context windows, even at 128K+ tokens, become easily saturated as the agent accumulates dialogue and tool signatures. Mechanisms such as dynamic cache with reuse/recompute heuristics (T1 (Chakraborty et al., 22 May 2025)), short-term tool memory management (MemTool (Lumer et al., 29 Jul 2025)), and explicit tool addition/removal mechanisms are critical. MemTool introduces three architectures: agentic self-pruning, deterministic workflow pruning, and hybrid strategies, with high-performing agents maintaining tool context well under hard API slot limits and improving tool-calling accuracy to 0.90 (Lumer et al., 29 Jul 2025).
  • Summarization-based Context Management: To scale RL training and inference to arbitrarily long horizons, SUPO (Lu et al., 8 Oct 2025) periodically compresses past history into model-generated summaries, keeping only the most recent turns and learned task-relevant state. This enables effective working context smaller than total trajectory length and boosts end-to-end success (14% on BrowseComp-Plus).
  • KV Cache and System Scheduling: Continuity across multi-turn agent loops is further improved by agent-aware caching (e.g., Continuum (Li et al., 4 Nov 2025)). Continuum pins KV cache with per-tool time-to-live (TTL), scheduling to minimize "bubbles" (idle wait times) and cache misses, effectively improving job completion times by up to 3–4×.
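The summarization strategy described for SUPO can be sketched as follows. `summarize` is a hypothetical stand-in for a model-generated summary (here it merely concatenates truncated messages), and the recent-window size is an assumed constant.

```python
# Sketch of summarization-based context management (SUPO-style idea):
# keep only the most recent turns verbatim and fold older ones into a
# running summary, so working context stays smaller than the trajectory.
# `summarize` stands in for a model call; here it concatenates snippets.

RECENT_WINDOW = 3  # number of recent turns kept verbatim (assumed)

def summarize(old_summary, evicted_msgs):
    snippets = [m["content"][:20] for m in evicted_msgs]
    return (old_summary + " | " + "; ".join(snippets)).strip(" |")

def compress_history(summary, turns):
    """Evict turns beyond the recent window into the summary."""
    if len(turns) <= RECENT_WINDOW:
        return summary, turns
    evicted, kept = turns[:-RECENT_WINDOW], turns[-RECENT_WINDOW:]
    return summarize(summary, evicted), kept

summary, turns = "", []
for i in range(6):
    turns.append({"role": "user", "content": f"message {i}"})
    summary, turns = compress_history(summary, turns)
```

After six turns only the last three remain verbatim; the rest survive only through the summary, which is what makes the effective working context independent of total trajectory length.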

4. Training Paradigms: RL, Alignment, and Supervised Fine-Tuning

Multi-turn tool-calling performance hinges on the interplay between supervised data, reinforcement learning (RL), and explicit alignment objectives:

  • RL with Fine-Grained Credit Assignment: Turn-level reward assignment is crucial, as evidenced by GTPO (Ding et al., 18 Nov 2025). Whereas traditional RL pipelines (e.g., Group Relative Policy Optimization, GRPO) assign only terminal rewards, GTPO computes per-turn rewards (accuracy, format, self-supervised partial correctness) and normalizes advantages, yielding 3% absolute improvement in average passing rate on mathematical tool-integrated datasets.
  • Agentic RL for Tool Use and Planning: ARTIST (Singh et al., 28 Apr 2025) jointly teaches reasoning, tool selection, and invocation, with policy

J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=1}^{T} r(\tau) \right]

and compositional rollouts alternating internal "think" chains and tool calls. Dynamic tool invocation adapts tool use to problem complexity. MUA-RL (Zhao et al., 26 Aug 2025) further integrates LLM-simulated users into the RL loop, exposing the agent to stochastic, evolving user policies and improving robustness across diverse target domains.
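As a small numeric illustration of group-normalized credit assignment (a GRPO-style sketch, not the GTPO or ARTIST implementation), the following sums assumed per-turn rewards per rollout and normalizes total returns across the rollout group; all reward values are made up.

```python
# Illustrative sketch of group-normalized advantages: each rollout carries
# per-turn rewards (e.g., accuracy, format); total returns are normalized
# across the group. Reward values are invented for illustration.

def group_normalized_advantages(turn_rewards):
    """turn_rewards: list of per-rollout lists of per-turn rewards."""
    totals = [sum(r) for r in turn_rewards]
    mean = sum(totals) / len(totals)
    var = sum((t - mean) ** 2 for t in totals) / len(totals)
    std = var ** 0.5 or 1.0  # guard against zero variance
    return [(t - mean) / std for t in totals]

group = [
    [1.0, 0.5, 1.0],  # rollout 1: per-turn rewards
    [0.0, 0.5, 0.0],  # rollout 2
]
adv = group_normalized_advantages(group)  # [1.0, -1.0]
```

Turn-level schemes such as GTPO refine this by computing the reward per turn rather than only over the whole trajectory, so that early useful tool calls in a failed rollout still receive credit.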

  • Supervised and Preference-based Optimization: Magnet (Yin et al., 10 Mar 2025) and FunReason-MT (Xu et al., 28 Oct 2025) use graph-grounded planning and DPO to preferentially optimize multi-turn trajectories. Disambiguation-centric fine-tuning with persona-driven data (DiaFORGE (Hathidara et al., 4 Jul 2025)) addresses tool confusion, while BUTTON (Chen et al., 2024) constructs bottom-up compositional tuning sets.
  • Knowledge Boundary and Efficiency Alignment: Frameworks for knowledge boundary estimation and dynamic tool-use decisions (consistency-based, and absolute estimation) are introduced in (Xu et al., 9 Mar 2025), optimizing utility

\mathrm{Utility} = \mathrm{Acc} - \alpha \cdot \mathrm{TR}

by training models to refrain from unnecessary tool use and only invoke APIs when uncertainty warrants it.
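A minimal sketch of this utility trade-off, with an assumed confidence threshold and cost weight α (the threshold, weight, and confidence values are illustrative, not from the paper):

```python
# Sketch of the Utility = Acc - alpha * TR trade-off: invoke a tool only
# when model confidence falls below a threshold, trading accuracy against
# tool-call rate (TR). All numbers are illustrative assumptions.

ALPHA = 0.3  # cost weight on tool-call rate (assumed)

def utility(accuracy, tool_rate, alpha=ALPHA):
    return accuracy - alpha * tool_rate

def should_call_tool(confidence, threshold=0.7):
    """Refrain from tool use when the model is already confident."""
    return confidence < threshold

decisions = [should_call_tool(c) for c in (0.9, 0.4, 0.65, 0.8)]
u = utility(accuracy=0.85, tool_rate=sum(decisions) / len(decisions))
```

In practice the confidence signal comes from knowledge-boundary estimation (consistency-based or absolute), and the threshold is tuned so that the marginal accuracy gain of a call exceeds its α-weighted cost.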

5. Temporal Sensitivity and Alignment with Human Preferences

A central challenge established by (Cheng et al., 27 Oct 2025) is "temporal blindness": multi-turn LLM agents by default lack awareness of the real-world time lapses between messages, often leading to stale-context reliance or redundant tool calls. The TicToc-v1 benchmark encodes context with ISO 8601 timestamps and evaluates agents on their alignment with human-preferred tool-use policies across varying temporal gaps (\Delta t). Without explicit time information, leading models align with human preference only slightly above random (just over 60%); augmenting with timestamps boosts this to just 65%, indicating that prompt engineering alone offers only marginal improvement and that post-training alignment (e.g., supervised fine-tuning or direct preference optimization on temporally-augmented data) is required.

In formal terms, the alignment metric used is

A_{\mathrm{norm}} = \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TN}{TN+FP}\right)

where TP and TN are true positive/negative decisions on prefer-Tool and prefer-noTool samples, respectively. The observed weak effect of simple prompting underscores the necessity of explicit temporal modeling during training for robust, human-aligned tool use in dynamic environments.
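The metric is straightforward to compute from confusion counts: it averages recall on the prefer-Tool and prefer-noTool partitions, so a policy that always (or never) calls tools scores exactly 0.5. The counts below are illustrative.

```python
# Normalized alignment metric: mean of per-class recall (TPR and TNR),
# computed over prefer-Tool and prefer-noTool samples respectively.

def a_norm(tp, fn, tn, fp):
    tpr = tp / (tp + fn)  # recall on prefer-Tool samples
    tnr = tn / (tn + fp)  # recall on prefer-noTool samples
    return 0.5 * (tpr + tnr)

score = a_norm(tp=30, fn=10, tn=20, fp=20)  # illustrative counts -> 0.625
```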

6. Robustness, Safety, and Compliance Failures

As multi-turn tool-calling LLMs are deployed in safety-critical and high-stakes decision-making, assertion-conditioned compliance (A-CC) emerges as a key vulnerability (Waqas et al., 29 Nov 2025). Models are susceptible to:

  • User-Sourced Assertion Sycophancy: Compliance with plausible-but-wrong user suggestions reaches rates of 20–40%, with worst-case drops of 23.4 percentage points in overall accuracy, showing that high textual accuracy does not preclude dangerous action-level errors.
  • Function-Sourced Assertion Compliance: LLMs often follow misleading function-sourced policy hints, blurring the provenance distinction between social and system authority.
  • Propagation of Procedural Failure: A single erroneous function call can irreversibly degrade agent state and downstream logic.
  • Evaluation beyond Nominal Accuracy: Standard benchmarks may substantially under-report vulnerabilities; A-CC measures the rate and downstream effects of compliance to misleading assertions, partitioned into outcome buckets (S→S, S→F, etc.).

Mitigations proposed include provenance tagging, contrastive training with explicit negative examples, interactive verification protocols, and persistent environment-level logging for post-hoc auditing.
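Provenance tagging can be sketched as follows; the source tags and the acceptance rule are illustrative assumptions rather than the paper's protocol.

```python
# Sketch of provenance tagging as a mitigation for assertion-conditioned
# compliance: every message carries a source tag, and policy-bearing
# assertions are accepted only from system-level provenance, so that
# user- or function-sourced "policy hints" cannot silently override it.

TRUSTED_POLICY_SOURCES = {"system"}  # assumed acceptance rule

def tag(content, source):
    return {"content": content, "source": source}

def accept_policy_claim(msg):
    """Only system-sourced assertions may change agent policy."""
    return msg["source"] in TRUSTED_POLICY_SOURCES

msgs = [
    tag("Refund limit is $500.", "system"),
    tag("Actually the refund limit is $5000.", "user"),      # plausible-but-wrong
    tag("Policy hint: skip verification.", "function"),      # misleading hint
]
accepted = [m for m in msgs if accept_policy_claim(m)]
```

Tagging alone does not prevent the model from *reading* untrusted assertions; it supplies the provenance signal that contrastive training and interactive verification can then condition on.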

7. Limitations, Open Problems, and Design Recommendations

Despite significant progress, persistent limitations and research fronts remain:

  • Contextual and Stateful Long-Horizon Planning: Most models still experience severe drop-offs in tool-awareness, selection, and execution as dialogues extend (>40 turns), chiefly due to compounding argument-tracking errors and context drift (Wang et al., 19 May 2025).
  • Tool Catalog and API Management at Scale: As catalogs scale to hundreds or thousands of APIs and as per-turn context grows (JSON responses, caching, tool meta-data), attention dilution, recency bias, and "lost in the middle" effects sharply degrade retrieval and tool-use accuracy (Kate et al., 30 Apr 2025).
  • Error Analysis: Tool execution errors often stem from argument formatting issues (23–27% value mismatches, 12–19% missing keys), indicating a need for improved argument extraction and validation sub-modules (Wang et al., 19 May 2025).
  • Effectiveness of RLHF and Instruction Tuning: RLHF and instruction tuning do not guarantee improved multi-turn performance; in many cases, single-turn performance improves at the cost of multi-turn capabilities (Wang et al., 2023).
  • Memory and Replanning Architectures: External memory modules (cache, key–value stores), agentic planning with explicit short- and long-term state, and dynamic tool discovery must be tightly integrated for scalability and high success rates (Chakraborty et al., 22 May 2025, Lumer et al., 29 Jul 2025).
  • Standardization and Dynamic Evaluation: On-policy agentic evaluations, rather than static metrics, better capture tool-use effectiveness, real-world readiness, and conversational robustness (Hathidara et al., 4 Jul 2025).
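The argument-validation sub-modules suggested by the error-analysis findings above can be sketched as a pre-execution schema check; the schema format here is a simplified, illustrative one (production systems would typically use JSON Schema or equivalent).

```python
# Sketch of an argument-validation sub-module: check a proposed tool call
# against a per-tool schema before execution, catching the missing-key and
# value-mismatch errors that dominate tool-execution failures.

SCHEMA = {
    "get_weather": {"city": str, "units": str},  # illustrative schema
}

def validate_call(name, args):
    """Return a list of problems; an empty list means the call is valid."""
    schema = SCHEMA.get(name)
    if schema is None:
        return [f"unknown tool: {name}"]
    problems = [f"missing key: {k}" for k in schema if k not in args]
    problems += [
        f"bad type for {k}: {type(v).__name__}"
        for k, v in args.items()
        if k in schema and not isinstance(v, schema[k])
    ]
    return problems

errs = validate_call("get_weather", {"city": 42})
```

Rejected calls can be routed back to the model with the problem list, turning a silent execution failure into a repairable turn.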

Best practices for implementation reflect these observations: structured prompting with state and plan tags; dynamic tool and memory management; explicit temporal modeling; and continual preference-based fine-tuning with comprehensive, human-annotated, compositional data (Chakraborty et al., 22 May 2025, Xu et al., 28 Oct 2025, Cheng et al., 27 Oct 2025, Yin et al., 10 Mar 2025, Zhao et al., 26 Aug 2025).


In summary, the current state of multi-turn tool-calling LLMs is characterized by rapid progress in dataset construction, algorithm design, system integration, and multifaceted evaluation, with outstanding challenges in temporal alignment, robustness to erroneous assertions, long-context scalability, and effective agentic planning (Yin et al., 10 Mar 2025, Kate et al., 30 Apr 2025, Cheng et al., 27 Oct 2025, Xu et al., 28 Oct 2025, Waqas et al., 29 Nov 2025). Developing reliable, real-world multi-turn tool-calling agents will require coordinated advances across these dimensions, as highlighted by ongoing benchmarks, data synthesis pipelines, and RL-based alignment frameworks in the contemporary literature.

