Iterative Tool Use and Extensibility
- Iterative tool use and extensibility is an emerging paradigm that enables LLMs and agents to sequentially invoke, integrate, and register external tools within a dynamic decision process.
- It employs techniques like multi-agent coordination, reinforcement learning, and curriculum strategies to optimize tool selection and performance in complex scenarios.
- Empirical studies demonstrate that these approaches boost accuracy, reduce operational costs, and enhance autonomous reasoning across diverse benchmarks.
Iterative tool use and extensibility refer to a class of methods, system architectures, and learning strategies that enable LLMs and multimodal agents to invoke external tools in a multi-step, context-sensitive, and compositional manner, and to seamlessly assimilate new tools or tool-use strategies without architectural overhaul or retraining. This paradigm is central to the construction of autonomous agents capable of complex reasoning, real-world task completion, and open-ended adaptation in evolving tool ecosystems. It integrates algorithmic innovations including multi-agent coordination, reinforcement learning, program synthesis, curriculum learning, and meta-level reflection, underpinned by empirical studies showing the necessity of diverse, iterative tool interactions for state-of-the-art performance across benchmarks.
1. Formal Definitions and Core Algorithms
Iterative tool use is formalized as a sequential decision process, wherein an agent maintains a state $s_t$ encoding the dialogue or problem history and intermediate outputs, and at each turn selects an action $a_t$, either a language generation step or a tool call, from a dynamically maintained or extensible toolset $\mathcal{T}_t$ (Cho et al., 13 Jan 2026, Li et al., 29 Dec 2025). Mathematically, this process is:

$$a_t = \pi(s_t), \qquad s_{t+1} = f(s_t, a_t, o_t),$$

where $\pi$ may be a policy in reinforcement learning settings or a program synthesizer in code-generation frameworks, and $o_t$ is the output returned by the invoked tool or environment.
The general iteration loop takes the form:
- Input: Problem specification $q$ (e.g., user question) and current state $s_t$.
- Selection: Action $a_t$ determined by $\pi(s_t)$.
- Execution: If $a_t$ is a tool call, invoke the selected tool from $\mathcal{T}_t$ with the arguments specified in $a_t$ and receive output $o_t$.
- Update: Advance to $s_{t+1}$ and possibly expand $\mathcal{T}_t$ to $\mathcal{T}_{t+1}$ to accommodate new tools or tool-usage templates.
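The loop above can be sketched as a minimal agent driver. All names here (`State`, `agent_loop`, the toy policy) are hypothetical illustrations, not from any cited framework; the policy is a hard-coded rule standing in for a learned model:

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class State:
    # s_t: accumulated dialogue / problem history, including tool outputs
    history: list = field(default_factory=list)

def agent_loop(question: str, tools: dict, policy: Callable, max_turns: int = 5):
    """Iterate: select an action a_t, execute tool calls, fold outputs into state."""
    state = State(history=[("user", question)])
    for _ in range(max_turns):
        action = policy(state, tools)                 # a_t = pi(s_t)
        if action["type"] == "tool_call":
            out = tools[action["name"]](action["args"])       # o_t
            state.history.append(("tool", action["name"], out))  # -> s_{t+1}
        else:                                         # language generation step
            state.history.append(("assistant", action["text"]))
            return action["text"]
    return None

# Toy policy: call the calculator once, then answer with its output.
def toy_policy(state, tools):
    last = state.history[-1]
    if last[0] == "tool":
        return {"type": "answer", "text": str(last[2])}
    return {"type": "tool_call", "name": "calc", "args": "2+3"}

# eval() is only safe here because the toy input is fixed.
answer = agent_loop("What is 2+3?", {"calc": lambda expr: eval(expr)}, toy_policy)
print(answer)  # → 5
```

Extensibility enters through the `tools` dict: because the policy only sees tool names and outputs, new entries can be added between turns without touching the loop.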
Extensibility is achieved by modular representations of tools (e.g., function signatures, JSON schemas) and dynamic tool registration mechanisms, permitting "plugin" integration or auto-discovery by agents (Shi et al., 2024, Deng et al., 31 Oct 2025, Cho et al., 13 Jan 2026).
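A JSON-schema tool specification plus a registry supporting dynamic registration might look like the following sketch. Field names follow common function-calling convention; the exact schema varies by framework, and `ToolRegistry` is a hypothetical minimal interface:

```python
import json

# A JSON-schema-style function specification of the kind agents consume.
weather_tool = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

class ToolRegistry:
    """Registry allowing plugin-style addition and removal of tool specs."""
    def __init__(self):
        self._specs = {}

    def register(self, spec: dict):
        self._specs[spec["name"]] = spec

    def unregister(self, name: str):
        self._specs.pop(name, None)

    def schema_for_prompt(self) -> str:
        # Serialized specs are injected into the agent's context, so newly
        # registered tools become visible without retraining.
        return json.dumps(list(self._specs.values()), indent=2)

registry = ToolRegistry()
registry.register(weather_tool)
print(registry.schema_for_prompt())
```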
2. Multi-Agent and Ensemble Architectures for Tool-Use Diversity
Heterogeneous ensemble frameworks such as TUMIX orchestrate parallel agents $A_1, \dots, A_N$, each pursuing distinct tool-use strategies (e.g., chain-of-thought, code interpreter, web search, hybrid) (Chen et al., 30 Sep 2025). TUMIX operates in iterative rounds:
- Each agent produces an initial answer conditioned solely on the question.
- Answers are shared across agents; in subsequent rounds, each agent refines its answer using both the question and the aggregate answer set from the previous round.
- An LLM-based "judge" assesses answer convergence, invoking an early-stopping criterion if consensus suffices.
- The final answer is aggregated via majority voting or weighted ensembling.
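The round structure can be sketched with stub agents and a stub judge (all names hypothetical; in the actual frameworks both the agents and the consensus judge are LLMs):

```python
from collections import Counter

def ensemble_rounds(question, agents, judge, max_rounds=3):
    """Each round, agents see all answers from the previous round;
    the judge triggers early stopping once answers converge."""
    answers = [a(question, prior=None) for a in agents]   # round 1: question only
    for _ in range(max_rounds - 1):
        if judge(answers):                                # early-stopping criterion
            break
        answers = [a(question, prior=answers) for a in agents]
    return Counter(answers).most_common(1)[0][0]          # majority vote

# Stub agents: each starts from a fixed answer, then follows the majority.
def make_agent(initial):
    def agent(question, prior=None):
        if prior is None:
            return initial
        return Counter(prior).most_common(1)[0][0]
    return agent

agents = [make_agent("42"), make_agent("42"), make_agent("7")]
judge = lambda answers: len(set(answers)) == 1            # consensus check
print(ensemble_rounds("q", agents, judge))  # → 42
```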
Key formalism includes soft-ensemble weights $w_i = \exp(q_i) / \sum_j \exp(q_j)$, updated via a softmax over answer quality scores $q_i$. Diversity of agent pathways and tool-use encouragement at each round are empirically validated to boost coverage and accuracy.
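Soft-ensemble aggregation with softmax weights over quality scores can be sketched as follows (the scores themselves are hypothetical; in practice they would come from an LLM judge or scoring model):

```python
import math
from collections import defaultdict

def soft_ensemble(answers, quality_scores):
    """Weight each answer by a softmax over its quality score,
    then return the answer with the highest total weight."""
    z = sum(math.exp(q) for q in quality_scores)
    weights = [math.exp(q) / z for q in quality_scores]   # w_i = e^{q_i} / sum_j e^{q_j}
    totals = defaultdict(float)
    for ans, w in zip(answers, weights):
        totals[ans] += w
    return max(totals, key=totals.get)

# One high-quality dissenting answer can outweigh two low-quality ones,
# unlike plain majority voting.
print(soft_ensemble(["A", "A", "B"], [0.1, 0.2, 3.0]))  # → B
```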
3. Reinforcement and Curriculum Learning for Strategic Iterative Tool Use
Reinforcement learning is leveraged to optimize tool invocation policies over multi-turn trajectories. Notable formulations include:
- ReTool's MDP: interleaves natural language and code execution steps, with environment state , action space including both text and tool invocation tokens, and sparse terminal rewards based on answer correctness (Feng et al., 15 Apr 2025).
- AdaReasoner/InfTool GRPO: Policies are optimized via Group Relative Policy Optimization (GRPO), where batches of trajectories are ranked by end-task rewards, and policy gradients are computed relative to the group mean (Li et al., 29 Dec 2025, Song et al., 26 Jan 2026).
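The group-relative advantage at the heart of GRPO can be sketched as follows: each trajectory's reward is standardized against the mean (and standard deviation) of its sampled group, so no learned value function is needed. The normalization details vary across implementations; this is one common form:

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-8):
    """Advantage of each trajectory = its reward standardized
    against the group mean and standard deviation."""
    mu = statistics.mean(group_rewards)
    sigma = statistics.pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# A group of 4 sampled trajectories with sparse terminal rewards (1 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
adv = grpo_advantages(rewards)
print(adv)  # correct trajectories get positive advantage, incorrect negative
```

These advantages then scale the per-token policy-gradient terms of each trajectory, so tool-call sequences that beat their group's average are reinforced.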
Curriculum-based strategies (e.g., Confucius, iTool) expose the model successively to:
- Simple (ground-truth) toolsets ("warm-up"),
- In-category distractors,
- Full cross-category tool libraries, with iterative self-instruct phases targeting model uncertainties (Gao et al., 2023, Zeng et al., 15 Jan 2025).
Monte Carlo Tree Search (MCTS)-based exploration, direct preference optimization, and introspection-driven data augmentation are used to uncover and rectify step-level ("fragment") deficiencies in tool-use trajectories, directly targeting observed process errors and enabling generalization to complex or novel tool configurations (Zeng et al., 15 Jan 2025, Gao et al., 2023).
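The three-stage curriculum escalation can be sketched as a schedule that widens the candidate toolset with training progress. Stage thresholds and distractor counts below are illustrative placeholders, not values from the cited papers:

```python
import random

def curriculum_toolset(step, gt_tools, same_category, cross_category,
                       warmup_steps=1000, mid_steps=3000, n_distractors=4):
    """Widen the candidate toolset as training progresses:
    warm-up -> in-category distractors -> full cross-category library."""
    if step < warmup_steps:
        return list(gt_tools)                          # stage 1: ground truth only
    if step < mid_steps:
        pool = [t for t in same_category if t not in gt_tools]
        distractors = random.sample(pool, min(n_distractors, len(pool)))
        return list(gt_tools) + distractors            # stage 2: in-category distractors
    return list(gt_tools) + cross_category             # stage 3: full library

gt = ["search_flights"]
same = ["search_flights", "book_flight", "cancel_flight"]
cross = ["get_weather", "translate", "calc"]
print(curriculum_toolset(500, gt, same, cross))        # → ['search_flights']
print(len(curriculum_toolset(5000, gt, same, cross)))  # → 4
```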
4. Plug-and-Play Extensibility and Dynamic Tool Registration
Horizontal extensibility is realized through standardized tool registries that support dynamic addition (and removal) of tools, often via:
- Auto-parsed documentation (OpenAPI, REST) transformed into machine-interpretable function specifications (Shi et al., 2024).
- Simple plugin interfaces: new tools require a name, a function handle ($\mathrm{invoke}: Q \to A$), and a description (Deng et al., 31 Oct 2025).
- Dynamic tool generation components or LLM-driven in-context "tool makers" capable of proposing new tool schemas and usage patterns, filtered via domain-specific scoring rubrics and integrated into the registry, as in user- and system-oriented simulators (Cho et al., 13 Jan 2026, Li et al., 29 Dec 2025).
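A plugin interface of this shape (name, invoke handle mapping query to answer, natural-language description) can be sketched as follows; `PluginRegistry` is a hypothetical minimal interface, not the API of any cited system:

```python
from typing import Callable

class PluginRegistry:
    """Dynamic tool registration: each tool is a name, an invoke handle
    (query -> answer), and a natural-language description."""
    def __init__(self):
        self.tools = {}

    def register(self, name: str, invoke: Callable[[str], str], description: str):
        self.tools[name] = {"invoke": invoke, "description": description}

    def call(self, name: str, query: str) -> str:
        return self.tools[name]["invoke"](query)

    def describe_all(self) -> str:
        # Fed into the agent's context so newly added tools are discoverable.
        return "\n".join(f"{n}: {t['description']}" for n, t in self.tools.items())

reg = PluginRegistry()
reg.register("echo", lambda q: q.upper(), "Echo the query in upper case.")
print(reg.call("echo", "hello"))  # → HELLO
```

Because the agent discovers tools only through `describe_all`, registering a new tool at runtime requires no change to the agent itself, which is the one-line-plugin property the text describes.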
Empirical evidence demonstrates that such modularity enables rapid onboarding of previously unseen APIs: black-box probing and auto-documentation successfully enable execution of arbitrary new endpoints as long as documentation conforms to known schemas (Shi et al., 2024), and models trained with randomized tool identifiers and paraphrased descriptions exhibit strong zero-shot generalization (Song et al., 26 Jan 2026).
5. Empirical Results, Coverage, and Cost Analyses
Iterative, extensible tool-use frameworks consistently outperform static or single-agent approaches on standard leaderboards:
- TUMIX: Delivers +3.55% over best tool-augmented scaling baseline on Gemini-2.5-Pro, exceeding 32% accuracy on HLE and up to 96.7% on AIME with cost reductions to ~49% via adaptive halting (Chen et al., 30 Sep 2025).
- AutoTools: Demonstrated 89% and 79% success rates on RestBench-TMDB and Spotify, respectively, and 60% on ToolFlow—surpassing all prior LLM-agent frameworks (Shi et al., 2024).
- InfTool: Boosted Qwen2.5-32B performance from 19.8% to 70.9% on BFCL, entirely with synthetic data and achieving parity or superiority versus much larger proprietary systems (Li et al., 29 Dec 2025).
- AdaReasoner: Increased open-source MLLM performance by +24.9%, with zero-shot tool extensibility yielding accuracy gains of +45% or more on new tasks and tools (Song et al., 26 Jan 2026).
Crucially, ablations reveal that agent diversity, programmatic feedback/reflection, and curriculum escalation are each necessary for reliable generalization and robust acquisition of tool competence at scale.
6. Limitations, Open Challenges, and Future Directions
While iterative tool use and extensibility have established superior empirical performance, current frameworks face recognized constraints:
- Simulation-Reality Gap: User simulators lack linguistic ambiguity and breadth, resulting in brittleness under real-world deployment unless supplemented with human-in-the-loop data (Li et al., 29 Dec 2025, Cho et al., 13 Jan 2026).
- Context and Memory Bounds: Self-reflection and multi-agent message passing degrade for dialogues exceeding ten turns; memory-augmented solutions are under investigation (Li et al., 29 Dec 2025).
- Compositional Complexity: Most contemporary policies support only serial tool calls per turn; parallel and nested tool scheduling remain open problems (Song et al., 26 Jan 2026).
- Reward Sparsity: RL approaches (ReTool, AdaReasoner) may suffer from sparse credit assignment; enriched intermediate and compositional rewards are under study (Feng et al., 15 Apr 2025).
- Multi-Modality and Generalization: While vision+language interfaces are now robust (ToolScope, AdaReasoner), extension to audio, stateful systems, and on-the-fly schema discovery is only partially addressed (Deng et al., 31 Oct 2025, Song et al., 26 Jan 2026, Li et al., 29 Dec 2025).
Future research directions include hierarchical planner–subagent decompositions, online dynamic retriever learning, robust cross-domain tool generalization studies, and persistent, continual RL for non-stationary, evolving tool ecosystems.
7. Comparative Table: Principal Frameworks and Extensibility Mechanisms
| Framework | Tool Modality | Extensibility Mechanism |
|---|---|---|
| TUMIX (Chen et al., 30 Sep 2025) | Text, Code, Search | LLM-driven auto-agent design, live tool pool |
| AutoTools (Shi et al., 2024) | REST, code, black-box | Doc parsing, automatic probing, plugin |
| ToolScope (Deng et al., 31 Oct 2025) | Vision, Text, Code | Registry interface, one-line plugin |
| AdaReasoner (Song et al., 26 Jan 2026) | Vision, JSON APIs | Randomized names/descriptions; zero-shot |
| InfTool (Li et al., 29 Dec 2025) | JSON schema APIs | Multi-agent synthesis, schema-based parsing |
| Confucius (Gao et al., 2023) | Text, code, APIs | Curriculum + self-instruct feedback loop |
| iTool (Zeng et al., 15 Jan 2025) | Arbitrary tool chains | MCTS + iterative preference optimization |
| ReTool (Feng et al., 15 Apr 2025) | Code interpreter | Tag/action extension in RL environment |
These approaches, instantiated across language and multimodal agents, operationalize iterative reasoning cycles and seamless extensibility, underpinning the current state-of-the-art in agentic tool use.