Reasoning-Then-Tool-Call Framework
- The paper introduces the Reasoning-Then-Tool-Call paradigm that decouples meta-level planning from object-level execution to enhance LLM performance.
- It demonstrates a structured methodology where LLMs plan tool calls systematically, yielding improved accuracy and robust error handling across diverse domains.
- Empirical evaluations show high tool-call precision and recall, emphasizing the paradigm’s potential in complex tasks like tabular Q&A, bioinformatics, and agentic reasoning.
A Reasoning-Then-Tool-Call paradigm is a structured approach in LLM systems in which an explicit separation is maintained between the process of decomposing complex tasks into high-level plans (meta-level reasoning) and the subsequent execution of these subcomponents via external tools (object-level reasoning). This architecture operationalizes LLM reasoning as “thinking first, then acting,” where “acting” typically involves orchestrated API calls, symbolic computation, data retrieval, or other environment interactions. The paradigm supports faithful multi-step question answering, robust decision making, and principled error handling, and is realized across diverse domains—including tabular data, bioinformatics, multimodal reasoning, mathematics, and agentic tool-calling environments.
1. Core Principles: Meta-Level and Object-Level Reasoning
Reasoning-Then-Tool-Call frameworks distinguish between two intertwined levels of computation:
- Meta-level reasoning refers to high-level planning: breaking down a user query $Q$ into a sequence of sub-tasks $(t_1, \dots, t_n)$, where each $t_i$ specifies the identifier of a tool and its arguments. This stage synthesizes a plan leveraging chain-of-thought and tool-selection logic.
- Object-level reasoning entails executing the planned sub-tasks. Each $t_i$ is dispatched to the corresponding tool $f_{t_i}$, producing output $o_i$. These outputs inform subsequent meta-level steps or are aggregated into the final answer.
Formally, the interaction can be notated as:
- Meta-planning: $Q \mapsto (t_1, \dots, t_n)$
- Execution: $o_i = f_{t_i}(\text{args}_i)$ for all $i \in \{1, \dots, n\}$
- Composition of the final answer: $a = g(o_1, \dots, o_n)$
This paradigm is consistent with proof-planning traditions and has been instantiated with explicit “tool calls” in modern LLM frameworks, achieving modularity and interpretability (Ferguson et al., 12 Jan 2026).
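To ground the notation, here is a minimal sketch of the meta-plan/execute/compose pipeline on a toy arithmetic query; the `multiply` tool and the query itself are illustrative, not from the paper:

```python
# Hypothetical two-step illustration of the formal notation above.
# Query Q: "What is 17% of 2,450?"
tools = {"multiply": lambda x, y: x * y}               # object-level tool f_t

plan = [("multiply", (2450, 0.17))]                    # meta-planning: Q -> (t_1, ..., t_n)
outputs = [tools[name](*args) for name, args in plan]  # execution: o_i = f_{t_i}(args_i)
answer = outputs[-1]                                   # composition: a = g(o_1, ..., o_n)
```

In a full system the plan would be produced by the LLM rather than hard-coded, but the three stages map one-to-one onto the equations above.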
2. Algorithmic Instantiation and Evaluation
A prototypical Reasoning-Then-Tool-Call system cycles between model-driven proposal and environment-mediated execution:
```python
def reasoning_then_tool_call(system_prompt, user_q, llm, tools, execute):
    S = [system_prompt, user_q]          # conversation state
    C = []                               # planned tool calls (meta-level plan)
    a = None                             # final answer
    while a is None:
        T_pred = llm(S)                  # propose tool calls
        for t_call in T_pred:
            if t_call.name in tools:
                C.append(t_call)             # record meta-level plan
                output = execute(t_call)     # object-level execution
                S.append(output)
            elif t_call.name == "final_answer":
                a = t_call.args
    return a, C
```
Evaluation is conducted both on final answer accuracy and on the alignment of the model’s tool call sequence with a reference set of essential actions, which represent the minimal necessary tool invocations for a given task. Precision and recall are computed by matching predicted and essential calls, normalized for semantic equivalence. Further metrics include error rates on tool invocations and fidelity of task decomposition (Ferguson et al., 12 Jan 2026).
| Metric | Definition |
|---|---|
| Final Answer Accuracy | $\text{Acc} = \frac{\text{# correct answers}}{\text{total}}$ |
| Tool-call Precision | $P = \frac{\lvert \hat{C} \cap C^\ast \rvert}{\lvert \hat{C} \rvert}$, with $\hat{C}$ the predicted and $C^\ast$ the essential calls |
| Tool-call Recall | $R = \frac{\lvert \hat{C} \cap C^\ast \rvert}{\lvert C^\ast \rvert}$ |
| Execution Error Rate | Fraction of tasks with at least one tool-call error |
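The precision/recall computation can be sketched as follows; the `equivalent` predicate stands in for the semantic-equivalence normalization described above, with exact match used here as a placeholder:

```python
def tool_call_metrics(predicted, essential, equivalent=lambda p, e: p == e):
    """Precision/recall of predicted tool calls against the essential set."""
    matched, hits = set(), 0
    for p in predicted:
        for i, e in enumerate(essential):
            if i not in matched and equivalent(p, e):
                matched.add(i)           # each essential call matches at most once
                hits += 1
                break
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(essential) if essential else 0.0
    return precision, recall
```

For example, a model that issues two of three essential calls and nothing spurious scores precision 1.0 and recall 2/3.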
3. Application Domains and Task Designs
Multi-hop Tabular Question Answering
The paradigm was exemplified in a suite of twenty question templates over World Bank indicators, requiring decomposition into multiple tool-mediated retrieval, computation, and comparison steps. Each template is mapped to 2–5 sequential tool calls—such as searching indicator codes, retrieving tabular values, and performing arithmetic aggregation—culminating in a final answer (Ferguson et al., 12 Jan 2026).
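A plausible decomposition of one such template might look like the following; the tool names and argument schema are illustrative assumptions, not the paper's actual interface:

```python
# Hypothetical meta-level plan for the template:
# "Which country had the higher GDP per capita in 2020, France or Germany?"
plan = [
    {"name": "search_indicator", "args": {"query": "GDP per capita"}},
    {"name": "get_value", "args": {"country": "FRA", "year": 2020}},
    {"name": "get_value", "args": {"country": "DEU", "year": 2020}},
    {"name": "compare", "args": {"op": "max"}},
]
# Consistent with the 2-5 sequential tool calls per template reported above.
```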
Bioinformatics
In domains like protein function prediction, Reasoning-Then-Tool-Call takes the form of tightly interleaved chains of biological hypothesis formation and targeted invocation of domain-specific tools. Agents such as PFUA maintain explicit state representations, select tools using learned policies, and iteratively update belief states based on tool outputs, producing verifiable, grounded biological explanations (Fan et al., 7 Jan 2026).
RL-Based Agentic Reasoning
Frameworks such as ARTIST treat reasoning and tool-calling within a Markov Decision Process. At each state, the LLM policy selects either an internal reasoning (language) action or an external tool call. Tool calls are treated as first-class actions, and reinforcement learning (outcome and sometimes intra-step rewards) is used to induce precise strategies for when and which tools to call (Singh et al., 28 Apr 2025).
Tool Retrieval and Function-Oriented QA
Systems including CoreThink and ToolDreamer further specialize the paradigm for practical settings involving code execution, file navigation, or large-scale tool databases. Here, external retrievers (trained with LLM-generated hypothetical tool descriptions) offload part of the reasoning, selecting a relevant tool subset before LLM-driven proposal and execution (Bhat et al., 27 Oct 2025, Sengupta et al., 22 Oct 2025).
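The retrieval step can be sketched with a deliberately toy lexical retriever; production systems such as ToolDreamer train dense retrievers on LLM-generated hypothetical tool descriptions instead:

```python
def retrieve_tools(query, tool_descriptions, k=3):
    """Toy lexical retriever: rank tools by token overlap with the query.

    tool_descriptions maps tool name -> natural-language description.
    A real system would embed query and descriptions and use vector search.
    """
    q = set(query.lower().split())
    ranked = sorted(
        tool_descriptions.items(),
        key=lambda item: -len(q & set(item[1].lower().split())),
    )
    return [name for name, _ in ranked[:k]]
```

The retrieved subset is then handed to the LLM, which plans and executes only over those k tools, keeping the meta-level prompt tractable for large tool databases.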
4. Reinforcement Learning and Policy Optimization
Reinforcement learning methods are prominent in inducing optimal Reasoning-Then-Tool-Call behavior, particularly in multi-turn, function-calling, or hybrid reasoning-agentic tasks.
- Policy Space: The action space comprises both reasoning tokens and tool-call tokens. The policy is updated via GRPO-style clipped policy gradients, typically masking tool-output tokens so that environment-generated text receives no gradient.
- Reward Structure: Rewards can include final-answer correctness, explicit rewards for valid tool-call format and successful execution, and penalties for redundant calls, excessive length, or malformed outputs.
- Strategic Emergence: RL policies learn to invoke tools selectively under uncertainty or when chain-of-thought confidence drops, adaptively shifting between internal reasoning and tool use depending on environment feedback (Singh et al., 28 Apr 2025, Wu et al., 8 Oct 2025).
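The token-masking described above can be sketched as a clipped policy-gradient objective over a mixed trajectory; this is a simplified scalar version (no grouping or KL term), with all argument names assumed for illustration:

```python
import math

def masked_clipped_pg_loss(logp_new, logp_old, advantages, gen_mask, clip_eps=0.2):
    """PPO/GRPO-style clipped loss over a reasoning + tool-call trajectory.

    gen_mask is 1 for model-generated tokens and 0 for tool-output tokens,
    which are excluded from the objective (no gradient flows through them).
    """
    total, n = 0.0, 0
    for ln, lo, adv, m in zip(logp_new, logp_old, advantages, gen_mask):
        if not m:
            continue                          # mask environment (tool) tokens
        ratio = math.exp(ln - lo)             # importance ratio pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
        n += 1
    return -total / max(n, 1)
```

Masking matters because tool outputs are deterministic environment text: optimizing their likelihood would push probability mass toward tokens the policy never chose.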
5. Empirical Findings and Performance Analysis
- Meta-Reasoning Competence: Strong “reasoning-tuned” models (e.g., Qwen 3 32B, Mistral Small) achieve 0.80–0.85 final-answer accuracy, tool-call precision ≈ 0.9, and recall ≈ 0.8.
- Shot Effects: Increasing the number of in-context tool-call examples (n-shot) does not significantly improve accuracy, though it may reduce error rates for specific models.
- Error Recovery: Multiple models demonstrate resilience to execution errors, replanning effectively on error feedback.
- Object-Level Weakness: Removing arithmetic tools and relying on the LLM’s generative numeracy leads to accuracy drops of 10–15 percentage points, evidencing persistent difficulty with non-symbolic computation.
- Generalization: Across diverse question templates and tool-use settings, model performance shows minor variance and robust generalization—though sensitivity to task complexity and tool set granularity persists (Ferguson et al., 12 Jan 2026).
6. Limitations and Future Directions
- Plan Diversity and Richness: Essential-action sets are typically single linear chains; future work should explore richer plan spaces, including gold-plan graphs and metrics of plan diversity.
- Partial Observability, Data Uncertainty: Extending Reasoning-Then-Tool-Call to scenarios with missing or unreliable data will require meta-level decision-making about imputation, exploration, and uncertainty propagation.
- Online Replanning and Adaptation: Mechanisms for revising plans in light of mid-execution surprises, beyond basic error recovery, remain underdeveloped.
- Domain and Tool Set Scope: Current studies focus on structured domains (e.g., tabular data, biology, mathematical code). Systematic exploration in heterogeneous tool environments, code synthesis, and knowledge base querying is necessary.
- Expressive Tool Interfaces: Real tasks often permit multiple reasoning paths to a solution; current metrics and frameworks should support this natural flexibility (Ferguson et al., 12 Jan 2026).
7. Broader Context and Significance
Reasoning-Then-Tool-Call operationalizes a clear, auditable separation of planning (what to do) and acting (how to compute/retrieve), providing a modular interface between LLM “thought” and the external computational world. Its adoption enables LLMs to both leverage and coordinate powerful external resources, mitigate object-level weaknesses, and produce structured artifacts (traces, tool logs) for explainability and debugging. Precision/recall metrics over essential actions complement end-task accuracy, enabling finer-grained, process-aware model evaluation. As this framework expands into ever-wider domains, ongoing development will focus on richer planning, adaptive tool retrieval, online replanning, and domain-general agentic orchestration (Ferguson et al., 12 Jan 2026, Fan et al., 7 Jan 2026, Singh et al., 28 Apr 2025).