
Interleaved Thinking and Tool Use

Updated 28 January 2026
  • Interleaved thinking and tool use are defined as a method where internal reasoning steps alternate with explicit external tool invocations, improving task efficiency and reliability.
  • The paradigm employs a formal structure based on discrete alternation, often modeled as a Markov Decision Process, to integrate tools like DSL commands, image modifiers, and code editors for verifiable outcomes.
  • Empirical validations across domains such as code repair, multimodal reasoning, and mathematical problem solving demonstrate significant accuracy gains and improved self-correction capabilities.

Interleaved Thinking and Tool Use refers to a paradigm in which language and multimodal models dynamically alternate between internal reasoning steps (“thinking”) and invoking external tools (“tool use”), forming a tightly coupled chain of cognitive and physical actions. This strategy improves both the reliability and efficiency of complex reasoning tasks, enabling models to ground predictions in verifiable, external computations or perceptual evidence. By reducing the action space and providing dense, per-turn feedback, interleaving tools and reasoning allows for effective learning—especially in smaller models—and avoids the inefficiency and brittleness of free-form chain-of-thought alone.

1. Methodological Foundations: Interleaving Protocols and Formalization

Interleaved thinking and tool use can be cast as an agentic control process, typically formalized as a Markov Decision Process (MDP) or a sequence model alternating between reasoning and tool-invocation tokens. In recent systems, this alternation is realized via:

  • Discrete alternation between “thought” and “action” steps, with each action representing either an internal reasoning statement or an explicit tool invocation (e.g., DSL command, code block, JSON call) (Rainone et al., 7 Jul 2025, Chen et al., 29 Dec 2025, Chen et al., 2023).
  • A perceive–act cycle: the model observes the current state (environment, context, tool feedback), emits a control token, the tool executes and returns a new observation, and the cycle repeats until a termination condition is met (task solved or aborted).
  • Action spaces are often rigid (small DSL, fixed API schemas) to facilitate stable RL and verifiable outcomes (Rainone et al., 7 Jul 2025).
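As an illustration, the perceive–act cycle above can be sketched as a toy episode runner. The names here (`EvalEnv`, `ScriptedAgent`, the `tool:calc` convention) are invented for this example and stand in for a real model policy and tool environment:

```python
from dataclasses import dataclass

@dataclass
class EvalEnv:
    """Toy tool environment: a 'calc' tool evaluates arithmetic strings."""
    expr: str

    def reset(self):
        return f"task: compute {self.expr}"

    def step(self, action):
        # Only recognized tool calls produce a real observation.
        if action.startswith("tool:calc "):
            return f"result: {eval(action.removeprefix('tool:calc '))}"
        return "noop"

@dataclass
class ScriptedAgent:
    """Stands in for an LM policy: emits a thought, a tool call, then exit."""
    plan: list

    def act(self, obs):
        return self.plan.pop(0)

def run_episode(agent, env, max_turns=8):
    """Alternate agent actions and environment observations until 'exit'."""
    obs = env.reset()
    trace = [obs]
    for _ in range(max_turns):
        action = agent.act(obs)          # "think: ..." or "tool: ..." or "exit"
        trace.append(action)
        if action == "exit":
            break
        obs = env.step(action)           # tool feedback becomes the next observation
        trace.append(obs)
    return trace
```

Running it with a scripted plan produces the thought → tool call → observation → exit alternation that trained policies learn to emit on their own.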

This formal structure manifests in varied application domains:

| System | Reasoning Representation | Tool Use Modality | Turn Alternation |
| --- | --- | --- | --- |
| CoE (Rainone et al., 7 Jul 2025) | Edit DSL commands | Code editor | One DSL action per turn |
| MindWatcher | Natural language / JSON | Multimodal tools | Token-by-token |
| ReTool | NL tokens, `<code>` tags | Live code execution | NL ↔ code |
| DeepEyes | CoT text, crop tool call | Image crop/zoom | Text ↔ tool |

In all cases, the interleaving is emergent from the model policy: the agent decides at each step whether to “think” or “act,” and this policy is trained with strong feedback tied to external tool outcomes.

2. Action Space Design and Tool Interfaces

The action space for tool use is typically constrained by design, to promote stable learning and safe exploration:

  • Highly-structured DSLs or API schemas: Limited sets of editor commands (add, replace, delete, exit); standardized JSON arguments for multimodal or search tools; code blocks for interpreters (Rainone et al., 7 Jul 2025, Chen et al., 29 Dec 2025).
  • Tool interface as a stateful environment: Editors, image crop tools, code sandboxes, retrieval engines, or segmenters are invoked and their outputs appended to the model context. For instance, after each code edit, the executor runs tests and appends feedback; for images, new crops are featurized and injected (Rainone et al., 7 Jul 2025, Zheng et al., 20 May 2025).
  • Observation encoding: Execution traces, test pass/failures, error messages, segmented images, or retrieved documents become part of the next turn’s context.
  • Termination protocol: explicit “EXIT” commands, <answer> tags, or maximum-turn limits enforce bounded trajectory lengths.

This structure yields dense, per-turn reward signals, enables rapid credit assignment, and facilitates efficient sampling during RL or SFT.
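A minimal sketch of such a constrained interface, assuming a hypothetical four-command editor DSL carried as JSON (the schema and `validate_action` helper are illustrative, not taken from any cited system):

```python
import json

# Fixed action schema: command name -> required arguments and their types.
# Mirrors the small editor DSL described above (add/replace/delete/exit).
DSL = {
    "add":     {"line": int, "text": str},
    "replace": {"line": int, "text": str},
    "delete":  {"line": int},
    "exit":    {},
}

def validate_action(raw: str) -> dict:
    """Parse a JSON-encoded action and check it against the fixed schema.

    Rejecting malformed actions at the interface keeps the effective
    action space small, which stabilizes RL and makes outcomes verifiable.
    """
    act = json.loads(raw)
    cmd = act.get("cmd")
    schema = DSL.get(cmd)
    if schema is None:
        raise ValueError(f"unknown command: {cmd!r}")
    for arg, typ in schema.items():
        if not isinstance(act.get(arg), typ):
            raise ValueError(f"{cmd}: argument {arg!r} must be {typ.__name__}")
    return act
```

A well-formed call such as `{"cmd": "replace", "line": 3, "text": "x = 1"}` passes, while a missing or mistyped argument raises immediately, which can be surfaced to the model as a format penalty.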

3. Learning Frameworks: SFT, RL, and Verifiable Rewards

State-of-the-art interleaved paradigms rely on hybrid training protocols, typically supervised fine-tuning (SFT) followed by reinforcement learning (RL) with verifiable rewards and carefully shaped loss structures.

Ablations confirm that RL with verifiable and/or adaptive rewards substantially improves tool-use strategy and overall performance, especially in small to mid-sized models.
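As a sketch of how a dense, verifiable per-turn reward might be composed from test outcomes and a format check (the weights and function are illustrative assumptions, not a formula from any cited paper):

```python
def turn_reward(tests_passed: int, tests_total: int, well_formatted: bool,
                w_task: float = 1.0, w_fmt: float = 0.1) -> float:
    """Dense per-turn reward: a graded, verifiable task signal from test
    outcomes plus a small bonus for emitting a schema-conformant action."""
    task = w_task * (tests_passed / tests_total) if tests_total else 0.0
    fmt = w_fmt if well_formatted else 0.0
    return task + fmt
```

Because both components are externally checkable (tests either pass or they do not, the action either parses or it does not), the reward is verifiable rather than model-judged, which is what makes small-scale RL stable in these setups.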

4. Applications and Empirical Impact

The interleaved paradigm has been deployed across diverse reasoning and perception tasks:

  • Code repair (CoE protocol): Small LMs (1–3B) trained with interleaved editor commands outperform both direct-answer and free-form CoT baselines in pass@1 metrics on MBPP-style Python repair, showing robust error recovery and self-correction (Rainone et al., 7 Jul 2025).
  • Multimodal reasoning (MindWatcher, DeepEyes, Simple o3, VTool-R1): Interleaved vision–language chains improve grounding, fine-grained perception, and chart/table reasoning, often outperforming larger, text-only models and closed-source baselines (Chen et al., 29 Dec 2025, Zheng et al., 20 May 2025, Wang et al., 16 Aug 2025, Wu et al., 25 May 2025).
  • Mathematical tool use (IMP-TIP, ReTool): Strategic alternation between chain-of-thought and tool/calc/code leads to substantial accuracy gains (e.g., +11.2% GSM8K-Hard, +27% AIME over text-only RL) and reduces arithmetic hallucination (Chen et al., 2023, Feng et al., 15 Apr 2025).
  • Adaptive tool use (AdaTooler-V): Delta-score-driven selective tool utilization cuts inference cost and outperforms both always-tool and text-only policies on multi-modal and video reasoning (Wang et al., 18 Dec 2025).

Quantitative improvements include:

| Model / Domain | Interleaved Protocol | Baseline (Accuracy %) | Interleaved (Accuracy %) | Gain |
| --- | --- | --- | --- | --- |
| Llama-3.2-3B (MBPP Repair) | CoE RLVR | 6.9 / 12.0 | 13.8 / 19.0 | +6.9 / +7.0 |
| Qwen2.5-VL-7B (HR-4K) | DeepEyes iMCoT | 68.8 | 75.1 | +6.3 |
| ChatGPT (GSM8K-Hard) | IMP-TIP | 56.0 | 65.2 | +9.2 |
| AdaTooler-V-7B (V*) | AT-GRPO | 88.2 | 89.8 | +1.6 |

5. Decision Mechanisms and Selective Tool Invocation

Recent work explores not just alternating “thinking” and “tool” steps, but also the adaptive gating of tool invocation:

  • Meta-cognition triggers: Learned linear probes project representations to scalar “readiness” scores, with dual thresholds determining whether the model should call an external tool or proceed internally (Li et al., 18 Feb 2025).
  • Adaptive reward scaling: a per-sample tool benefit ΔS_i modulates the reward for tool use, penalizing unnecessary calls and incentivizing only helpful ones, thus balancing efficiency and performance (Wang et al., 18 Dec 2025).
  • Hard/soft gating: gating emerges from next-token logits (a softmax over tool vs. non-tool tokens); some models define explicit “gap” regions or fallback policies for when the meta-cognition score is ambiguous.

These mechanisms reduce unnecessary tool calls, minimize inference delay, and curb downstream errors, yielding robust and cost-effective tool-augmented agents.
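A simplified sketch of delta-score reward modulation in the spirit of the adaptive scheme above (the functional form, argument names, and penalty constant are assumptions for illustration):

```python
def adaptive_tool_reward(base_reward: float, used_tool: bool,
                         acc_with_tool: float, acc_without_tool: float,
                         penalty: float = 0.2) -> float:
    """Scale reward by the per-sample tool benefit.

    delta_s > 0 means tools helped on this sample, so tool-using
    trajectories are amplified; delta_s <= 0 means the call was
    unnecessary, so it is penalized. Non-tool trajectories pass through.
    """
    delta_s = acc_with_tool - acc_without_tool   # per-sample tool benefit
    if not used_tool:
        return base_reward
    if delta_s > 0:
        return base_reward * (1.0 + delta_s)     # helpful call: amplify
    return base_reward - penalty                 # unnecessary call: penalize
```

The effect is the selective-invocation behavior described above: the policy learns to call tools only on samples where doing so measurably changes the outcome.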

6. Generalization, Limitations, and Future Directions

The interleaved thinking and tool use paradigm demonstrates:

  • Broad generality: It is applicable to code repair, mathematical calculation, visual reasoning, document retrieval, and medical VQA/segmentation (Rainone et al., 7 Jul 2025, Gao et al., 2024, Chen et al., 29 Dec 2025, Jiang et al., 16 Dec 2025).
  • Parameter efficiency: Small and mid-sized models benefit most from constrained action and dense per-turn feedback.
  • Modularity and extensibility: Tool interfaces and reward structures are modular, enabling adaptation to new domains and toolkits.

However, some limitations and open questions remain:

  • Scope is occasionally limited to synthetic or curated tasks; extension to unconstrained open-world scenarios or large tool libraries demands more sophisticated orchestration and planning (Rainone et al., 7 Jul 2025).
  • Compute and inference cost: Despite savings from efficient tool use, multi-turn trajectories with external calls remain expensive (Jiang et al., 16 Dec 2025).
  • Abstraction versus immediacy: Paradigms decoupling reasoning planning from tool execution (“Chain-of-Abstraction”) may further improve latency and generalization (Gao et al., 2024).

Future research may focus on hierarchical planner–executor architectures, continual learning for meta-cognition triggers, scaling multitool libraries, and blending abstract planning with real-time interleaved execution.

7. Representative Algorithms and Exemplars

Key algorithmic templates and cycles include:

```python
# Single-DSL-action-per-turn loop (CoE-style): the model emits one editor
# command per turn, the environment applies it, and the episode ends on
# EXIT or once all tests pass. LM, apply, EXIT, etc. are pseudocode names.
for t in range(T):
    context = [prompt, state]
    action = LM.generate_one_DSL_token(context)
    next_state = apply(action, state)
    reward = task_reward + format_reward
    if action == EXIT or all_tests_pass(next_state):
        break
    state = next_state
```

```python
# Token-level interleaving: decode until a control token appears; tool
# calls are executed and their observations appended to the context.
# parse_tool_call is an assumed helper that extracts the name and
# arguments of the call the model just emitted.
done = False
while not done:
    token = LLM.decode(state)
    if token == "<tool_call>":
        tool_name, args = parse_tool_call(state)
        obs = ToolEnv.invoke(tool_name, args)
        state.append(obs)
    elif token == "<answer>":
        done = True
```

```python
# Dual-threshold meta-cognitive gating: the readiness score s_meta decides
# between answering internally, calling a tool, or a fallback policy for
# the ambiguous "gap" region between the two thresholds.
if s_meta < l_no:
    return ANSWER_INTERNALLY
elif s_meta > l_yes:
    return CALL_TOOL
else:
    return fallback_decision()
```

These patterns underscore the consensus that model–tool alternation underpins modern agentic reasoning pipelines across text, code, and multimodal perception.
