Unified Tool-Based Action Space
- Unified Tool-Based Action Space is a framework that integrates all discrete and parameterized tool actions into one structured domain for learning, planning, and execution.
- It overcomes the fragmentation of modality-specific action spaces by unifying policies across perception, language, reasoning, and manipulation.
- The approach supports reinforcement, imitation, and multi-agent learning, yielding measurable gains in accuracy, sample efficiency, and generalization.
A unified tool-based action space is a formalism in which all tool-use actions, spanning perception, language, reasoning, and manipulation, are encoded within a single, structured domain supporting learning, planning, coordination, and execution. This paradigm arose to overcome the fragmentation of agent policies, tool-call APIs, and modality-specific action spaces across vision-LLMs, language agents, robot learning, and multi-agent reinforcement learning. Contemporary research establishes the unified tool-based action space as foundational for sample-efficient, robust, and extensible agentic systems.
1. Formalization of the Unified Tool-Based Action Space
The unified tool-based action space is defined as the union of all discrete (and possibly parameterized) actions invoked by an agent, including both primitive operations and calls to specialized tools. In the SpaceTools framework, the action space $\mathcal{A}$ is partitioned as follows (Chen et al., 3 Dec 2025):

$$\mathcal{A} = \mathcal{A}_{\text{tool}} \cup \mathcal{A}_{\text{ans}}, \qquad \mathcal{A}_{\text{tool}} = \bigcup_{k=1}^{K} \{t_k\} \times \Theta_k,$$

where $\{t_1, \dots, t_K\}$ is the set of tools, each $t_k$ with its argument space $\Theta_k$, and $\mathcal{A}_{\text{ans}}$ is the set of terminal answer actions. Individual actions thus take the form of a parameterized tool call $a = (t_k, \theta)$ with $\theta \in \Theta_k$, or a terminal answer $a \in \mathcal{A}_{\text{ans}}$.
This formalism generalizes to language agents (Kim et al., 2024), where each action combines a tool identifier $t$ with a linguistic description $d$, and to multi-agent settings (Yu et al., 2024), where the unified space fuses all possible agent actions into a superset with semantic grouping.
Action parameterizations cover both low-level primitives (e.g., click, type, scroll), higher-level APIs (e.g., set_cell_values in spreadsheets), and structured arguments for visual, reasoning, or robotic tools (Yang et al., 20 Oct 2025).
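The partition above can be made concrete in code. The following is an illustrative sketch, not from any cited codebase: tool names, argument signatures, and the `validate` helper are all hypothetical, but the structure mirrors the formalism (a set of tools with argument spaces, plus terminal answer actions).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolSpec:
    """A tool t_k with its argument signature (a stand-in for Theta_k)."""
    name: str
    arg_names: tuple

@dataclass(frozen=True)
class Action:
    """Either a parameterized tool call (tool, args) or a terminal answer."""
    tool: str           # tool identifier, or "answer" for terminal actions
    args: tuple = ()    # bound arguments drawn from the tool's argument space

class UnifiedActionSpace:
    """A = (union over tools of {t_k} x Theta_k) plus the answer actions."""
    def __init__(self, tools, answer_action="answer"):
        self.tools = {t.name: t for t in tools}
        self.answer_action = answer_action

    def validate(self, action: Action) -> bool:
        """Check that an action lies in the unified space."""
        if action.tool == self.answer_action:
            return True
        spec = self.tools.get(action.tool)
        return spec is not None and len(action.args) == len(spec.arg_names)

# Example space mixing a GUI primitive and a perception tool.
space = UnifiedActionSpace([
    ToolSpec("click", ("x", "y")),
    ToolSpec("detect", ("label",)),
])
```

Because primitives, tool calls, and answers share one schema, a single policy head can score all of them without per-modality branches.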
2. State, Observation, and Trajectory Models
The state observed by the agent at each turn incorporates the task context, dialogue/action history, and most recent tool outputs. In tool-augmented VLMs (Chen et al., 3 Dec 2025), the state at turn $t$ is $s_t = (x, h_{t-1}, o_t)$, where $x$ encodes the input (image, question), $h_{t-1}$ collects the full action–observation dialogue up to turn $t$, and $o_t$ comprises the last tool results.
In language agents (Kim et al., 2024), the state is $s_t = (q, (a_1, o_1), \dots, (a_{t-1}, o_{t-1}))$, with $q$ the instruction and $(a_i, o_i)$ the sequence of step/output pairs. For multi-agent RL, all agents act in a factored Dec-POMDP, with each agent localizing its own action set into the global unified action space through available-action masks (Yu et al., 2024).
In all settings, the Markov property is maintained because the full history—action, tool call, and tool return—is present at each step.
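A minimal sketch of this state model, with hypothetical names (`make_state`, `step`) rather than any cited API: the state bundles the fixed task context, the growing action-observation history, and the latest tool output, so the next state depends only on the current state and action.

```python
def make_state(context, history, last_obs):
    """State s_t = (task context, action-observation history, last tool output)."""
    return {"context": context, "history": tuple(history), "obs": last_obs}

def step(state, action, tool_result):
    """Transition: append (action, result) to the history, expose the new result.

    Because the full history is carried inside the state, the process is
    Markov in s_t even though individual tool calls are not."""
    new_history = state["history"] + ((action, tool_result),)
    return make_state(state["context"], new_history, tool_result)

# Toy rollout: a VQA-style task issuing one perception-tool call.
s0 = make_state(("image.png", "How far is the mug?"), [], None)
s1 = step(s0, ("detect", ("mug",)), {"bbox": (4, 8, 20, 30)})
```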
3. Learning and Optimization in Unified Tool Spaces
Unified tool-based action spaces enable reinforcement learning (RL), imitation learning, or supervised finetuning to jointly optimize tool selection, sequencing, and execution.
SpaceTools DIRL employs a two-phase RL curriculum (Chen et al., 3 Dec 2025):
- Teaching Phase: Mixes SFT with demonstration-regularized RL using both specialist (single-tool) and universal (multi-tool) teachers, jointly optimizing the RL objective and an imitation term on teacher demonstrations.
- Exploration Phase: Continues RL with Group Relative Policy Optimization (GRPO), initialized from the teaching-phase policy, to refine multi-turn, multi-tool coordination.
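The mixing of imitation and RL signals in the teaching phase can be sketched numerically. This is a toy objective under assumed weighting, not the exact SpaceTools formula: an SFT term (negative log-likelihood of teacher actions) is added to a policy-gradient term weighted by advantages.

```python
def mixed_loss(logps_teacher, logps_policy, advantages, lam=0.5):
    """Toy demonstration-regularized objective (illustrative, not the
    published formula): policy-gradient loss plus lam * SFT loss.

    logps_teacher: policy log-probs of teacher-demonstrated actions
    logps_policy:  policy log-probs of its own sampled actions
    advantages:    per-action advantage estimates (e.g. group-relative)
    """
    sft = -sum(logps_teacher) / len(logps_teacher)
    pg = -sum(lp * a for lp, a in zip(logps_policy, advantages)) / len(logps_policy)
    return pg + lam * sft
```

Annealing `lam` toward zero recovers pure RL, matching the curriculum's shift from teaching to exploration.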
Language agents (Husky (Kim et al., 2024), VerlTool (Jiang et al., 1 Sep 2025)) implement similar multi-turn RL or distillation from high-quality tool-augmented teacher trajectories, absorbing both action selection and tool execution policies in a unified architecture.
Hybrid environments such as UltraCUA (Yang et al., 20 Oct 2025) combine GUI primitives and structured tool APIs, translating demonstration trajectories into a singular action space for two-stage SFT and RL over verifiable rewards.
Multi-agent systems (U-QMIX, U-MAPPO (Yu et al., 2024)) share a single neural “UAS-head” across physically heterogeneous agents, using available-action masks and auxiliary losses (CGI) to support partial observability, heterogeneity, and inter-group coordination.
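The available-action masking that makes a single shared head work across heterogeneous agents can be sketched as follows; this is a minimal illustration (hypothetical function, plain-Python softmax), not the U-QMIX/U-MAPPO implementation.

```python
import math

def masked_softmax(logits, available):
    """Score the full unified action space, but zero out actions this
    agent cannot execute by masking their logits to -inf before softmax."""
    masked = [l if ok else float("-inf") for l, ok in zip(logits, available)]
    m = max(masked)
    exps = [math.exp(l - m) if l != float("-inf") else 0.0 for l in masked]
    z = sum(exps)
    return [e / z for e in exps]

# A ground agent sharing a head with an aerial agent: action 1 (say,
# "fly") is unavailable to it, so its probability is exactly zero.
probs = masked_softmax([2.0, 1.0, 0.5], [True, False, True])
```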
4. Modular, Extensible Tool Integration and System Design
Unified tool-based action spaces are modular by construction. All tools—software APIs, language functions, vision operators, robotic primitives—are abstracted as discrete actions or programmatic calls. Integration mechanisms include:
- Token-based tool APIs: Each tool is represented by a reserved vocabulary token and argument signature (Jiang et al., 1 Sep 2025, Yang et al., 20 Oct 2025).
- Plugin architectures: Tools conform to a base interface (e.g., parse_action, conduct_action, update_env), permitting dynamic discovery, parsing, execution, and result formatting (Jiang et al., 1 Sep 2025).
- Hybrid action selection: The agent network decodes either primitive (e.g., click, type) or tool-invocation tokens; no need for mode switches or specialized heads (Yang et al., 20 Oct 2025).
- Decoupled learning/execution: Policies are trained to output over the full action space, with a separate tool server or environment manager handling external execution and state transition.
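The plugin-style base interface mentioned above can be sketched with the hook names the text cites (`parse_action`, `conduct_action`, `update_env`); the registry and the calculator tool are hypothetical examples, not part of any cited framework.

```python
class BaseTool:
    """Minimal plugin interface: every tool parses, executes, and folds
    its observation back into the environment state."""
    name = "base"

    def parse_action(self, raw: str):
        """Extract this tool's arguments from a raw model action string."""
        raise NotImplementedError

    def conduct_action(self, args):
        """Execute the tool and return an observation string."""
        raise NotImplementedError

    def update_env(self, env_state, observation):
        """Record the observation in the shared environment state."""
        env_state.setdefault("observations", []).append(observation)
        return env_state

class CalculatorTool(BaseTool):
    name = "calc"

    def parse_action(self, raw):
        # e.g. "calc(3+4)" -> "3+4"
        return raw[len("calc("):-1]

    def conduct_action(self, args):
        # Arithmetic-only eval with builtins stripped (illustrative only;
        # a real tool server would use a proper expression parser).
        return str(eval(args, {"__builtins__": {}}, {}))

# Dynamic discovery: new tools extend the space by registration alone.
REGISTRY = {t.name: t for t in [CalculatorTool()]}
tool = REGISTRY["calc"]
obs = tool.conduct_action(tool.parse_action("calc(3+4)"))
```

Adding a new tool requires only a new subclass and a registry entry; the policy's output space grows without architectural changes, which is the modularity claim above.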
This modularity enables rapid incorporation of new tools, action types, or modalities—extending agent capabilities with minimal architectural changes (Kim et al., 2024, Yang et al., 20 Oct 2025).
5. Empirical Validation and Performance Impact
Unified tool-based action spaces consistently yield enhanced accuracy, efficiency, and generalization:
- SpaceTools achieves 70% accuracy on RoboSpatial-Home (vs. 58% for SFT-only and 54% for tool-free RL, i.e., +12 and +16 points), 90% on BLINK, 34% pose IoU on BOP-ASK (+33 points over tool-free), and 86% real-world robot success (vs. 65% for GPT-5+Toolshed) (Chen et al., 3 Dec 2025).
- VerlTool shows parity with or superiority over domain-specific baselines across six domains, with asynchronous rollouts providing a 1.2–2× speed-up and emergent strategic tool use (Jiang et al., 1 Sep 2025).
- UltraCUA improves OSWorld success rate (7B model: 27.0%, up from 23.4% GUI-only), achieves 21.7% zero-shot generalization to WindowsAgentArena, and improves step efficiency: average step count falls from 9.31 (GUI-only) to 8.46 (hybrid) (Yang et al., 20 Oct 2025).
- Husky matches or exceeds GPT-4 on mixed-tool tasks (HuskyQA: 20.9% vs. 20.2%) despite orders-of-magnitude smaller parameter count (7B vs. frontier) (Kim et al., 2024).
- U-QMIX/U-MAPPO achieve 0.81–0.99 win-rate across difficult SMAC maps, outperforming single-agent baselines or non-unified policy variants (e.g., 0.13 for QMIX on challenging maps) (Yu et al., 2024).
Ablation studies confirm that removing unified action spaces, demonstration RL, or multi-tool optimization degrades model performance by double-digit percentages in accuracy or success rate (Chen et al., 3 Dec 2025). The action unification thus directly explains the observed gains.
6. Structured Action Space for Embodied Manipulation and Robotics
In robotic and embodied domains, unified tool-based action spaces are formalized via continuous-to-discrete tokenizations or task-agnostic pose representations:
- FACT (Flow-matching Action Tokenizer) compresses high-dimensional continuous trajectories into sequences of discrete tokens drawn from a fixed codebook vocabulary, with a fixed number of codes per chunk, supporting joint optimization of reasoning and fine-grained motor control (Liu et al., 30 Dec 2025). GenieReasoner, trained with FACT, demonstrates 82.7% reasoning accuracy and a 0.71 aggregate manipulation score, with millimeter-level trajectory fidelity.
- Task-frame tool action representations encode all tool poses in a global static frame, enabling embodiment-agnostic imitation from human demonstration to robotic replay and generalization across tool types, kinematics, and sensors (Chen et al., 6 Apr 2025).
- Contact-state graphs and action primitives structure in-hand manipulation into reusable skills (detach, crossover, attach), sequencing policies in a broader contact-state transition system for robust tool regrasping (Saito et al., 2024).
These constructions yield generalizable, data-efficient, and extensible policies for complex physical tasks, including dynamic tool-use, cross-embodiment transfer, and high-precision manipulation.
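The continuous-to-discrete tokenization idea can be illustrated with plain nearest-neighbor quantization of a trajectory chunk against a fixed codebook. This is a toy stand-in for action tokenizers like FACT: the codebook, dimensionality, and squared-Euclidean distance are assumptions, and the flow-matching construction itself is not reproduced.

```python
def quantize(chunk, codebook):
    """Map each continuous waypoint in a chunk to the index of its
    nearest codebook entry (squared Euclidean distance)."""
    def nearest(p):
        return min(range(len(codebook)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(p, codebook[i])))
    return [nearest(p) for p in chunk]

# Toy 2-D codebook and a three-waypoint trajectory chunk.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
tokens = quantize([(0.1, -0.1), (0.9, 0.2), (0.1, 1.1)], codebook)
```

Once trajectories are token sequences, motor control and symbolic reasoning share one discrete output space, which is what lets a single model be optimized jointly for both.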
7. Planning, Reasoning, and Efficient Navigation in Large Action Spaces
Scalability of unified tool-based action spaces is contingent on effective navigation algorithms:
- ToolChain* models the space as a decision tree over sequences of API calls, employing A* search with learned cost functions to balance exploration and exploitation in large, compositional action spaces (Zhuang et al., 2023). ToolChain* outperforms MCTS-based search (up to 7.35× faster, with 3.1–3.5% higher accuracy) and applies to math, planning, and tool-augmented QA tasks.
- Trajectory policies in VerlTool or SpaceTools treat the tool-based action sequence as the primary object of optimization, leveraging asynchrony and multi-turn history to avoid deadlocks and maximize multi-tool compositionality (Jiang et al., 1 Sep 2025, Chen et al., 3 Dec 2025).
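The tree-search view above can be sketched as best-first (A*-style) search over tool-call sequences. This is a generic illustration of the search scheme, not ToolChain*'s actual cost model: the expansion rule, zero heuristic, and goal test below are toy placeholders.

```python
import heapq

def astar_plan(start, expand, heuristic, is_goal, max_nodes=10_000):
    """Best-first search over tool-call sequences.

    expand(seq) yields (next_seq, step_cost) pairs; nodes are ordered by
    f = g + h, so with an admissible heuristic this returns a cheapest
    goal sequence (or None if the budget is exhausted)."""
    frontier = [(heuristic(start), 0.0, start)]
    seen = set()
    while frontier and max_nodes > 0:
        max_nodes -= 1
        f, g, seq = heapq.heappop(frontier)
        if is_goal(seq):
            return seq
        if seq in seen:
            continue
        seen.add(seq)
        for nxt, cost in expand(seq):
            heapq.heappush(frontier, (g + cost + heuristic(nxt), g + cost, nxt))
    return None

# Toy domain: unit-cost calls from a three-tool vocabulary; the goal is
# any three-call plan that terminates with "answer".
TOOLS = ("search", "calc", "answer")
expand = lambda seq: [(seq + (t,), 1.0) for t in TOOLS] if len(seq) < 3 else []
plan = astar_plan((), expand, lambda s: 0.0,
                  lambda s: len(s) == 3 and s[-1] == "answer")
```

With a zero heuristic this degenerates to uniform-cost search; the interesting regime is a heuristic that scores partial tool sequences, which is what prunes spaces with hundreds of tools.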
A plausible implication is that unified tool-based action spaces—when paired with suitable planning/search—scale gracefully to domains with hundreds or thousands of possible tool calls, supporting robust multi-step reasoning or manipulation.
In summary, unified tool-based action spaces integrate all modalities and operations—perception, language, control, reasoning—into an explicit, extensible, and learnable domain for agentic decision-making. This approach underlies state-of-the-art advances across vision-LLMs, language agents, robotic control, computer-use agents, and multi-agent systems, yielding improved generalization, sample efficiency, and system modularity (Chen et al., 3 Dec 2025, Kim et al., 2024, Chen et al., 6 Apr 2025, Jiang et al., 1 Sep 2025, Yang et al., 20 Oct 2025, Yu et al., 2024, Zhuang et al., 2023, Liu et al., 30 Dec 2025, Saito et al., 2024).