Tool-Centric Inverse Dynamics Model
- TC-IDM is an innovative framework that integrates generative video planning with tool-centric control to produce accurate 6-DoF robot motions.
- It combines analytic geometry and vision-driven learning to extract 3D tool trajectories and gripper commands from RGB-D observations.
- Experimental results show TC-IDM improves success rates by 30–50 percentage points over baselines, excelling in tasks including deformable object handling.
The Tool-Centric Inverse Dynamics Model (TC-IDM) is an architectural and algorithmic approach for bridging the gap between high-level generative world models, typically realized as video-based vision-language planners, and the execution of low-level, physically actionable robot control commands. By anchoring on the imagined trajectory of the robot's tool (i.e., end-effector) generated by a world model, TC-IDM provides a robust and interpretable intermediate representation, enabling the translation of visually planned actions into 6-DoF end-effector motions and corresponding gripper commands, even in scenarios involving novel tools, diverse end-effectors, and previously unseen object interactions, including with deformable materials. TC-IDM combines geometry-grounded inverse dynamics, a learned vision-driven grasp policy, and nonparametric modeling, supporting both analytic and data-driven components while exhibiting viewpoint invariance and strong generalization (MI et al., 26 Jan 2026, Haninger et al., 2019).
1. Formal Definition and Interface
TC-IDM operates at the intersection of generative planning and low-level control. The primary interface consists of:
- Inputs:
  - A generated RGB video $\hat{V} = (\hat{I}_1, \ldots, \hat{I}_T)$ produced by a video-based world model, accompanied by depth maps and camera poses.
  - An initial RGB-D observation $o_0$ and a task instruction $\ell$.
- Output:
  - A timed sequence of control commands $a_{1:T} = \{(T_t, g_t)\}_{t=1}^{T}$, where:
    - $T_t \in SE(3)$ encodes the 6-DoF pose of the tool center point (TCP).
    - $g_t \in [0, 1]$ is the continuous gripper aperture.

The planned trajectory is captured as a cloud of 3D points (waypoints) on the tool, $P_t = \{p_t^i\}_{i=1}^{N} \subset \mathbb{R}^3$, extracted from the generated video and depth. This tool-centric representation is both more tolerant to scene complexity and directly mappable to robot motions (MI et al., 26 Jan 2026).
2. Mathematical Formulation
TC-IDM comprises both analytic and learned components:
- Analytic Geometry-Grounded Inverse Dynamics:
  - For each time pair $(t, t+1)$, given point correspondences $\{(p_t^i, p_{t+1}^i)\}_{i=1}^{N}$, solve for the rigid transformation

    $$(R_t, \tau_t) = \operatorname*{arg\,min}_{R \in SO(3),\ \tau \in \mathbb{R}^3} \sum_{i=1}^{N} \left\lVert R\, p_t^i + \tau - p_{t+1}^i \right\rVert^2 .$$

  - The solution yields the 6-DoF TCP motion $T_{t \to t+1} = (R_t, \tau_t) \in SE(3)$.
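The least-squares rigid alignment above admits a closed-form solution via SVD (the Kabsch algorithm). A minimal NumPy sketch, assuming matched point sets of shape (N, 3); function and variable names are illustrative, not from the paper:

```python
import numpy as np

def rigid_align(P, Q):
    """Closed-form least-squares rigid transform (R, t) mapping P onto Q.

    P, Q: (N, 3) arrays of corresponding 3D tool points at consecutive
    timesteps. Returns R (3x3 rotation) and t (3,) such that
    R @ P[i] + t approximates Q[i] in the least-squares sense (Kabsch).
    """
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)   # centroids
    H = (P - cP).T @ (Q - cQ)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cQ - R @ cP
    return R, t
```

Applied to the filtered tool points of consecutive frames, this recovers the per-step TCP motion without any learned weights.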
- Learned Vision-Driven Gripper Control:
  - Dense semantic embeddings are extracted using a frozen DINOv3 encoder: $f_t = \Phi_{\mathrm{DINOv3}}(\hat{I}_t)$.
  - An MLP "GripperHead" regresses the gripper aperture: $\hat{g}_t = \mathrm{MLP}_\theta(f_t)$.
  - Trained by imitation, the loss is

    $$\mathcal{L}_{\mathrm{grip}} = \big\lVert \hat{g}_t - g_t^{\star} \big\rVert^2 ,$$

    where $g_t^{\star}$ is the ground-truth aperture from demonstrations.
- Combined loss (for the learned branch): $\mathcal{L} = \mathcal{L}_{\mathrm{pose}} + \lambda\, \mathcal{L}_{\mathrm{grip}}$. In the canonical TC-IDM, $\mathcal{L}_{\mathrm{pose}}$ is omitted because the TCP pose is obtained by analytic inversion rather than regression.
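The learned branch can be sketched as a small MLP head over a pooled visual embedding. The embedding dimension, layer sizes, and function names below are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def gripper_head_forward(f_t, W1, b1, W2, b2):
    """Hypothetical two-layer 'GripperHead' MLP: visual embedding -> aperture.

    f_t: (D,) pooled DINOv3-style embedding. The output is squashed to
    [0, 1] so it can be read directly as a continuous gripper aperture.
    """
    h = np.maximum(0.0, W1 @ f_t + b1)   # ReLU hidden layer
    z = W2 @ h + b2                      # scalar logit
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid -> aperture in [0, 1]

def imitation_loss(g_pred, g_demo):
    """Squared-error imitation loss against demonstrated apertures."""
    return float(np.mean((g_pred - g_demo) ** 2))
```

In practice the head would be trained with gradient descent on demonstration pairs; only the forward pass and loss are sketched here.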
3. Architectural Components
The architecture divides the inference and action mapping workflow into modular stages:
- Segmentation and 3D Motion Estimation:
- Gripper/tool masks are generated via SAM 3 segmentation of the imagined video frames.
- Conditioned on these masks, a 3D point tracker (SpatialTrackerv2) yields dense point trajectories.
- Rigid-body filtering selects the point tracks on the tool that best fit a rigid transform, enforcing tool-centricity.
- Decoupled Action Heads:
- A geometry-based head analytically computes TCP pose via rigid alignment; this branch is nonparametric and requires no learned weights.
- A vision-driven head operates on high-dimensional DINOv3 embeddings to produce the gripper aperture; this branch is learned via MLP and is supervised by demonstration data.
- The decoupling ensures semantic grasp cues (e.g., open/close, pinch) are distinguished from 3D spatial motion (MI et al., 26 Jan 2026).
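The rigid-body filtering step above can be sketched as fit-then-threshold: fit a single rigid transform to all tracked points (the least-squares alignment of Section 2), then keep only tracks whose residual is small, since points on deforming cloth, articulated fingers, or the background violate the rigid model. A minimal single-pass sketch; a real implementation might iterate RANSAC-style:

```python
import numpy as np

def rigid_fit(P, Q):
    """Least-squares rigid transform (Section 2), via the Kabsch algorithm."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    U, _, Vt = np.linalg.svd((P - cP).T @ (Q - cQ))
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cQ - R @ cP

def rigid_inliers(P, Q, tol=0.01):
    """Boolean mask of point tracks consistent with one rigid motion.

    P, Q: (N, 3) matched track positions at two timesteps. Tracks whose
    residual under the best-fit transform exceeds `tol` are rejected,
    enforcing tool-centricity.
    """
    R, t = rigid_fit(P, Q)
    residuals = np.linalg.norm(P @ R.T + t - Q, axis=1)
    return residuals < tol
```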
4. Operational Paradigm: Planning, Representation, and Translation
- Plan-and-Translate:
- Planning: A video world model (e.g., WoW, Cosmos2, Kling) accepts initial state and instruction to generate the future RGB video encapsulating the tool’s full trajectory.
- Representation: This video is transformed into 3D tool-point clouds using segmentation and depth alignment.
- Translation: At each timestep, analytic pose extraction (yielding $T_t$) and learned gripper inference (yielding $g_t$) are combined into a final robot command $a_t = (T_t, g_t)$, streamed at 100–500 Hz to the robot controller for real-time execution (MI et al., 26 Jan 2026).
- Inference Pipeline Summary:
| Step | Operation | Method/Component |
|---|---|---|
| 1 | Get initial state | Sensor RGB-D, text |
| 2 | Generate video | World model (diffusion/transformer) |
| 3 | Align depth/camera pose | Sensor fusion |
| 4 | Segment gripper/tool | SAM 3 |
| 5 | Track points | SpatialTrackerv2 |
| 6 | Filter for tool | Rigid-body criterion |
| 7 | Extract actions | Analytic + MLP heads |
| 8 | Stream | Robot control layer |
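The table above can be read as a driver loop. The schematic sketch below stubs every component (`segment_tool`, `track_points`, and so on) with hypothetical placeholders; a real system would call SAM 3, SpatialTrackerv2, and the learned gripper head here, and none of these function names are actual APIs:

```python
import numpy as np

# Hypothetical stand-ins for the real components; not actual APIs.
def segment_tool(frame):           # step 4: tool/gripper mask
    return np.ones(frame.shape[:2], dtype=bool)

def track_points(frames, mask):    # step 5: dense 3D point tracks (T, N, 3)
    return np.zeros((len(frames), 16, 3))

def rigid_filter(tracks):          # step 6: keep tracks fitting one rigid body
    return tracks                  # identity placeholder

def extract_pose(P_t, P_next):     # step 7a: analytic TCP motion (Section 2)
    return np.eye(4)               # placeholder 4x4 SE(3) transform

def infer_aperture(frame):         # step 7b: learned gripper aperture
    return 0.5

def plan_and_translate(video):
    """Steps 4-8 of the pipeline table: generated video -> command stream."""
    mask = segment_tool(video[0])
    tracks = rigid_filter(track_points(video, mask))
    commands = []
    for t in range(len(video) - 1):
        T_t = extract_pose(tracks[t], tracks[t + 1])
        g_t = infer_aperture(video[t])
        commands.append((T_t, g_t))  # step 8: stream (T_t, g_t) to the robot
    return commands
```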
5. Experimental Evaluation and Comparative Analysis
TC-IDM was evaluated on nine real-world manipulation tasks stratified by difficulty, plus zero-shot deformable-object manipulation (cloth removal, cloth folding, hoodie folding). The primary metric was binary task success rate:
- Overall (across video models): 61.11%
- Easy and medium tasks: 77.7%
- Zero-shot deformable: 38.46%
Compared to end-to-end vision-language-action (VLA) baselines (e.g., OpenVLA, RT-2, Octo) and video-conditioned inverse dynamics models (AVDC, VidBot, AnyPos), TC-IDM yielded 30–50 percentage points higher success on hard tasks, near-perfect performance on easy cases, and zero-shot deformable-object capability absent in the baselines. These results establish TC-IDM as a robust "last-mile" bridge, generalizing to new viewpoints (Apple Pro & D435i cameras), cross-embodiment tasks (single- vs. dual-arm), and unseen cloth handling (MI et al., 26 Jan 2026).
6. Foundations in Nonparametric Tool-Centric Inverse Dynamics
The TC-IDM concept generalizes earlier nonparametric, tool-aware inverse dynamics, as described by Haninger & Tomizuka (Haninger et al., 2019):
- In multimodal tool scenarios, explicit clustering via (soft) EM or collapsed Gibbs sampling over Gaussian Process (GP) models associates each tool mode $k$ with a distinct inverse dynamics residual predictor $\hat{f}_k$.
- The system leverages the tool's identity (mode) and online experience to learn or switch between GPs, with real-time assignment and disturbance (collision) rejection.
- Formal passivity is shown by constructing a composite storage function $V$, guaranteeing that the closed loop with impedance control remains passive.
- For a new tool, the framework discovers and adapts a new GP mode, while outlier detection mechanisms exclude transient disturbances, maintaining safe feedforward control.
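The per-mode GP residual predictors described above can be sketched with standard RBF-kernel GP regression: each tool mode keeps its own training buffer, and the posterior variance gates fallback to plain impedance control. A minimal single-mode sketch (the multimodal version would maintain one such model per EM-assigned cluster); class and parameter names are illustrative:

```python
import numpy as np

class GPResidual:
    """One GP mode: predicts an inverse-dynamics torque residual.

    Exact GP regression with an RBF kernel. Inference is O(n^3) in the
    buffer size, which is why sparse approximations (FITC, SOD) or
    bounded buffers are needed for real-time deployment.
    """
    def __init__(self, lengthscale=1.0, noise=1e-2):
        self.ls, self.noise = lengthscale, noise

    def _k(self, A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / self.ls ** 2)

    def fit(self, X, y):
        """Condition on buffered (state, torque-residual) pairs."""
        self.X = X
        K = self._k(X, X) + self.noise * np.eye(len(X))
        self.L = np.linalg.cholesky(K)
        self.alpha = np.linalg.solve(self.L.T, np.linalg.solve(self.L, y))

    def predict(self, Xs):
        """Posterior mean and variance; high variance signals that the
        controller should revert to plain impedance control rather than
        trust the feedforward residual."""
        Ks = self._k(Xs, self.X)
        mean = Ks @ self.alpha
        v = np.linalg.solve(self.L, Ks.T)
        var = 1.0 - (v ** 2).sum(axis=0) + self.noise
        return mean, var
```

The variance-based gating mirrors the fallback behavior described in Section 7: far from buffered data the posterior reverts to the prior, and the feedforward term is disabled.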
7. Implementation and Practical Considerations
- Data Acquisition: Controlled explorations with low-gain impedance generate tool-specific data. For real-time operation, sparse GP mechanisms or buffer-based truncation are employed.
- Online Inference: At each control step, the system estimates expected torque (for the tool mode) and its confidence, reverting to fallback impedance control on uncertainty or exogenous perturbations (Haninger et al., 2019).
- Scalability: Naive GP inference is cubic in sample count, necessitating fast approximate methods (FITC, SOD) or bounded buffers for deployment.
- Safety and Generalization: Passivity guarantees, disturbance rejection, and decoupling of tool-centric motion from semantic gripper cues contribute to robust and interpretable autonomous operation.
By leveraging the tool’s predicted trajectory, the Tool-Centric Inverse Dynamics Model framework enables a robust, modular, and generalizable approach for grounding high-level video-generated plans in executable robot actions, supporting complex manipulation across a range of tool, viewpoint, and object variability (MI et al., 26 Jan 2026, Haninger et al., 2019).