Tool-Centric Inverse Dynamics Model

Updated 2 February 2026
  • TC-IDM is a framework that integrates generative video planning with tool-centric control to produce accurate 6-DoF robot motions.
  • It combines analytic geometry and vision-driven learning to extract 3D tool trajectories and gripper commands from RGB-D observations.
  • Experimental results show TC-IDM improves success rates by 30–50 percentage points over baselines, excelling in tasks including deformable object handling.

The Tool-Centric Inverse Dynamics Model (TC-IDM) is an architectural and algorithmic approach for bridging the gap between high-level generative world models, typically realized as video-based vision-language planners, and the execution of low-level, physically actionable robot control commands. By anchoring on the imagined trajectory of the robot's tool (i.e., end-effector) generated by a world model, TC-IDM provides a robust and interpretable intermediate representation. This enables the translation of visually planned actions into 6-DoF end-effector motions and corresponding gripper commands, even in scenarios involving novel tools, diverse end-effectors, and previously unseen object interactions, including with deformable materials. TC-IDM combines geometry-grounded inverse dynamics, a learned vision-driven grasp policy, and nonparametric modeling, supporting both analytic and data-driven components and providing viewpoint invariance and strong generalization (MI et al., 26 Jan 2026, Haninger et al., 2019).

1. Formal Definition and Interface

TC-IDM operates at the intersection of generative planning and low-level control. The primary interface consists of:

  • Inputs:
    • Generated RGB video $V_{\text{rgb-gen}} = \{I_{\text{rgb-gen}}^t\}_{t=0}^T$ produced by a video-based world model, accompanied by depth maps and camera poses.
    • An initial RGB-D observation and task instruction $L$.
  • Output:
    • A timed sequence of control commands $u(t) = \left( u_{\text{TCP}}(t), u_{\text{gripper}}(t)\right)$, where:
    • $u_{\text{TCP}}(t) \in SE(3)$ encodes the 6-DoF pose of the tool center point (TCP).
    • $u_{\text{gripper}}(t) \in \mathbb{R}$ is the continuous gripper aperture.

The planned trajectory is captured as a cloud of 3D points (waypoints) on the tool, $\tau_{\text{tool}}(t) = \{\mathbf{x}_i^t\}_{i=1}^K$, extracted from generated video and depth. This tool-centric representation is both more tolerant to scene complexity and directly mappable to robot motions (MI et al., 26 Jan 2026).
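The interface above can be sketched as a pair of simple container types. The names (`ToolCommand`, `ToolTrajectory`) and the assumed aperture range are illustrative conveniences, not from the paper:

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class ToolCommand:
    """One timestep of the output u(t) = (u_TCP(t), u_gripper(t))."""
    tcp_pose: np.ndarray      # 4x4 homogeneous matrix representing an SE(3) pose
    gripper_aperture: float   # continuous aperture; the [0, 1] range is an assumption

@dataclass
class ToolTrajectory:
    """Planned tool waypoints: K tracked 3D points per timestep."""
    points: np.ndarray        # shape (T+1, K, 3)

    def timesteps(self) -> int:
        return self.points.shape[0]

# Example: a 10-frame plan tracking 32 points on the tool.
traj = ToolTrajectory(points=np.zeros((10, 32, 3)))
cmd = ToolCommand(tcp_pose=np.eye(4), gripper_aperture=0.5)
```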

2. Mathematical Formulation

TC-IDM comprises both analytic and learned components:

  • Analytic Geometry-Grounded Inverse Dynamics:
    • For each time pair $(t,\,t{+}1)$, given point correspondences $\mathcal{T}_{\text{gripper}}^t, \mathcal{T}_{\text{gripper}}^{t+1}$, solve for the rigid transformation:

    $$(\mathbf{R}_t, \mathbf{p}_t) = \arg\min_{\mathbf{R}\in SO(3),\,\mathbf{p}\in\mathbb{R}^3}\sum_{i=1}^K \left\| \mathbf{R}\,\mathbf{x}_i^t + \mathbf{p} - \mathbf{x}_i^{t+1} \right\|^2$$

    • The solution yields the 6-DoF TCP motion $u_{\text{TCP}}(t+1) = (\mathbf{R}_t, \mathbf{p}_t)$.

  • Learned Vision-Driven Gripper Control:

    • Dense semantic embeddings $f_{\text{dino}}^t$ are extracted using a frozen DINOv3 encoder: $F_{\text{dino}} = \{f_{\text{dino}}^t\}_{t=0}^T$.
    • An MLP "GripperHead" regresses the gripper aperture:

    $$u_{\text{gripper}}(t) = \mathrm{GripperHead}(f_{\text{dino}}^t)$$

    • Trained by imitation, the loss is

    $$\mathcal{L}_{\text{gripper}} = \sum_{t=0}^{T} \left\|u_{\text{gripper}}(t) - u_{\text{gripper}}^\star(t)\right\|^2$$

    where $u_{\text{gripper}}^\star(t)$ is the ground-truth aperture from demonstrations.

  • Combined loss (for learned branch):

    $$\mathcal{L} = \mathcal{L}_{\text{gripper}} + \lambda\,\mathcal{L}_{\text{TCP}}$$

    In the canonical TC-IDM, $\mathcal{L}_{\text{TCP}}$ is omitted due to analytic inversion.
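The analytic branch above is a standard least-squares rigid alignment, solvable in closed form via SVD (the Kabsch method). A minimal sketch, assuming point correspondences are already given as NumPy arrays:

```python
import numpy as np

def rigid_align(x_t: np.ndarray, x_t1: np.ndarray):
    """Least-squares rigid transform (R, p) mapping x_t onto x_t1.

    x_t, x_t1: (K, 3) corresponding tool points at consecutive frames.
    Returns R in SO(3), p in R^3 minimizing sum ||R x_i^t + p - x_i^{t+1}||^2.
    """
    c_t, c_t1 = x_t.mean(axis=0), x_t1.mean(axis=0)
    H = (x_t - c_t).T @ (x_t1 - c_t1)           # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    p = c_t1 - R @ c_t
    return R, p

# Sanity check: recover a known rotation and translation from noiseless points.
rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
p_true = np.array([0.1, -0.2, 0.05])
R_est, p_est = rigid_align(pts, pts @ R_true.T + p_true)
assert np.allclose(R_est, R_true, atol=1e-8)
assert np.allclose(p_est, p_true, atol=1e-8)
```

In the full system this runs once per consecutive frame pair of the filtered tool-point tracks, yielding the sequence of TCP motions.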

3. Architectural Components

The architecture divides the inference and action mapping workflow into modular stages:

  • Segmentation and 3D Motion Estimation:

    • Gripper/tool masks are generated via SAM 3 segmentation of the imagined video frames.
    • Conditioned on these masks, a 3D point tracker (SpatialTrackerv2) yields dense point trajectories.
    • Rigid-body filtering selects the $K$ point tracks on the tool that best fit a rigid transform, enforcing tool-centricity.
  • Decoupled Action Heads:
    • A geometry-based head analytically computes TCP pose via rigid alignment; this branch is nonparametric and requires no learned weights.
    • A vision-driven head operates on high-dimensional DINOv3 embeddings to produce the gripper aperture; this branch is learned via MLP and is supervised by demonstration data.
    • The decoupling ensures semantic grasp cues (e.g., open/close, pinch) are distinguished from 3D spatial motion (MI et al., 26 Jan 2026).
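The vision-driven head can be sketched as a one-hidden-layer MLP over a frame embedding. The embedding size, the randomly initialized weights, and the sigmoid output range are illustrative assumptions, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
EMB, HID = 768, 128   # 768 matches common ViT-B embedding width (an assumption)

# Random weights stand in for the imitation-learned parameters.
W1, b1 = rng.normal(0.0, 0.02, (HID, EMB)), np.zeros(HID)
W2, b2 = rng.normal(0.0, 0.02, (1, HID)), np.zeros(1)

def gripper_head(f_dino: np.ndarray) -> float:
    """One-hidden-layer MLP regressing gripper aperture from an embedding.
    Squashing the output to (0, 1) via a sigmoid is an assumption."""
    h = np.maximum(0.0, W1 @ f_dino + b1)               # ReLU hidden layer
    return float(1.0 / (1.0 + np.exp(-(W2 @ h + b2)[0])))

aperture = gripper_head(rng.normal(size=EMB))
```

The analytic TCP head, by contrast, needs no parameters at all, which is what makes the decoupling cheap to train.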

4. Operational Paradigm: Planning, Representation, and Translation

  • Plan-and-Translate:
    • Planning: A video world model (e.g., WoW, Cosmos2, Kling) accepts initial state and instruction to generate the future RGB video encapsulating the tool’s full trajectory.
    • Representation: This video is transformed into 3D tool-point clouds using segmentation and depth alignment.
    • Translation: At each timestep, analytic pose extraction (for $u_{\text{TCP}}$) and learned gripper inference combine to yield the final robot command $u(t)$, streamed at 100–500 Hz to the robot controller for real-time execution (MI et al., 26 Jan 2026).
  • Inference Pipeline Summary:
| Step | Operation | Method/Component |
|------|-----------|------------------|
| 1 | Get initial state | Sensor RGB-D, text |
| 2 | Generate video | World model (diff/transformer) |
| 3 | Align depth/camera pose | Sensor fusion |
| 4 | Segment gripper/tool | SAM 3 |
| 5 | Track points | SpatialTrackerv2 |
| 6 | Filter for tool | Rigid-body criterion |
| 7 | Extract actions | Analytic + MLP heads |
| 8 | Stream $u(t)$ | Robot control layer |

5. Experimental Evaluation and Comparative Analysis

TC-IDM was evaluated on nine real-world manipulation tasks stratified by difficulty and on zero-shot deformable object manipulations (cloth removal, folding, hoodie folding). The primary metric was binary task success rate, with results:

  • Overall (across video models): 61.11%
  • Simple (easy and medium): 77.7%
  • Zero-shot deformable: 38.46%

Compared to end-to-end vision-language-action (VLA) baselines (e.g., $\pi_0$, OpenVLA, RT-2, Octo) and video-conditioned inverse dynamics models (AVDC, VidBot, AnyPos), TC-IDM yielded 30–50 percentage points higher success on hard tasks, near-perfect performance on easy cases, and unprecedented zero-shot capability for deformable objects. These results establish TC-IDM as a robust “last-mile” bridge, generalizing to new viewpoints (Apple Pro & D435i), cross-embodiment tasks (single- vs. dual-arm), and unseen cloth handling (MI et al., 26 Jan 2026).

6. Foundations in Nonparametric Tool-Centric Inverse Dynamics

The TC-IDM concept generalizes earlier nonparametric, tool-aware inverse dynamics, as described by Haninger & Tomizuka (Haninger et al., 2019):

  • In multimodal tool scenarios, explicit clustering via (soft) EM or collapsed Gibbs over Gaussian Process (GP) models associates each tool mode $k$ with a distinct inverse dynamics residual predictor $h_k(x) = \tau - g(q)$.
  • The system leverages the tool's identity (mode) and online experience to learn or switch between GPs, with real-time assignment and disturbance (collision) rejection.
  • Formal passivity is shown by constructing a composite storage function $S(q, \dot{q})$, guaranteeing the closed loop with impedance control remains passive.
  • For a new tool, the framework discovers and adapts a new GP mode, while outlier detection mechanisms exclude transient disturbances, maintaining safe feedforward control.
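The per-mode GP machinery can be sketched with an exact (non-sparse) GP per tool mode and a residual-based mode assignment. The squared-exponential kernel and the simple nearest-prediction assignment rule are illustrative assumptions, not the exact formulation of Haninger et al.:

```python
import numpy as np

def rbf(X, Y, ell=1.0):
    """Squared-exponential kernel matrix between point sets X (n,d) and Y (m,d)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

class GPResidual:
    """Exact GP predictor for one tool mode's residual h_k(x) = tau - g(q)."""
    def __init__(self, X, y, noise=1e-2):
        self.X = X
        K = rbf(X, X) + noise * np.eye(len(X))
        self.L = np.linalg.cholesky(K)
        self.alpha = np.linalg.solve(self.L.T, np.linalg.solve(self.L, y))

    def predict(self, x):
        """Posterior mean of the residual at query state x (shape (d,))."""
        k = rbf(x[None, :], self.X)[0]
        return k @ self.alpha

def assign_mode(models, x, tau_residual):
    """Assign the mode whose GP best explains the observed residual."""
    errs = [abs(m.predict(x) - tau_residual) for m in models]
    return int(np.argmin(errs))

# Two toy modes with different residual profiles over a 1-D state.
rng = np.random.default_rng(1)
X = rng.uniform(-2.0, 2.0, size=(30, 1))
gp_a = GPResidual(X, np.sin(X[:, 0]))
gp_b = GPResidual(X, np.cos(X[:, 0]))
```

Online, the assignment step doubles as disturbance rejection: an observation far from every mode's prediction can be flagged as an outlier rather than forced into a mode.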

7. Implementation and Practical Considerations

  • Data Acquisition: Controlled explorations with low-gain impedance generate tool-specific data. For real-time operation, sparse GP mechanisms or buffer-based truncation are employed.
  • Online Inference: At each control step, the system estimates expected torque (for the tool mode) and its confidence, reverting to fallback impedance control on uncertainty or exogenous perturbations (Haninger et al., 2019).
  • Scalability: Naive GP inference is cubic in sample count, necessitating fast approximate methods (FITC, SOD) or bounded buffers for deployment.
  • Safety and Generalization: Passivity guarantees, disturbance rejection, and decoupling of tool-centric motion from semantic gripper cues contribute to robust and interpretable autonomous operation.
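The buffer-based truncation mentioned above can be sketched with a fixed-size sample store; the drop-oldest eviction policy is an assumption for illustration:

```python
from collections import deque

import numpy as np

class BoundedGPBuffer:
    """Fixed-capacity sample buffer that bounds GP inference cost.

    Exact GP inference is O(n^3) in the sample count, so capping n keeps
    per-step cost constant. Evicting the oldest sample is one simple policy;
    informativeness-based eviction is another common choice.
    """
    def __init__(self, max_samples: int):
        self.X = deque(maxlen=max_samples)
        self.y = deque(maxlen=max_samples)

    def add(self, x, target):
        self.X.append(np.asarray(x, dtype=float))
        self.y.append(float(target))

    def arrays(self):
        """Return the current training set as dense arrays for GP fitting."""
        return np.array(self.X), np.array(self.y)

buf = BoundedGPBuffer(max_samples=50)
for i in range(200):
    buf.add([float(i)], float(i))   # stream 200 samples; only 50 are kept
X_buf, y_buf = buf.arrays()
```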

By leveraging the tool’s predicted trajectory, the Tool-Centric Inverse Dynamics Model framework enables a robust, modular, and generalizable approach for grounding high-level video-generated plans in executable robot actions, supporting complex manipulation across a range of tool, viewpoint, and object variability (MI et al., 26 Jan 2026, Haninger et al., 2019).
