Tool-Centric Inverse Dynamics Model
- TC-IDM is an innovative framework that integrates generative video planning with tool-centric control to produce accurate 6-DoF robot motions.
- It combines analytic geometry and vision-driven learning to extract 3D tool trajectories and gripper commands from RGB-D observations.
- Experimental results show TC-IDM improves success rates by 30–50 percentage points over baselines, excelling in tasks including deformable object handling.
The Tool-Centric Inverse Dynamics Model (TC-IDM) is an architectural and algorithmic approach for bridging the gap between high-level generative world models, typically realized as video-based vision-language planners, and the execution of low-level, physically actionable robot control commands. By anchoring on the imagined trajectory of the robot's tool (i.e., end-effector) generated by a world model, TC-IDM provides a robust and interpretable intermediate representation, enabling the translation of visually planned actions into 6-DoF end-effector motions and corresponding gripper commands, even in scenarios involving novel tools, diverse end-effectors, and previously unseen object interactions, including with deformable materials. TC-IDM combines geometry-grounded inverse dynamics, a learned vision-driven grasp policy, and nonparametric modeling, supporting both analytic and data-driven components while exhibiting viewpoint invariance and strong generalization (MI et al., 26 Jan 2026, Haninger et al., 2019).
1. Formal Definition and Interface
TC-IDM operates at the intersection of generative planning and low-level control. The primary interface consists of:
- Inputs:
  - A generated RGB video $\hat{V} = (\hat{I}_1, \ldots, \hat{I}_T)$ produced by a video-based world model, accompanied by depth maps and camera poses.
  - An initial RGB-D observation $o_0$ and a task instruction $\ell$.
- Output:
  - A timed sequence of control commands $a_{1:T} = \{(T_t, g_t)\}_{t=1}^{T}$, where:
    - $T_t \in SE(3)$ encodes the 6-DoF pose of the tool center point (TCP).
    - $g_t \in [0, 1]$ is the continuous gripper aperture.

The planned trajectory is captured as a cloud of 3D points (waypoints) on the tool, $P_t = \{p_t^i\}_{i=1}^{N} \subset \mathbb{R}^3$, extracted from the generated video and depth. This tool-centric representation is both more tolerant to scene complexity and directly mappable to robot motions (MI et al., 26 Jan 2026).
2. Mathematical Formulation
TC-IDM comprises both analytic and learned components:
- Analytic Geometry-Grounded Inverse Dynamics:
  - For each time pair $(t, t+1)$, given point correspondences $\{(p_t^i, p_{t+1}^i)\}_{i=1}^{N}$, solve for the rigid transformation

    $$(R_t, \tau_t) = \operatorname*{arg\,min}_{R \in SO(3),\ \tau \in \mathbb{R}^3} \sum_{i=1}^{N} \left\lVert R\, p_t^i + \tau - p_{t+1}^i \right\rVert^2 .$$

  - The solution yields the 6-DoF TCP motion $T_{t \to t+1} = (R_t, \tau_t) \in SE(3)$.
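The least-squares rigid alignment above admits a closed-form solution via SVD (the Kabsch algorithm). A minimal NumPy sketch, assuming matched point sets of shape (N, 3); function and variable names are illustrative, not from the paper:

```python
import numpy as np

def rigid_align(P, Q):
    """Closed-form least-squares rigid transform (R, t) mapping P onto Q.

    P, Q: (N, 3) arrays of corresponding 3D tool points at consecutive
    timesteps. Returns R (3x3 rotation) and t (3,) such that
    R @ P[i] + t approximates Q[i] in the least-squares sense (Kabsch).
    """
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)   # centroids
    H = (P - cP).T @ (Q - cQ)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cQ - R @ cP
    return R, t
```

Applied to the filtered tool points of consecutive frames, this recovers the per-step TCP motion without any learned weights.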
- Learned Vision-Driven Gripper Control:
  - Dense semantic embeddings are extracted using a frozen DINOv3 encoder: $f_t = \Phi_{\mathrm{DINOv3}}(\hat{I}_t)$.
  - An MLP "GripperHead" regresses the gripper aperture: $\hat{g}_t = \mathrm{MLP}_\theta(f_t)$.
  - Trained by imitation, the loss is

    $$\mathcal{L}_{\mathrm{grip}} = \big\lVert \hat{g}_t - g_t^{\star} \big\rVert^2 ,$$

    where $g_t^{\star}$ is the ground-truth aperture from demonstrations.
- Combined loss (for the learned branch): $\mathcal{L} = \mathcal{L}_{\mathrm{pose}} + \lambda\, \mathcal{L}_{\mathrm{grip}}$. In the canonical TC-IDM, $\mathcal{L}_{\mathrm{pose}}$ is omitted because the TCP pose is obtained by analytic inversion rather than regression.
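The learned branch can be sketched as a small MLP head over a pooled visual embedding. The embedding dimension, layer sizes, and function names below are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def gripper_head_forward(f_t, W1, b1, W2, b2):
    """Hypothetical two-layer 'GripperHead' MLP: visual embedding -> aperture.

    f_t: (D,) pooled DINOv3-style embedding. The output is squashed to
    [0, 1] so it can be read directly as a continuous gripper aperture.
    """
    h = np.maximum(0.0, W1 @ f_t + b1)   # ReLU hidden layer
    z = W2 @ h + b2                      # scalar logit
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid -> aperture in [0, 1]

def imitation_loss(g_pred, g_demo):
    """Squared-error imitation loss against demonstrated apertures."""
    return float(np.mean((g_pred - g_demo) ** 2))
```

In practice the head would be trained with gradient descent on demonstration pairs; only the forward pass and loss are sketched here.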
3. Architectural Components
The architecture divides the inference and action mapping workflow into modular stages:
- Segmentation and 3D Motion Estimation:
- Gripper/tool masks are generated via SAM 3 segmentation of the imagined video frames.
- Conditioned on these masks, a 3D point tracker (SpatialTrackerv2) yields dense point trajectories.
- Rigid-body filtering selects the point tracks on the tool that best fit a rigid transform, enforcing tool-centricity.
- Decoupled Action Heads:
- A geometry-based head analytically computes TCP pose via rigid alignment; this branch is nonparametric and requires no learned weights.
- A vision-driven head operates on high-dimensional DINOv3 embeddings to produce the gripper aperture; this branch is learned via MLP and is supervised by demonstration data.
- The decoupling ensures semantic grasp cues (e.g., open/close, pinch) are distinguished from 3D spatial motion (MI et al., 26 Jan 2026).
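The rigid-body filtering step above can be sketched as fit-then-threshold: fit a single rigid transform to all tracked points (the least-squares alignment of Section 2), then keep only tracks whose residual is small, since points on deforming cloth, articulated fingers, or the background violate the rigid model. A minimal single-pass sketch; a real implementation might iterate RANSAC-style:

```python
import numpy as np

def rigid_fit(P, Q):
    """Least-squares rigid transform (Section 2), via the Kabsch algorithm."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    U, _, Vt = np.linalg.svd((P - cP).T @ (Q - cQ))
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cQ - R @ cP

def rigid_inliers(P, Q, tol=0.01):
    """Boolean mask of point tracks consistent with one rigid motion.

    P, Q: (N, 3) matched track positions at two timesteps. Tracks whose
    residual under the best-fit transform exceeds `tol` are rejected,
    enforcing tool-centricity.
    """
    R, t = rigid_fit(P, Q)
    residuals = np.linalg.norm(P @ R.T + t - Q, axis=1)
    return residuals < tol
```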
4. Operational Paradigm: Planning, Representation, and Translation
- Plan-and-Translate:
- Planning: A video world model (e.g., WoW, Cosmos2, Kling) accepts initial state and instruction to generate the future RGB video encapsulating the tool’s full trajectory.
- Representation: This video is transformed into 3D tool-point clouds using segmentation and depth alignment.
- Translation: At each timestep, analytic pose extraction (yielding $T_t$) and learned gripper inference (yielding $g_t$) are combined into a final robot command $a_t = (T_t, g_t)$, streamed at 100–500 Hz to the robot controller for real-time execution (MI et al., 26 Jan 2026).
- Inference Pipeline Summary:
| Step | Operation | Method/Component |
|---|---|---|
| 1 | Get initial state | Sensor RGB-D, text |
| 2 | Generate video | World model (diffusion/transformer) |
| 3 | Align depth/camera pose | Sensor fusion |
| 4 | Segment gripper/tool | SAM 3 |
| 5 | Track points | SpatialTrackerv2 |
| 6 | Filter for tool | Rigid-body criterion |
| 7 | Extract actions | Analytic + MLP heads |
| 8 | Stream | Robot control layer |
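The table above can be read as a driver loop. The schematic sketch below stubs every component (`segment_tool`, `track_points`, and so on) with hypothetical placeholders; a real system would call SAM 3, SpatialTrackerv2, and the learned gripper head here, and none of these function names are actual APIs:

```python
import numpy as np

# Hypothetical stand-ins for the real components; not actual APIs.
def segment_tool(frame):           # step 4: tool/gripper mask
    return np.ones(frame.shape[:2], dtype=bool)

def track_points(frames, mask):    # step 5: dense 3D point tracks (T, N, 3)
    return np.zeros((len(frames), 16, 3))

def rigid_filter(tracks):          # step 6: keep tracks fitting one rigid body
    return tracks                  # identity placeholder

def extract_pose(P_t, P_next):     # step 7a: analytic TCP motion (Section 2)
    return np.eye(4)               # placeholder 4x4 SE(3) transform

def infer_aperture(frame):         # step 7b: learned gripper aperture
    return 0.5

def plan_and_translate(video):
    """Steps 4-8 of the pipeline table: generated video -> command stream."""
    mask = segment_tool(video[0])
    tracks = rigid_filter(track_points(video, mask))
    commands = []
    for t in range(len(video) - 1):
        T_t = extract_pose(tracks[t], tracks[t + 1])
        g_t = infer_aperture(video[t])
        commands.append((T_t, g_t))  # step 8: stream (T_t, g_t) to the robot
    return commands
```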
5. Experimental Evaluation and Comparative Analysis
TC-IDM was evaluated on nine real-world manipulation tasks stratified by difficulty, plus zero-shot deformable-object manipulation (cloth removal, cloth folding, hoodie folding). The primary metric was binary task success rate:
- Overall (across video models): 61.11%
- Easy and medium tasks: 77.7%
- Zero-shot deformable: 38.46%
Compared to end-to-end vision-language-action (VLA) baselines (e.g., OpenVLA, RT-2, Octo) and video-conditioned inverse dynamics models (AVDC, VidBot, AnyPos), TC-IDM yielded 30–50 percentage points higher success on hard tasks, near-perfect performance on easy cases, and zero-shot deformable-object capability absent in the baselines. These results establish TC-IDM as a robust "last-mile" bridge, generalizing to new viewpoints (Apple Pro & D435i cameras), cross-embodiment tasks (single- vs. dual-arm), and unseen cloth handling (MI et al., 26 Jan 2026).
6. Foundations in Nonparametric Tool-Centric Inverse Dynamics
The TC-IDM concept generalizes earlier nonparametric, tool-aware inverse dynamics, as described by Haninger & Tomizuka (Haninger et al., 2019):
- In multimodal tool scenarios, explicit clustering via (soft) EM or collapsed Gibbs sampling over Gaussian Process (GP) models associates each tool mode $k$ with a distinct inverse dynamics residual predictor $\hat{f}_k$.
- The system leverages the tool's identity (mode) and online experience to learn or switch between GPs, with real-time assignment and disturbance (collision) rejection.
- Formal passivity is shown by constructing a composite storage function $V$, guaranteeing that the closed loop with impedance control remains passive.
- For a new tool, the framework discovers and adapts a new GP mode, while outlier detection mechanisms exclude transient disturbances, maintaining safe feedforward control.
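The per-mode GP residual predictors described above can be sketched with standard RBF-kernel GP regression: each tool mode keeps its own training buffer, and the posterior variance gates fallback to plain impedance control. A minimal single-mode sketch (the multimodal version would maintain one such model per EM-assigned cluster); class and parameter names are illustrative:

```python
import numpy as np

class GPResidual:
    """One GP mode: predicts an inverse-dynamics torque residual.

    Exact GP regression with an RBF kernel. Inference is O(n^3) in the
    buffer size, which is why sparse approximations (FITC, SOD) or
    bounded buffers are needed for real-time deployment.
    """
    def __init__(self, lengthscale=1.0, noise=1e-2):
        self.ls, self.noise = lengthscale, noise

    def _k(self, A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / self.ls ** 2)

    def fit(self, X, y):
        """Condition on buffered (state, torque-residual) pairs."""
        self.X = X
        K = self._k(X, X) + self.noise * np.eye(len(X))
        self.L = np.linalg.cholesky(K)
        self.alpha = np.linalg.solve(self.L.T, np.linalg.solve(self.L, y))

    def predict(self, Xs):
        """Posterior mean and variance; high variance signals that the
        controller should revert to plain impedance control rather than
        trust the feedforward residual."""
        Ks = self._k(Xs, self.X)
        mean = Ks @ self.alpha
        v = np.linalg.solve(self.L, Ks.T)
        var = 1.0 - (v ** 2).sum(axis=0) + self.noise
        return mean, var
```

The variance-based gating mirrors the fallback behavior described in Section 7: far from buffered data the posterior reverts to the prior, and the feedforward term is disabled.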
7. Implementation and Practical Considerations
- Data Acquisition: Controlled explorations with low-gain impedance generate tool-specific data. For real-time operation, sparse GP mechanisms or buffer-based truncation are employed.
- Online Inference: At each control step, the system estimates expected torque (for the tool mode) and its confidence, reverting to fallback impedance control on uncertainty or exogenous perturbations (Haninger et al., 2019).
- Scalability: Naive GP inference is cubic in sample count, necessitating fast approximate methods (FITC, SOD) or bounded buffers for deployment.
- Safety and Generalization: Passivity guarantees, disturbance rejection, and decoupling of tool-centric motion from semantic gripper cues contribute to robust and interpretable autonomous operation.
By leveraging the tool’s predicted trajectory, the Tool-Centric Inverse Dynamics Model framework enables a robust, modular, and generalizable approach for grounding high-level video-generated plans in executable robot actions, supporting complex manipulation across a range of tool, viewpoint, and object variability (MI et al., 26 Jan 2026, Haninger et al., 2019).