TacUMI: Multi-Modal Manipulation Segmentation

Updated 28 January 2026
  • TacUMI is a multi-modal system that integrates tactile sensors, a 6-axis force/torque sensor, and drift-free pose tracking for advanced contact-rich manipulation tasks.
  • It fuses visual, tactile, force, and pose data through temporal neural networks, achieving over 90% segmentation accuracy in complex cable mounting scenarios.
  • TacUMI enables efficient learning from demonstration by synchronizing multi-modal observations, facilitating modular skill development for long-horizon robotic tasks.

TacUMI is a multi-modal data collection and segmentation system designed to facilitate understanding and learning of complex contact-rich manipulation tasks. Extending the Universal Manipulation Interface (UMI), TacUMI integrates high-resolution tactile sensors, a 6-axis force/torque (F/T) sensor, and drift-free 6-DoF pose tracking within a robot-compatible, handheld gripper platform. The system enables synchronized acquisition of tactile, force, pose, and RGB observations during human demonstrations, supporting robust task decomposition—an essential step for modular skill learning in long-horizon robotic manipulation (Cheng et al., 21 Jan 2026).

1. Motivation and System Architecture

TacUMI addresses the challenges inherent in contact-rich, long-horizon manipulation tasks, where visual and proprioceptive data alone are insufficient to resolve subtle event transitions such as cable tensioning. The original UMI provided handheld, low-cost gripper hardware for recording vision and pose during demonstrations but lacked contact sensing and exhibited pose drift over extended trajectories. TacUMI augments this framework through:

  • Integration of ViTac tactile sensors (GelSight Mini, 256×256 RGB at 16.7 Hz) into custom fingertip slots, providing deformation imaging.
  • Rigid mounting of a 6-axis F/T sensor (Bota Systems SensONE, 1000 Hz) between gripper body and handle, designed to mimic a robot flange, which ensures recorded wrenches directly represent robot-environment interactions.
  • Inclusion of a continuous rack-and-pinion trigger with mechanical self-locking, permitting operators to fix the gripper jaw width so that only external contact forces affect F/T readings.
  • Use of an HTC Vive Tracker delivering drift-free 6-DoF pose data (60 Hz) registered to the tool center point (TCP).
  • Acquisition of third-person RGB video using a RealSense D435i camera (60 Hz).
  • Synchronization of all sensor streams by resampling to the lowest common rate (16.7 Hz) via timestamp alignment, yielding one synchronized multi-modal observation vector per frame.
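The timestamp-based resampling described above can be sketched as follows. This is a hypothetical reconstruction, not TacUMI's released code: `resample_to_common_rate` and its nearest-timestamp strategy are assumptions, with the 16.7 Hz target taken from the paper.

```python
import numpy as np

def resample_to_common_rate(streams, target_hz=16.7):
    """Align heterogeneous sensor streams on a common time grid by
    nearest-timestamp lookup (sketch; TacUMI's exact method is not public).

    streams: dict name -> (timestamps [N], values [N, D]), timestamps in seconds.
    Returns: dict name -> values resampled on the shared grid.
    """
    # The common grid spans the temporal overlap of all streams,
    # sampled at the lowest common rate (16.7 Hz for the tactile sensor).
    t_start = max(ts[0] for ts, _ in streams.values())
    t_end = min(ts[-1] for ts, _ in streams.values())
    grid = np.arange(t_start, t_end, 1.0 / target_hz)

    aligned = {}
    for name, (ts, vals) in streams.items():
        # For each grid time, pick the sample with the nearest timestamp.
        idx = np.clip(np.searchsorted(ts, grid), 1, len(ts) - 1)
        prev_closer = (grid - ts[idx - 1]) < (ts[idx] - grid)
        idx = np.where(prev_closer, idx - 1, idx)
        aligned[name] = vals[idx]
    return aligned
```

Nearest-timestamp selection avoids interpolating quantities such as quaternions or tactile images, at the cost of up to half a source-sample period of jitter per stream.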

2. Multi-Modal Segmentation Framework

TacUMI employs a multi-modal skill segmentation pipeline based on temporal neural architectures. At each time step $t$, the system extracts:

  • 256-D embedding from tactile images via ResNet-50.
  • 256-D embedding from third-person RGB images via ResNet-18 with GroupNorm and Spatial Softmax.
  • 6-D vector from preprocessed F/T readings.
  • 14-D vector from left/right TCP poses (each 7-D: position + quaternion).

These modalities are concatenated to form $x_t \in \mathbb{R}^{532}$ for temporal modeling. Three backbone architectures are evaluated: (1) a three-layer bi-directional LSTM (128 hidden units per direction), (2) a Temporal Convolutional Network (TCN) with dilated 1-D convolutions and residual blocks, and (3) a transformer encoder using sinusoidal positional encoding and multi-head self-attention.
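The per-frame feature construction above can be sketched directly from the stated dimensions. The encoder outputs are stubbed with random vectors here, since the trained ResNet weights are not public; `fuse_frame` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_frame(tactile_emb, rgb_emb, ft, tcp_left, tcp_right):
    """Concatenate per-frame modality features into x_t in R^532 (sketch).

    tactile_emb:  256-D ResNet-50 embedding of the tactile image
    rgb_emb:      256-D ResNet-18 embedding of the third-person RGB image
    ft:           6-D preprocessed force/torque reading
    tcp_left/right: 7-D TCP poses (3-D position + unit quaternion)
    """
    x_t = np.concatenate([tactile_emb, rgb_emb, ft, tcp_left, tcp_right])
    assert x_t.shape == (532,)  # 256 + 256 + 6 + 7 + 7 = 532
    return x_t

# Stand-in embeddings in place of the real encoder outputs.
x = fuse_frame(rng.standard_normal(256), rng.standard_normal(256),
               rng.standard_normal(6), rng.standard_normal(7), rng.standard_normal(7))
```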

Training uses frame-wise cross-entropy loss over $C$ skill classes:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \sum_{c=1}^{C} y_{t,c} \log p_{t,c}(\theta)$$

Inference applies a sliding window of length 50 (stride 10) over the $x_t$ sequence, extracting window-level logits. Full-sequence labels are restored via soft voting:

$$\hat{y}_t = \arg\max_c \frac{1}{K_t} \sum_{k=1}^{K_t} p_{t,c}^{(k)}$$

where $K_t$ is the number of windows covering frame $t$ and $p_{t,c}^{(k)}$ is the class probability assigned to frame $t$ by the $k$-th such window.
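The sliding-window soft voting step can be sketched as follows. `frame_probs_fn` is a stand-in for any trained temporal backbone mapping a window of features [win, D] to per-frame class probabilities [win, C]; the window length 50 and stride 10 are from the paper.

```python
import numpy as np

def soft_vote_segment(frame_probs_fn, x, win=50, stride=10, n_classes=5):
    """Sliding-window inference with per-frame soft voting (sketch).

    x: [T, D] sequence of fused features.
    Returns: [T] predicted class labels.
    """
    T = x.shape[0]
    prob_sum = np.zeros((T, n_classes))
    counts = np.zeros(T)
    starts = list(range(0, max(T - win, 0) + 1, stride))
    if T >= win and starts[-1] != T - win:
        starts.append(T - win)  # ensure the tail frames are covered
    for s in starts:
        p = frame_probs_fn(x[s:s + win])   # [win, C] per-frame probabilities
        prob_sum[s:s + win] += p
        counts[s:s + win] += 1
    # Average overlapping window predictions, then take the argmax per frame.
    return np.argmax(prob_sum / counts[:, None], axis=1)
```

Averaging probabilities (rather than hard-voting labels) lets confident windows outweigh uncertain ones near skill boundaries.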

3. Modality Contribution and Fusion Analysis

TacUMI exploits complementary sensing modalities for contact-rich skill segmentation:

  • Visual data alone underperforms during contact-centric phases such as tensioning.
  • Tactile signals disambiguate surface deformation states (e.g., cable stretching).
  • F/T readings capture transitions between linear force and torque-dominated phases.
  • Pose information primarily reflects tool motion but offers minimal discrimination during stationary phases.

Fusing all four modalities in the TacUMI segmentation framework yields robust boundary detection, achieving greater than 90% frame-wise accuracy, while vision-only ablation reduces accuracy to 76.1%. Modal ablations reveal that adding tactile or F/T data significantly increases segmentation performance.

| Modalities Included | Frame-wise Accuracy (%) |
| --- | --- |
| Camera only | 76.1 |
| Camera + Tactile | 90.8 |
| Camera + F/T | 86.3 |
| Camera + Pose | 81.7 |
| Camera + Tactile + F/T | 93.6 |
| All (Camera + Tactile + F/T + Pose) | ≈94.0 |

Per-class F1 scores improve from 0.32 (transformer, vision only) to above 0.90 (bi-directional LSTM, full modalities), indicating that multi-modal fusion is essential for accurate and reliable event-boundary detection (Cheng et al., 21 Jan 2026).

4. Empirical Evaluation on Cable Mounting Tasks

TacUMI is validated on a physically and semantically complex cable mounting task involving sequential insertion of a cable into three U-clips with varying orientation. During data collection:

  • The operator holds a TacUMI gripper with F/T and pose sensing in the right hand to apply tension and guide the cable, and a second gripper with tactile and pose sensing in the left hand to grasp and insert it.
  • A third-person camera records RGB observations.

The task is segmented into $C = 5$ phases: idle, grasp cable, apply linear tension, apply torque (clip insertion), and release. Segmentation and boundary detection are evaluated using:

$$\mathrm{Acc} = \frac{\#\,\text{correct frames}}{\#\,\text{total frames}} \times 100\%$$

$$\mathrm{Acc}_{\text{boundaries}} = \frac{\#\,\text{correct transitions}}{\#\,\text{true transitions}} \times 100\%$$
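These two metrics can be computed as below. The paper does not state how a predicted transition is matched to a ground-truth one, so the frame tolerance `tol` and the function names are assumptions of this sketch.

```python
import numpy as np

def frame_accuracy(y_true, y_pred):
    """Frame-wise accuracy in percent."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 100.0 * np.mean(y_true == y_pred)

def boundary_accuracy(y_true, y_pred, tol=5):
    """Percent of ground-truth transitions matched by a predicted transition
    within `tol` frames (tolerance is an assumption, not from the paper)."""
    def transitions(y):
        y = np.asarray(y)
        return np.flatnonzero(np.diff(y) != 0) + 1  # indices where the label changes

    t_true, t_pred = transitions(y_true), transitions(y_pred)
    if len(t_true) == 0:
        return 100.0
    if len(t_pred) == 0:
        return 0.0
    hits = sum(np.any(np.abs(t_pred - t) <= tol) for t in t_true)
    return 100.0 * hits / len(t_true)
```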

Results demonstrate 94.0% frame-wise accuracy (full modalities, bi-directional LSTM backbone) and boundary-detection accuracy exceeding 90% across skill transitions. Data collection via TacUMI is efficient, requiring 1 min 10 s per demonstration compared to 4 min via teleoperation. Cross-platform generalization is demonstrated on teleoperated Franka robots.

5. Core Contributions and Limitations

Key advances of TacUMI include:

  1. Introduction of a handheld UMI-style device featuring synchronized tactile, F/T, and drift-free pose data, as well as a mechanically locked trigger for robot-compatible F/T data.
  2. Development of a segmentation framework fusing vision, tactile, F/T, and pose with three temporal backbones for accurate event segmentation.
  3. Achievement of over 90% segmentation accuracy and efficient demonstration collection, along with validated transfer to robotic execution.

Limitations:

  • Evaluation is currently restricted to a single cable-mounting scenario, with generalizability to additional long-horizon tasks yet to be demonstrated.
  • Rigorous sensor calibration, especially for pose-to-TCP and F/T frame transforms, is necessary; calibration errors can degrade segmentation performance.
  • The "release" phase is very short (2–5 frames), causing boundary ambiguity, and the TCP pose offers limited discriminative information during this phase due to minimal tool motion.

6. Downstream Applications and Future Directions

TacUMI produces segmented, multi-modal demonstrations suitable for a variety of downstream robotics and manipulation tasks, including:

  • Learning from Demonstration (LfD): construction of modular skill primitives for downstream policy learning or diffusion-based planning.
  • Skill Library Development: creation of reusable, contact-rich manipulation primitives leveraging multi-modal embedding representations.
  • Interactive Task Refinement: enabling human annotators to re-demonstrate or correct individual skill modules as opposed to entire trajectories.
  • Industrial and Service Robotics: deployment in manufacturing pipelines (pneumatic/cable wiring, snap-fit insertion) and fine-grained in-hand manipulation.

A plausible implication is that extension and benchmarking of TacUMI on a broader spectrum of manipulation tasks could reveal its scalability and generalization properties, particularly in the context of modular and hierarchical skill learning for robots (Cheng et al., 21 Jan 2026).
