ManiFlow-110k: 3D Flow for Manipulation
- ManiFlow-110k is a large-scale dataset that offers frame-aligned 3D optical flow sequences synthesized from human and robotic manipulation videos, forming the backbone for cross-embodiment visuomotor training.
- It employs a fully automated pipeline combining moving-object detection, dense 2D flow computation, and 3D projection using depth maps and camera intrinsics to accurately model manipulation tasks.
- Experimental evaluations show that training with ManiFlow-110k significantly boosts real-robot action policy performance (e.g., a 70% success rate) compared to 2D-only flow datasets.
ManiFlow-110k is a large-scale, automatically constructed dataset comprising 110,000 sequences of 3D optical flow for robotic and human manipulation, developed as the backbone for cross-embodiment visuomotor policy training and world modeling. Synthesized from publicly available manipulation video corpora, ManiFlow-110k offers frame-aligned 3D flow fields indexed by language instructions, supporting the development of action policies and generative models that unify visual, language, and physical reasoning across diverse robot platforms and environments (Zhi et al., 6 Jun 2025).
1. Dataset Synthesis and Pipeline
ManiFlow-110k is assembled via a fully automated pipeline operating on raw RGB video clips containing manipulation tasks performed by humans or robots. The pipeline proceeds as follows:
1. Moving-Object Auto-Detection:
- Gripper segmentation is performed on the initial frame using a pre-trained Grounding-SAM2 model, yielding a binary gripper mask.
- Point sampling and tracking: Points are uniformly sampled in the initial frame, excluding those covered by the gripper mask. These points are tracked across frames using Co-tracker3 to generate trajectories. Points whose total displacement exceeds a pixel threshold are designated as "moving" and clustered to define the manipulated object's region of interest (ROI) via a minimal axis-aligned bounding box.
2. Optical Flow and 3D Projection:
- Within the ROI, dense 2D flow fields are computed using Co-tracker3.
- Global camera motion is compensated using the MAGMA method as needed.
- Per-frame depth is predicted via DepthAnythingV2.
- For each pixel $(u, v)$ with predicted depth $d$, the lifting map back-projects image coordinates into 3D camera space using the known intrinsics $K$, i.e. $\mathbf{X} = d\,K^{-1}[u, v, 1]^\top$; the 3D optical flow vector is the difference between the lifted start point and the lifted tracked 2D endpoint.
3. Data Representation:
Each frame is represented as a four-channel tensor concatenating the two image-space displacement components, depth, and a binary visibility indicator.
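The detection-and-lifting procedure above can be sketched in a few lines of Python. The function names and the displacement threshold here are illustrative stand-ins; the actual pipeline relies on Grounding-SAM2, Co-tracker3, and DepthAnythingV2 rather than hand-rolled code:

```python
import math

def lift_to_3d(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth d into camera space
    using the standard pinhole model: X = d * K^{-1} [u, v, 1]^T."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def moving_points(tracks, threshold_px=2.0):
    """Select tracked points whose total 2D displacement exceeds a pixel
    threshold (the paper's exact threshold is not restated here).
    `tracks` maps point id -> list of (u, v) positions over frames."""
    moving = {}
    for pid, path in tracks.items():
        (u0, v0), (u1, v1) = path[0], path[-1]
        if math.hypot(u1 - u0, v1 - v0) > threshold_px:
            moving[pid] = path
    return moving

def roi_bbox(points):
    """Minimal axis-aligned bounding box around the moving points."""
    us = [u for u, _ in points]
    vs = [v for _, v in points]
    return (min(us), min(vs), max(us), max(vs))

def flow_3d(p_start, p_end, d_start, d_end, intrinsics):
    """3D optical flow = lifted tracked endpoint minus lifted start point."""
    a = lift_to_3d(*p_start, d_start, *intrinsics)
    b = lift_to_3d(*p_end, d_end, *intrinsics)
    return tuple(e - s for s, e in zip(a, b))
```

With intrinsics $(f_x, f_y, c_x, c_y)$ known, the 3D flow of each ROI pixel follows directly from the depth at its start and end positions.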
The pipeline is applied across six public datasets (BridgeV2, ScalingRobot, Droid, RH20t, Libero, Agibot), harvesting 110,000 unique manipulation clips.
2. Dataset Composition and Modalities
ManiFlow-110k comprises:
- Total clips: 110,000, each containing up to 16 frames.
- Spatial resolution: fixed after cropping.
- Object and task variability: More than 100 object shapes (cups, pens, drawers, etc.) across over 20 task templates (pouring, inserting, hanging, opening, grasping).
- Environmental diversity: Laboratory, kitchen, and office settings captured from both human and robot perspectives.
Modalities:
- Raw RGB video at 30 fps.
- Depth maps per frame.
- Reconstructed 3D point clouds.
- Dense ground-truth 3D optical flow fields.
Clips are paired with templated language instructions tokenized with a CLIP text encoder. Pre-training data augmentation includes random horizontal flips, minor camera rotations, and additive Gaussian noise on depth. Clips with insufficient moving-point coverage or severe occlusion are discarded. All flow and 3D values are normalized via scene-radius scaling, and depth is standardized per clip.
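The normalization step can be sketched as follows, under the assumption that "scene radius scaling" means dividing centered 3D coordinates by the maximum point distance from the scene centroid (a plausible reading; the source's exact formulation may differ):

```python
def normalize_flow(points, flows):
    """Center 3D points on the scene centroid and scale points and flows
    by the scene radius (max distance of any point from the centroid),
    so all points fall within a unit-radius ball."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    cz = sum(p[2] for p in points) / n
    radius = max(
        ((p[0] - cx) ** 2 + (p[1] - cy) ** 2 + (p[2] - cz) ** 2) ** 0.5
        for p in points
    ) or 1.0  # guard against a degenerate single-point scene
    norm_pts = [((p[0] - cx) / radius, (p[1] - cy) / radius,
                 (p[2] - cz) / radius) for p in points]
    norm_flows = [(f[0] / radius, f[1] / radius, f[2] / radius)
                  for f in flows]
    return norm_pts, norm_flows

def standardize_depth(depths):
    """Per-clip depth standardization: zero mean, unit variance."""
    n = len(depths)
    mean = sum(depths) / n
    var = sum((d - mean) ** 2 for d in depths) / n
    std = var ** 0.5 or 1.0
    return [(d - mean) / std for d in depths]
```

Scaling flows by the same radius as points keeps displacement magnitudes consistent with the normalized geometry.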
3. Quality Evaluation and Metrics
ManiFlow-110k’s construction method yields measurable geometric and task-level fidelity:
- Moving-object detection accuracy: On BridgeV2 annotations, the auto-detect pipeline achieves >80% average precision at IoU=0.5.
- 3D flow accuracy: On synthetic sequences with ground-truth motion, the mean endpoint error (EPE), measured in meters in 3D and in pixels in 2D, improves on the DROID baseline.
- Downstream utility: Training a flow-world model on ManiFlow-110k yields a 70% success rate on four real-robot tasks, compared to 25% with 2D-only flow datasets (Im2Flow2Act), indicating substantial benefit for action policy learning.
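The detection and flow metrics above reduce to standard computations, sketched here with illustrative helper names:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1).
    A detection counts as correct at the IoU = 0.5 threshold used above."""
    x0 = max(box_a[0], box_b[0])
    y0 = max(box_a[1], box_b[1])
    x1 = min(box_a[2], box_b[2])
    y1 = min(box_a[3], box_b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_epe(pred, gt):
    """Mean endpoint error: average Euclidean distance between predicted
    and ground-truth flow endpoints (in meters for 3D, pixels for 2D)."""
    total = 0.0
    for p, g in zip(pred, gt):
        total += sum((pi - gi) ** 2 for pi, gi in zip(p, g)) ** 0.5
    return total / len(pred)
```

Average precision at IoU = 0.5 then follows by ranking detections by confidence and marking each as a true or false positive against these IoU scores.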
4. Role in Cross-Embodiment Manipulation Learning
ManiFlow-110k forms the backbone of cross-embodiment flow world modeling (Zhi et al., 6 Jun 2025). The dataset enables the training of models that predict future 3D motion of objects, providing an embodiment-agnostic signal applicable to both human and robot agents. In the referenced framework, a video diffusion model conditioned on initial RGB frames, CLIP-encoded text, and sparse initial 3D points synthesizes 3D optical flow trajectories for manipulation tasks.
The predicted 3D object flow is integrated into a flow-guided rendering mechanism, allowing assessment of task satisfaction via GPT-4o. Subsequently, the generated flow fields serve as constraints for optimization-based action planning, closing the loop between high-level task instruction and low-level action selection.
5. Technical Integration and Training Specifications
The video diffusion model utilizing ManiFlow-110k employs:
- Conditioning: CLIP-Vision and CLIP-Text encodings for the initial frame and language prompt, respectively; sinusoidally encoded sparse 3D points.
- Model architecture: AnimateDiff variant—a U-Net-based diffusion backbone with LoRA adapters to preserve StableDiffusion pretraining.
- Training objective: Standard denoising diffusion loss,
$$\mathcal{L} = \mathbb{E}_{x_0,\,\epsilon \sim \mathcal{N}(0, I),\,t}\left[\lVert \epsilon - \epsilon_\theta(x_t, t, c) \rVert_2^2\right],$$
where $c$ denotes the conditioning inputs, with flow fields fed to the model directly rather than through a VAE latent space.
- Hyperparameters: batch size 512, 500 epochs, AdamW optimizer (weight decay = 0.01), distributed across V100 GPUs for approximately two days.
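The sparse-point conditioning can be sketched with a generic transformer-style sinusoidal encoding applied per axis; the embedding dimensionality and frequency base below are illustrative assumptions, not values from the source:

```python
import math

def sinusoidal_encode(value, dim=8):
    """Encode one scalar coordinate with interleaved sin/cos terms at
    geometrically spaced frequencies, as in transformer positional
    encodings. `dim` must be even; base 10000 is assumed."""
    enc = []
    for i in range(dim // 2):
        freq = 10000.0 ** (-2 * i / dim)
        enc.append(math.sin(value * freq))
        enc.append(math.cos(value * freq))
    return enc

def encode_point(xyz, dim=8):
    """Concatenate per-axis encodings for one sparse 3D point,
    yielding a 3 * dim feature vector for conditioning."""
    return [e for c in xyz for e in sinusoidal_encode(c, dim)]
```

These fixed-frequency features give the diffusion backbone a smooth, scale-aware representation of the initial 3D points alongside the CLIP image and text embeddings.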
This pipeline enables large-scale, robust, and platform-agnostic training for manipulation policies that generalize to unseen objects, backgrounds, and robotic embodiments.
6. Comparison with ManiFlow Policy Scaling
While the ManiFlow policy (Yan et al., 1 Sep 2025) does not directly introduce or leverage ManiFlow-110k, its design—single-stage joint flow matching plus consistency training, DiT-X backbone with adaptive cross-attention, and AdaLN-Zero conditioning—can operate at the same scale as ManiFlow-110k. ManiFlow is shown to monotonically improve performance as the demonstration count increases, with success rates on the "lift pot" RoboTwin task rising from 3.7% at 10 demos to 99.7% at 500. This suggests that scaling to 110,000 trajectories (the scale of ManiFlow-110k) would yield near-ceiling performance across simulation and real-robot benchmarks, with substantially enhanced zero-shot generalization (Yan et al., 1 Sep 2025).
7. Significance and Future Directions
ManiFlow-110k represents a shift toward large, diverse, and automatically annotated 3D motion datasets for robot skill acquisition. By grounding manipulation learning in object-centric 3D flow conditioned on human-understandable instructions, it facilitates robust, transferable world models that bridge simulation and real-world generalization. The scalability and automation of its construction suggest increased feasibility of joint human–robot modeling, cross-platform policy deployment, and closed-loop planning that incorporates task- and embodiment-agnostic cues.
A plausible implication is that further expansion of such datasets, or integration with active policy learning in the loop, could enable highly-flexible, general-purpose robotic agents capable of complex reasoning over manipulation tasks across varied domains.