RoVid-X: Robotic Video Dataset
- RoVid-X is a large-scale robotic video dataset comprising 4 million 4–6 second clips with comprehensive annotations for diverse, real-world robotic tasks.
- The dataset is produced via a four-stage automated pipeline that performs video collection, quality filtering, task segmentation, and physical property annotation including optical flow and depth maps.
- RoVid-X integrates with the RBench suite to provide reproducible evaluation metrics that exhibit a high correlation (ρ = 0.96) with human judgments on video realism and task success.
RoVid-X is a large-scale, open-source robotic video dataset designed to advance video generation and analysis tasks for embodied intelligence. Built upon a fully automated four-stage pipeline, RoVid-X comprises 4 million annotated clips covering thousands of real-world robotic tasks, providing comprehensive physical property annotations. Integrated with the RBench evaluation suite, RoVid-X enables robust, reproducible benchmarking of embodied video models with metrics closely aligned to human evaluations (Deng et al., 21 Jan 2026).
1. Dataset Scope and Structure
RoVid-X is, to date, the largest open-source dataset for embodied video generation. Each of its 4 million video clips is 4–6 seconds long (approximately 120 frames) at 720p resolution (1280×720). The dataset is balanced across five principal task domains and four robotic embodiments:
| Task Domain | Clips (Approx.) |
|---|---|
| Common Manipulation | 0.8–1M |
| Long-Horizon Planning | 0.8–1M |
| Multi-Entity Collaboration | 0.8–1M |
| Spatial Relationship | 0.8–1M |
| Visual Reasoning | 0.8–1M |

The four embodiments are single-arm robots, dual-arm robots, humanoid robots, and quadruped robots, each represented across the task domains.
Each clip is thoroughly annotated with:
- Task captions and segment timestamps for every sub-action.
- Dense optical flow fields for each frame (via AllTracker), stored as compressed NumPy arrays.
- Relative depth maps per frame (via Video Depth Anything, in PNG format).
- Frame-level resolution enhancement using FlashVSR for consistent visual fidelity.
The dataset features over 1,300 distinct skills, broad coverage of object categories, and diverse inter-action intervals, reflecting the multifaceted nature of embodied tasks.
2. Four-Stage Data Generation Pipeline
RoVid-X employs an automated, scalable four-stage pipeline to ensure quality, relevance, and comprehensive annotation:
Stage 1: Robot Video Collection
- Sourcing from 20+ open-source embodied datasets and public internet videos.
- Zero-shot GPT-5-based MLLM classifies clips as “robot task” or “irrelevant,” yielding approximately 3 million raw candidate clips.
Stage 2: Video Quality Filtering
- Scene segmentation eliminates non-robotic fragments.
- A composite quality score—taking into account clarity, dynamic range, aesthetic appeal, and OCR legibility—is used; clips below a set quality threshold are discarded.
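The composite score can be sketched as a weighted sum of the four sub-scores. The weight values and helper names below are illustrative assumptions, not values from the paper:

```python
def composite_quality_score(clarity, dynamic_range, aesthetic, ocr_legibility,
                            weights=(0.4, 0.3, 0.2, 0.1)):
    """Weighted combination of per-clip quality signals, each in [0, 1].

    In practice each sub-score would come from a dedicated model
    (e.g. a sharpness estimator, an aesthetic predictor, an OCR pass);
    the weights here are illustrative, not the paper's.
    """
    scores = (clarity, dynamic_range, aesthetic, ocr_legibility)
    return sum(w * s for w, s in zip(weights, scores))


def passes_filter(clip_scores, threshold=0.5):
    """Keep a clip only if its composite score clears the threshold."""
    return composite_quality_score(**clip_scores) >= threshold
```

Clips scoring below the threshold are discarded before the segmentation stage.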
Stage 3: Task Segmentation & Captioning
- Specialized video understanding models divide clips into atomic sub-tasks with precise temporal boundaries.
- MLLM-based captioning extracts (subject, object, action) triples and issues standardized subtitles, e.g., “[00:01–00:02] right gripper → grasp cup.”
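A minimal parser for this standardized subtitle format (assuming the exact bracketed timestamp and arrow layout shown in the example) could look like:

```python
import re

# Matches lines such as "[00:01–00:02] right gripper → grasp cup".
SUBTITLE = re.compile(r"\[(\d{2}:\d{2})–(\d{2}:\d{2})\]\s*(.+?)\s*→\s*(\w+)\s+(.+)")


def to_seconds(ts):
    """Convert an mm:ss timestamp to integer seconds."""
    minutes, seconds = ts.split(":")
    return int(minutes) * 60 + int(seconds)


def parse_subtitle(line):
    """Extract the (subject, action, object) triple and segment boundaries."""
    match = SUBTITLE.match(line)
    if match is None:
        raise ValueError(f"unrecognized subtitle: {line!r}")
    start, end, subject, action, obj = match.groups()
    return {
        "start": to_seconds(start),
        "end": to_seconds(end),
        "subject": subject,
        "action": action,
        "object": obj,
    }
```

This is a sketch against the single example given; the released metadata may encode segments in structured JSON rather than subtitle strings.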
Stage 4: Physical Property Annotation
- FlashVSR upscales frames to 720p.
- AllTracker generates dense optical flow per frame.
- Video Depth Anything produces relative depth maps.
3. Annotation Schemes and Metric Foundations
RoVid-X’s annotations support both descriptive and quantitative evaluation of embodied behaviors, underpinning the RBench metric suite:
- Structural Consistency (“Robot-Subject Stability”): for each sampled keyframe pair, an MLLM issues a JSON grade encoding detected changes in arm count, link length, or topology drift.
- Physical Plausibility (“Physical-Semantic Plausibility”): VQA-style MLLM queries flag floating limbs, interpenetration, or spurious object emergence; the per-clip score is aggregated from these flags.
- Action Completeness (“Task-Adherence Consistency”): the fraction of key prompt sub-actions actually covered by the generated video.
- Motion Metrics:
  - Motion Amplitude (MAS): the flow-compensated, resolution-normalized mean displacement of robot keypoints.
  - All metrics are normalized to a common range for comparability.
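Under one plausible reading of the MAS definition, the score is the mean flow magnitude at robot keypoints divided by the frame diagonal (making it resolution-independent). The function below is an illustrative sketch, not the paper's exact formula:

```python
import numpy as np


def motion_amplitude(flow, keypoints, width, height):
    """Mean per-frame keypoint displacement, normalized by the frame diagonal.

    flow:      (T-1, H, W, 2) dense optical flow (dx, dy) between frame pairs.
    keypoints: (T-1, K, 2) integer (x, y) robot keypoint locations.
    Dividing by the diagonal keeps the score comparable across resolutions.
    """
    diag = np.hypot(width, height)
    per_frame = []
    for t in range(flow.shape[0]):
        xs, ys = keypoints[t, :, 0], keypoints[t, :, 1]
        vecs = flow[t, ys, xs]                      # (K, 2) flow at keypoints
        per_frame.append(np.linalg.norm(vecs, axis=-1).mean())
    return float(np.mean(per_frame) / diag)
```

The flow-compensation step (removing camera motion before measuring keypoint displacement) is omitted here for brevity.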
Formats for annotation storage are:
- Video files: `videos/720p-clips/*.mp4`
- Metadata: `metadata/*.json` (task and segment structure)
- Optical flow: `flow/*.npy`
- Depth maps: `depth/*.png` (one per frame)
RoVid-X does not annotate forces or contact points directly, but the combination of depth and flow fields allows for downstream estimation of such physical properties.
4. RBench Integration and Evaluation Correlation
RoVid-X is foundational to RBench, the benchmark that assesses robot video generation across the five task domains and four embodiments. RBench’s metrics leverage RoVid-X annotations to provide reproducible, fine-grained evaluation of physical realism and task success.
An empirical study demonstrates that RBench’s automated scores exhibit a Spearman correlation coefficient of ρ = 0.96 relative to human rankings. This high alignment supports the validity of RBench’s metrics as proxies for human-perceived physical realism and task correctness (Deng et al., 21 Jan 2026).
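Spearman's ρ is simply the Pearson correlation of rank-transformed scores; a tie-free sketch of the computation used for such metric-vs-human comparisons:

```python
import numpy as np


def spearman_rho(a, b):
    """Spearman rank correlation for two score lists without ties."""
    ra = np.argsort(np.argsort(a)).astype(float)   # rank of each element
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra ** 2).sum() * (rb ** 2).sum()))
```

For tied scores, `scipy.stats.spearmanr` handles average ranks and also returns a p-value.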
5. Experimental Findings and Best Practices
RoVid-X and RBench have jointly illuminated several performance patterns among contemporary video generation models:
- Physical-World Gaps: State-of-the-art engines frequently generate:
- Floating or interpenetrating limbs.
- Gripper morphologies that change unrealistically during sequences.
- Sequences that omit key sub-actions, especially in long-horizon planning.
- Cognitive Bottlenecks: Performance on visual reasoning—including attribute ordering and logic over colors/numbers—lags (average success ≈ 30%).
- Embodiment Bias: Quadruped locomotion is more tractable (≈ 70% video quality) than fine-grained single-arm manipulation (≈ 50%).
The following practices are recommended:
- Employ depth and flow supervision to enforce real-world geometric and motion priors.
- Integrate contact-point or grasp-stability losses, potentially deriving contact labels from flow and depth fields.
- Apply data augmentation using synthetic physics engines to broaden the spectrum of physically valid interactions.
- Co-train with inverse dynamics models (IDM) to translate generated sequences into executable robotic actions.
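The contact-label recommendation can be approximated with a simple heuristic: an object is likely in contact when it moves together with the gripper. The masks, tolerance, and function name below are illustrative assumptions, not part of the released tooling:

```python
import numpy as np


def pseudo_contact_labels(flow, gripper_mask, object_mask, tol=0.5):
    """Heuristic per-frame contact labels from dense optical flow.

    A frame is labeled "in contact" when the gripper is moving and the
    object's mean flow vector closely matches the gripper's, i.e. the
    object travels with the gripper. `tol` is in pixels of displacement.
    This is a proxy for use when no force/contact sensing is available.
    """
    labels = []
    for t in range(flow.shape[0]):
        g = flow[t][gripper_mask].mean(axis=0)
        o = flow[t][object_mask].mean(axis=0)
        moving = np.linalg.norm(g) > tol          # skip static frames
        labels.append(bool(moving and np.linalg.norm(g - o) < tol))
    return labels
```

Depth maps can tighten this heuristic further by requiring the gripper and object to lie at similar depths before declaring contact.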
6. Data Access, Formats, and Licensing
RoVid-X is distributed under the Creative Commons Attribution 4.0 (CC-BY-4.0) license, supporting broad usage. Data and code are openly available.
Standardized file formats facilitate integration:
- Videos: 720p `.mp4`
- Metadata: `.json` with segment structure
- Flow: `.npy` (NumPy arrays)
- Depth: `.png` (one per frame)
RoVid-X’s pipeline, scope, and refined annotations collectively establish a robust foundation for research in embodied video generation, evaluation, and downstream robotic cognition (Deng et al., 21 Jan 2026).