
CoMoVi Dataset Overview

Updated 22 January 2026
  • The CoMoVi Dataset is a large-scale resource of 54,053 high-resolution video clips annotated with synchronized 3D human motion and textual descriptions.
  • A multi-stage pipeline, combining multimodal filtering and human-tracking checks, ensures consistent resolution and frame rate and accurate pose annotation.
  • The dataset supports joint text-to-video and text-to-motion synthesis research by providing detailed, single-subject clips well suited to training dual-branch VDMs with cross-attention.

The name "CoMoVi Dataset" refers to one of two independent research datasets in the contemporary motion-analysis and video-generation literature: (1) the large-scale human motion/video dataset introduced in "CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos" (Zhao et al., 15 Jan 2026), and (2) the "CoMo" compositional motion customization benchmark, sometimes informally referred to as "CoMoVi" ("CoMo: Compositional Motion Customization for Text-to-Video Generation", Xu et al., 27 Oct 2025). These two resources address different but occasionally overlapping domains (a high-fidelity paired text/motion/video corpus versus a compositional multi-motion video synthesis evaluation suite), so precise terminology is necessary. Below, the primary focus is the CoMoVi Dataset of (Zhao et al., 15 Jan 2026), with editorial cross-references to the compositional "CoMo/CoMoVi" benchmark (Xu et al., 27 Oct 2025) to clarify potential ambiguities in citation and usage.

1. Dataset Composition and Structure

The CoMoVi Dataset (Zhao et al., 15 Jan 2026) comprises 54,053 high-resolution video clips, each 81 frames long (5.06 seconds at 16 fps), with all frames uniformly rescaled to 704×1280 RGB. Source videos are at least 720p; frame rates and container formats are standardized (mp4/H.264 encoding), making the clips directly compatible with Video Diffusion Model (VDM) pipelines. The corpus represents a wide range of real-world human motions, including but not limited to locomotion (walking, running), full-body gestures (jumping, stretching), everyday activities (sitting, standing), and complex transitional or composite poses. Each clip isolates a single human subject with the full body visible, and intentionally excludes interaction with scene objects.
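The published clip specification can be captured as a few constants; the helper name below is illustrative, not part of any released API:

```python
# Nominal CoMoVi clip specification as described in the paper.
FRAMES_PER_CLIP = 81
FPS = 16
WIDTH, HEIGHT = 704, 1280  # all frames rescaled to 704x1280 RGB

def clip_duration_s(n_frames: int = FRAMES_PER_CLIP, fps: int = FPS) -> float:
    """Duration of one clip in seconds."""
    return n_frames / fps

print(round(clip_duration_s(), 2))  # 5.06
```

The 5.06 s figure quoted in the paper is exactly 81 frames at 16 fps (5.0625 s, rounded).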

No explicit taxonomy or detailed motion class histogram is published. All videos are organized such that each is accompanied by paired textual and 3D motion annotations, meeting the requirements for both video-based and pose-based generative modeling.

2. Collection Pipeline and Pre-Processing

Data collection proceeds through a multi-stage, multimodal filtering and annotation protocol. Source material includes Koala-36M [public], HumanVid [public], and additional internet videos. The pipeline is as follows:

  • Stage 1: Multimodal Filtering
    • Dense captions are autogenerated for every candidate video.
    • Qwen3 LLM evaluates captions, enforcing: human subject, single person, continuous full-body visibility, absence of required object interactions.
    • Visual filtering is then applied: Qwen2.5-VL inspects keyframes to reject non-conforming clips (e.g., children, cartoons, presentations).
  • Stage 2: Human-Tracking Filtering
    • Surviving videos are segmented into non-overlapping 5-second clips (up to 2 per video).
    • YOLO + ViTPose are run per frame to ensure reliable, continuous full-body detection; clips lacking sufficient confidence are discarded.
  • Stage 3: Video Captioning
    • Gemini-2.5-Pro generates appearance and motion captions from frames sampled at 1 fps, then merges them into a single clip-level textual description.
    • The captioning prompt mandates a one-sentence report that situates clothing, appearance, and dynamic human movement.
  • Stage 4: 3D Motion Annotation
    • CameraHMR estimates pseudo-ground-truth SMPL pose (and shape) for each frame.
    • Shape is fixed to the first frame for temporal consistency, with pose temporal smoothing performed in Blender to reduce high-frequency jitter.
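The paper performs pose temporal smoothing in Blender; a generic stand-in for that step is a moving average along the time axis, sketched below (the function name, window size, and padding scheme are assumptions, not the authors' implementation):

```python
import numpy as np

def smooth_poses(poses: np.ndarray, window: int = 5) -> np.ndarray:
    """Reduce high-frequency jitter in per-frame pose data with a simple
    moving average over time (axis 0). `window` should be odd.

    poses: array of shape (T, ...), e.g. (81, 24, 3) per-frame SMPL joints.
    """
    if window < 2:
        return poses.copy()
    kernel = np.ones(window) / window
    pad = window // 2
    # Repeat the first/last frame so the output keeps the same length T.
    padded = np.concatenate(
        [poses[:1].repeat(pad, axis=0), poses, poses[-1:].repeat(pad, axis=0)],
        axis=0,
    )
    flat = padded.reshape(padded.shape[0], -1)
    smoothed = np.stack(
        [np.convolve(flat[:, j], kernel, mode="valid") for j in range(flat.shape[1])],
        axis=1,
    )
    return smoothed.reshape(poses.shape)
```

Edge frames are handled by repeating the boundary pose, which avoids shrinking the 81-frame clips.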

Additional quality control steps include thresholding by 2D keypoint confidence, human-in-the-loop prompt engineering to enforce single-subject, full-body consistency, and the removal of samples not meeting annotation, resolution, or visibility requirements.
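The keypoint-confidence thresholding above could be sketched as follows; the exact threshold and the required fraction of valid frames are not published, so the values here are placeholders:

```python
import numpy as np

# Hypothetical thresholds; the paper does not publish the exact values.
MIN_CONF = 0.5            # minimum per-keypoint 2D detection confidence
MIN_VALID_FRACTION = 0.9  # assumed fraction of frames that must pass

def clip_passes_tracking(keypoint_conf: np.ndarray,
                         min_conf: float = MIN_CONF,
                         min_valid_fraction: float = MIN_VALID_FRACTION) -> bool:
    """Accept a clip only if enough frames have every keypoint above threshold.

    keypoint_conf: (T, K) per-frame, per-keypoint confidences (e.g. from ViTPose).
    """
    frame_ok = (keypoint_conf >= min_conf).all(axis=1)
    return bool(frame_ok.mean() >= min_valid_fraction)
```

A clip with intermittent occlusion of the subject would fail this check and be discarded, matching the pipeline's full-body-visibility requirement.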

3. Annotation Schema

Each clip is annotated with:

  • Text annotations: One short, English free-form sentence describing both static appearance and dynamic action, e.g., “A man wearing a red shirt stands up from a chair and raises both hands overhead.” Captions are generated automatically (no human hand-labeling reported), referencing a fixed prompt template detailed in supplementary materials.
  • 3D motion (SMPL) annotations: Per-frame pseudo-ground-truth SMPL estimates (24 joints, camera-centered 3D locations, fixed shape vector), temporally smoothed. Final storage is likely a per-clip bundle containing:
    • The video clip (mp4/H.264)
    • Caption (text)
    • Framewise SMPL joint positions, e.g., NumPy array of shape (81, 24, 3)

No explicit directory structure or file format specifiers are included in the main paper.
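Since no directory layout is specified, any loader must assume one. The sketch below uses placeholder file names (`video.mp4`, `caption.json`, `joints.npy`) that are NOT confirmed by the paper, and simply checks the (81, 24, 3) joint shape implied by the annotation schema:

```python
import json
from pathlib import Path

import numpy as np

def load_clip_bundle(clip_dir: str) -> dict:
    """Load one hypothetical per-clip bundle.

    Assumed contents (file names are placeholders):
      video.mp4    - 81-frame H.264 clip
      caption.json - {"caption": "..."}
      joints.npy   - float array of shape (81, 24, 3)
    """
    d = Path(clip_dir)
    joints = np.load(d / "joints.npy")
    assert joints.shape == (81, 24, 3), "framewise SMPL joint positions"
    caption = json.loads((d / "caption.json").read_text())["caption"]
    return {"video": d / "video.mp4", "caption": caption, "joints": joints}
```

Once the official release fixes the directory structure, only the path constants in such a loader should need changing.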

4. Dataset Splits and Statistics

Fixed train/val/test splits or category-level breakdowns are not provided. Published experiments train on the full set of ≈54,000 clips and report evaluation on a private, unreleased holdout set. There are no special “hard” splits or task-specific partitions described.

Statistical distributions for caption length, vocabulary, or detailed pose categories are not published. The only published aggregate statistics are total clip count (54,053), fixed per-clip duration (5.06 s), and total data volume (≈76 h of footage).
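The published aggregates are mutually consistent, as a quick cross-check shows:

```python
# Cross-check: 54,053 clips x 5.0625 s/clip ~= 76 h of footage.
N_CLIPS = 54_053
SECONDS_PER_CLIP = 81 / 16  # 5.0625 s

total_hours = N_CLIPS * SECONDS_PER_CLIP / 3600
print(round(total_hours, 1))  # 76.0
```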

5. Licensing, Data Access, and Usage Terms

The project page is https://igl-hkust.github.io/CoMoVi/. The authors commit to public release of code and data, but as of the referenced publication, a precise license has not been finalized. Use is described as "for research purposes only"; privacy is protected by withholding personal metadata and releasing only video identifiers.

Dataset users must cite the canonical paper (“CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos”, Zhao et al., 15 Jan 2026).

6. Evaluation Protocols and Downstream Applications

The main methodological innovation of (Zhao et al., 15 Jan 2026) is the coupling of 3D human motion and video generative models via dual-branch VDMs with 3D-2D cross-attention. The dataset, by providing highly synchronized 2D video, 3D pose, and textual descriptions, supports:

  • End-to-end training and evaluation of joint text-to-video and text-to-motion generative models.
  • Rigorous assessment of plausible video and pose synthesis, leveraging paired ground-truth for both modalities.
  • Studies into the coupling of appearance, motion realism, and captioning in challenging, naturalistic, and constrained human actions.
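The 3D-2D cross-attention coupling named above can be illustrated with a minimal single-head sketch, in which video-branch tokens attend to motion-branch tokens. The random projections and dimensions below are illustrative only; the actual dual-branch VDM learns these weights:

```python
import numpy as np

def cross_attention(q_2d: np.ndarray, kv_3d: np.ndarray,
                    d_head: int = 64) -> np.ndarray:
    """Minimal single-head cross-attention: 2D (video) tokens query
    3D (motion) tokens. Projections are random here; a real model learns them.

    q_2d:  (N_video_tokens, d_model) video-branch features
    kv_3d: (N_motion_tokens, d_model) motion-branch features
    """
    rng = np.random.default_rng(0)
    d_model = q_2d.shape[1]
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
                  for _ in range(3))
    q, k, v = q_2d @ Wq, kv_3d @ Wk, kv_3d @ Wv
    scores = q @ k.T / np.sqrt(d_head)          # (N_video, N_motion)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ v                           # (N_video, d_head)
```

In the dual-branch setting, this operation lets each video token aggregate pose information, synchronizing appearance with 3D motion during generation.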

A closely related benchmark is the “CoMo”/“CoMoVi” compositional motion customization dataset (Xu et al., 27 Oct 2025), which differs notably: it emphasizes compositional multi-motion synthesis and introduces specialized evaluation metrics for motion-appearance disentanglement and blending (e.g., Crop-and-Compare error, TextSim, MotionFid, TempConsistency). It is, however, much smaller in scale and is built for benchmarking plug-and-play compositional video generation rather than 3D human motion analysis. Researchers should distinguish clearly between these two datasets in citation and application.

Care must be taken with the term “CoMoVi Dataset,” as it can refer to either (Xu et al., 27 Oct 2025) or (Zhao et al., 15 Jan 2026). The former, sometimes called “CoMo” or the “CoMoVi benchmark,” focuses on compositional text-to-video generation, using LoRA-based motion/appearance decoupling, small-scale Internet video clips (often dozens or hundreds), and fine-grained evaluation of compositional synthesis (e.g., mixing a “monkey dancing” and a “lion burpee” in the same frame). The latter, described above, is a large-scale, single-subject, 3D-pose-annotated video corpus intended for synchronous video and motion synthesis.

Both datasets will be available for download from their respective project pages, with code and pretrained weights referenced in their papers. Each requests citation of the corresponding arXiv publication upon use.


For precise technical details and further implementation guidance, consult the official repositories accompanying each respective paper.

