LoomData-8.7k: Spatial-Temporal Video Dataset

Updated 19 January 2026
  • LoomData-8.7k is a comprehensive human-centric video dataset featuring detailed frame-wise object masks and temporally grounded action captions.
  • It employs a hybrid shot detection pipeline with PySceneDetect and KTS, ensuring robust segmentation and high-quality annotations across 1,456 ActivityNet videos.
  • Its unified JSON Lines format and meticulous annotation enable advanced training of video-language models with precise spatial and temporal reasoning.

LoomData-8.7k is a human-centric video dataset curated for joint spatial-temporal video understanding, providing dense, frame-wise object masks and temporally grounded, natural-language action captions. Released as part of the VideoLoom project, it serves as a large-scale source of training data, supporting the development of large video-LLMs with fine-grained spatial and temporal reasoning capabilities. The corpus consists of 1,456 ActivityNet training videos, encompassing 8,710 carefully segmented "shots" (person-centered video segments), and offers a unified annotation format compatible with modern video modeling pipelines (Shi et al., 12 Jan 2026).

1. Dataset Construction and Structure

LoomData-8.7k originates from the ActivityNet training split, with scenes partitioned via a hybrid application of PySceneDetect and KTS to detect scene and event boundaries. Key steps in the construction pipeline include merging adjacent shots shorter than 1 s and discarding videos with more than 10 shots, yielding a filtered cohort of 1,456 videos. On average, each video is 102.2 seconds long, corresponding to roughly 3,066 frames at 30 FPS. Each video is divided into an average of 6.0 shots (σ ≈ 1.2), with each shot spanning a mean of 15.0 s (σ ≈ 9.0); the shot-length distribution is heavy-tailed, with over 50% of shots lying within [5, 15] seconds.
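The merge-and-filter rules above can be sketched in a few lines. The 1 s merge threshold and 10-shot cap come from the text; the function names and the (start, end) tuple representation of shots are illustrative assumptions:

```python
def merge_short_shots(shots, min_len=1.0):
    """Merge any shot shorter than min_len seconds into its predecessor.

    `shots` is a list of (start, end) tuples in seconds, sorted by start time.
    """
    merged = []
    for start, end in shots:
        if merged and (end - start) < min_len:
            prev_start, _ = merged[-1]
            merged[-1] = (prev_start, end)  # absorb the short shot
        else:
            merged.append((start, end))
    return merged


def keep_video(shots, max_shots=10):
    """Videos segmented into more than max_shots shots are discarded."""
    return len(shots) <= max_shots
```

For example, a 0.5 s shot at (5.0, 5.5) is absorbed into the preceding shot, giving boundaries (0.0, 5.5).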

All annotation and preprocessing steps follow programmatic, template-driven automation, punctuated by multi-pass human verification for appearance and mask quality.

2. Annotation Workflow and Quality Assurance

Annotations in LoomData-8.7k are generated through a four-stage, predominantly automatic pipeline, interleaved with targeted manual quality control:

  1. Shot Partition: PySceneDetect and KTS delineate scene and event boundaries. Adjacent shots with <1 s duration are merged, and shots are further refined to avoid videos with excessive segmentation.
  2. Spatial Mask Annotation:
    • The main character in each shot’s center frame is detected using GroundingDINO, prioritizing the highest-scoring “person” bounding box.
    • Pix2Cap produces a textual appearance description for the detected figure.
    • SAM2 tracks this subject bidirectionally, delivering dense, per-frame mask tracklets.
    • If tracklets are missing, the process is repeated with GroundingDINO, using the appearance description as the detection prompt.
    • Manual screening eliminates untrackable or incorrectly masked segments.
  3. Shot Merging: Shots with overlapping tracklets and consistent character identity (across different camera angles) are merged for temporal consistency.
  4. Temporal Action Annotation:
    • Videos are sampled at 2 FPS; frames receive unique numeric IDs through NumPro.
    • Set-of-Marks overlays these IDs onto mask regions.
    • Gemini 2.5 Pro ingests prompt-structured frames to generate detailed, timestamp-aligned natural language action captions.
    • Manual curation excises segments with missing or erroneous masks or IDs.

This pipeline ensures every segment is covered by a dense per-frame mask (≈450 per shot) and a high-fidelity, grounded action caption.
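The 2 FPS sampling in stage 4 over a 30 FPS source implies a fixed stride of 15 frames. A minimal sketch of this bookkeeping, assuming the NumPro-style numeric IDs correspond to source frame indices:

```python
def sampled_frame_ids(start_frame, end_frame, src_fps=30, sample_fps=2):
    """Return the source-frame indices retained when sampling at sample_fps.

    At a 30 FPS source sampled at 2 FPS, every 15th frame is kept.
    """
    stride = src_fps // sample_fps  # 15 for 30 FPS -> 2 FPS
    return list(range(start_frame, end_frame + 1, stride))
```

For a 15 s shot spanning frames 900 to 1350, this yields 31 sampled frames, each of which receives a numeric ID overlaid on its mask region.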

3. Annotation Schema and Data Format

Each video is annotated in JSON Lines (.jsonl), one object per shot. Entry fields include:

  • video_id: Unique ActivityNet video identifier.
  • fps: Nominal 30 FPS, matching source material.
  • duration: Video duration in seconds.
  • segments: List containing, per shot:
    • shot_id: Zero-based shot index.
    • start_time, end_time: Shot temporal boundaries (seconds).
    • start_frame, end_frame: Frame-level shot boundaries.
    • action_caption: Timestamp-aligned, frame-indexed narrative generated by Gemini 2.5 Pro, referencing the person’s actions.
    • masks: Per-frame mask dicts {frame_id, mask_path, bbox}, with binary PNG masks (1,024 × 1,024 resolution) and bounding boxes in absolute pixel coordinates.

An example entry:

{
  "video_id": "v_000045",
  "duration": 98.3,
  "fps": 30,
  "segments": [
    {
      "shot_id": 2,
      "start_time": 30.0,
      "end_time": 45.0,
      "start_frame": 900,
      "end_frame": 1350,
      "action_caption": "Frame 901: the person wearing a blue jacket opens the refrigerator; … Frame 1350: the person closes the door and turns around.",
      "masks": [
        {"frame_id": 900, "mask_path": ".../900.png", "bbox": [234, 120, 312, 482]},
        ⋮
        {"frame_id": 1350, "mask_path": ".../1350.png", "bbox": [228, 115, 315, 480]}
      ]
    }
  ]
}
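Since each shot is one JSON object per line, records can be consumed directly with the standard json module. The snippet below uses the example entry above (masks truncated for brevity) to derive per-shot length and frame count:

```python
import json

# One JSONL record, abbreviated to the fields used below.
line = ('{"video_id": "v_000045", "duration": 98.3, "fps": 30, "segments": '
        '[{"shot_id": 2, "start_time": 30.0, "end_time": 45.0, '
        '"start_frame": 900, "end_frame": 1350, "masks": []}]}')

record = json.loads(line)
for seg in record["segments"]:
    n_frames = seg["end_frame"] - seg["start_frame"] + 1  # 451 masked frames
    length_s = seg["end_time"] - seg["start_time"]        # 15.0 s shot
```

The inclusive frame range gives 451 per-frame masks for this shot, matching the ≈450 masks per shot quoted above.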

The dataset focuses exclusively on solo "person" instances per shot. Actions are described using an open vocabulary, spanning approximately 200 unique verbs (Shi et al., 12 Jan 2026).

4. Dataset Statistics and Characteristics

Summarized attributes of LoomData-8.7k:

| Property | Value/Range | Notes |
|---|---|---|
| Videos | 1,456 | ActivityNet train split (after filtering) |
| Shots | 8,710 | Average 6.0 per video (σ ≈ 1.2) |
| Shot length | μ = 15.0 s, σ ≈ 9.0 | Median ≈ 13.2 s; range [1, 45] s, heavy-tailed |
| Masks per shot | ≈450 | Every frame masked at 30 FPS |
| Total duration | ≈41.3 h | 102.2 s/video on average |
| Caption length | μ = 41.3 words | Templated, timestamp-aligned narration |
| Objects per segment | 1 ("person") | One tracked identity per shot |
| Action verbs | ≈200 | Open-vocabulary, naturalistic action phrases |
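The headline figures are mutually consistent, which a quick back-of-the-envelope check confirms:

```python
videos, shots = 1456, 8710
mean_video_s, mean_shot_s, fps = 102.2, 15.0, 30

shots_per_video = shots / videos              # ≈ 5.98, i.e. the reported 6.0
masks_per_shot = mean_shot_s * fps            # 15 s x 30 FPS = 450 masks
total_hours = videos * mean_video_s / 3600    # ≈ 41.3 h of video in total
```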

Shot centers are nearly uniformly distributed within each video. Mask annotations are dense, while action descriptions are designed to be referentially precise and temporally grounded (e.g., referencing objects by color or clothing).

5. Benchmarking, Evaluation Metrics, and Usage

LoomData-8.7k is employed entirely as a training resource for VideoLoom, without its own dedicated held-out validation/test split. Evaluation occurs using a separate ActivityNet-based LoomBench benchmark. LoomData-8.7k enables the VideoLoom model to support:

  • Referring video object segmentation, with region similarity $J$ and contour accuracy $F$:
    • $J(P, G) = \frac{|P \cap G|}{|P \cup G|}$
    • $F(P, G) = \frac{2|b(P) \cap b(G)|}{|b(P)| + |b(G)|}$, where $b(\cdot)$ denotes the mask boundary
    • $J\&F = (J + F)/2$
  • Temporal grounding, using temporal Intersection over Union:
    • $\mathrm{tIoU}(\hat{h}, t^*) = \frac{\mathrm{length}(\hat{h} \cap t^*)}{\mathrm{length}(\hat{h} \cup t^*)}$
  • Recall metrics at rank $n$ for threshold $\tau$:
    • $R_n@\tau = \frac{1}{|Q|} \sum_{q} \mathbb{1}\left[\mathrm{tIoU}(\hat{h}_q^{(n)}, t_q^*) \geq \tau\right]$
  • For combined queries, Bidirectional Foreground $J\&F$:
    • $J\&F_{bi} = \frac{(J_p + F_p) \times (J_g + F_g)}{(J_p + F_p) + (J_g + F_g)}$
    • where $(J_p, F_p)$ are computed over the predicted span and $(J_g, F_g)$ over the ground-truth span.
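Under these definitions, $J$ and tIoU reduce to simple set and interval computations. A minimal sketch over binary masks and (start, end) spans (boundary extraction for $F$ is omitted):

```python
import numpy as np

def region_j(pred, gt):
    """Region similarity J: intersection-over-union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def tiou(pred_span, gt_span):
    """Temporal IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred_span[1], gt_span[1]) - max(pred_span[0], gt_span[0]))
    union = (pred_span[1] - pred_span[0]) + (gt_span[1] - gt_span[0]) - inter
    return inter / union if union else 1.0
```

For instance, a predicted span (30, 45) against a ground-truth span (35, 50) overlaps for 10 s over a 20 s union, giving tIoU = 0.5.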

LoomData-8.7k’s dense, timestamp-aligned structuring underpins state-of-the-art or competitive results for VideoLoom across spatial and temporal video understanding tasks, such as 63.1 J&F on ReVOS (referring object segmentation) and 48.3 Recall@1 on Charades-STA (temporal grounding) (Shi et al., 12 Jan 2026).

6. Significance, Limitations, and Prospective Uses

LoomData-8.7k constitutes a salient advance for the training of video LLMs, offering a uniquely rich, coherent suite for spatial-temporal reasoning. Its design, centered on long-form, real-world video with dense frame-level annotations and language-aligned temporal labels, addresses a pronounced gap in human-centric multimodal corpora.

While not benchmarked independently, its contribution to VideoLoom’s performance sets a precedent for future datasets targeting spatial-temporal video understanding. The sole focus on single-character segments may limit modeling of multi-agent interactions; however, this restriction ensures clarity and annotation consistency. The absence of a public validation/test split confines its utility to training scenarios; comparative benchmarking is deferred to associated datasets (e.g., LoomBench).

A plausible implication is that LoomData-8.7k, by virtue of its annotation density, duration, and detailed natural language grounding, enables not only improved segmentation and tracking models but also facilitates research into the compositional and hierarchical structure of human activity in video (Shi et al., 12 Jan 2026).
