LoomData-8.7k: Spatial-Temporal Video Dataset

Updated 19 January 2026
  • LoomData-8.7k is a comprehensive human-centric video dataset featuring detailed frame-wise object masks and temporally grounded action captions.
  • It employs a hybrid shot detection pipeline with PySceneDetect and KTS, ensuring robust segmentation and high-quality annotations across 1,456 ActivityNet videos.
  • Its unified JSON Lines format and meticulous annotation enable advanced training of video-language models with precise spatial and temporal reasoning.

LoomData-8.7k is a human-centric video dataset curated for joint spatial-temporal video understanding, providing dense, frame-wise object masks and temporally grounded, natural-language action captions. Released as part of the VideoLoom project, it serves as a large-scale source of training data, supporting the development of large video-LLMs with fine-grained spatial and temporal reasoning capabilities. The corpus consists of 1,456 ActivityNet training videos, encompassing 8,710 carefully segmented "shots" (person-centered video segments), and offers a unified annotation format compatible with modern video modeling pipelines (Shi et al., 12 Jan 2026).

1. Dataset Construction and Structure

LoomData-8.7k originates from the ActivityNet training split, with scenes partitioned via a hybrid application of PySceneDetect and KTS to detect scene and event boundaries. Key steps in the construction pipeline include merging adjacent shots shorter than 1 s and discarding videos with more than 10 shots, yielding a filtered cohort of 1,456 videos. On average, each video is 102.2 seconds long, corresponding to roughly 3,066 frames at 30 FPS. Each video is divided into an average of 6.0 shots (σ ≈ 1.2), with each shot spanning a mean of 15.0 s (σ ≈ 9.0); the shot-length distribution is heavy-tailed, with over 50% of shots lying within [5, 15] seconds.
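The merge-and-filter rules above can be sketched in a few lines. The 1 s merge threshold and 10-shot cap come from the text; the function names and the (start, end) tuple representation of shots are illustrative assumptions:

```python
def merge_short_shots(shots, min_len=1.0):
    """Merge any shot shorter than min_len seconds into its predecessor.

    `shots` is a list of (start, end) tuples in seconds, sorted by start time.
    """
    merged = []
    for start, end in shots:
        if merged and (end - start) < min_len:
            prev_start, _ = merged[-1]
            merged[-1] = (prev_start, end)  # absorb the short shot
        else:
            merged.append((start, end))
    return merged


def keep_video(shots, max_shots=10):
    """Videos segmented into more than max_shots shots are discarded."""
    return len(shots) <= max_shots
```

For example, a 0.5 s shot at (5.0, 5.5) is absorbed into the preceding shot, giving boundaries (0.0, 5.5).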

All annotation and preprocessing steps follow programmatic, template-driven automation, punctuated by multi-pass human verification for appearance and mask quality.

2. Annotation Workflow and Quality Assurance

Annotations in LoomData-8.7k are generated through a four-stage, predominantly automatic pipeline, interleaved with targeted manual quality control:

  1. Shot Partition: PySceneDetect and KTS delineate scene and event boundaries. Adjacent shots with <1 s duration are merged, and shots are further refined to avoid videos with excessive segmentation.
  2. Spatial Mask Annotation:
    • The main character in each shot’s center frame is detected using GroundingDINO, prioritizing the highest-scoring “person” bounding box.
    • Pix2Cap produces a textual appearance description for the detected figure.
    • SAM2 tracks this subject bidirectionally, delivering dense, per-frame mask tracklets.
    • If tracklets are missing, the process is repeated with GroundingDINO, using the appearance description as the detection prompt.
    • Manual screening eliminates untrackable or incorrectly masked segments.
  3. Shot Merging: Shots with overlapping tracklets and consistent character identity (across different camera angles) are merged for temporal consistency.
  4. Temporal Action Annotation:
    • Videos are sampled at 2 FPS; frames receive unique numeric IDs through NumPro.
    • Set-of-Marks overlays these IDs onto mask regions.
    • Gemini 2.5 Pro ingests prompt-structured frames to generate detailed, timestamp-aligned natural language action captions.
    • Manual curation excises segments with missing or erroneous masks or IDs.

This pipeline ensures every segment is covered by a dense per-frame mask (≈450 per shot) and a high-fidelity, grounded action caption.
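The 2 FPS sampling in stage 4 over a 30 FPS source implies a fixed stride of 15 frames. A minimal sketch of this bookkeeping, assuming the NumPro-style numeric IDs correspond to source frame indices:

```python
def sampled_frame_ids(start_frame, end_frame, src_fps=30, sample_fps=2):
    """Return the source-frame indices retained when sampling at sample_fps.

    At a 30 FPS source sampled at 2 FPS, every 15th frame is kept.
    """
    stride = src_fps // sample_fps  # 15 for 30 FPS -> 2 FPS
    return list(range(start_frame, end_frame + 1, stride))
```

For a 15 s shot spanning frames 900 to 1350, this yields 31 sampled frames, each of which receives a numeric ID overlaid on its mask region.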

3. Annotation Schema and Data Format

Each video is annotated in JSON Lines (.jsonl), one object per shot. Entry fields include:

  • video_id: Unique ActivityNet video identifier.
  • fps: Nominal 30 FPS, matching source material.
  • duration: Video duration in seconds.
  • segments: List containing, per shot:
    • shot_id: Zero-based shot index.
    • start_time, end_time: Shot temporal boundaries (seconds).
    • start_frame, end_frame: Frame-level shot boundaries.
    • action_caption: Timestamp-aligned, frame-indexed narrative generated by Gemini 2.5 Pro, referencing the person’s actions.
    • masks: Per-frame mask dicts {frame_id, mask_path, bbox}, with binary PNG masks (1,024 × 1,024 resolution) and bounding boxes in absolute pixel coordinates.

An example entry:

{
  "video_id": "v_000045",
  "duration": 98.3,
  "fps": 30,
  "segments": [
    {
      "shot_id": 2,
      "start_time": 30.0,
      "end_time": 45.0,
      "start_frame": 900,
      "end_frame": 1350,
      "action_caption": "Frame 901: the person wearing a blue jacket opens the refrigerator; … Frame 1350: the person closes the door and turns around.",
      "masks": [
        {"frame_id": 900, "mask_path": ".../900.png", "bbox": [234, 120, 312, 482]},
        ⋮
        {"frame_id": 1350, "mask_path": ".../1350.png", "bbox": [228, 115, 315, 480]}
      ]
    }
  ]
}
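Since each shot is one JSON object per line, records can be consumed directly with the standard json module. The snippet below uses the example entry above (masks truncated for brevity) to derive per-shot length and frame count:

```python
import json

# One JSONL record, abbreviated to the fields used below.
line = ('{"video_id": "v_000045", "duration": 98.3, "fps": 30, "segments": '
        '[{"shot_id": 2, "start_time": 30.0, "end_time": 45.0, '
        '"start_frame": 900, "end_frame": 1350, "masks": []}]}')

record = json.loads(line)
for seg in record["segments"]:
    n_frames = seg["end_frame"] - seg["start_frame"] + 1  # 451 masked frames
    length_s = seg["end_time"] - seg["start_time"]        # 15.0 s shot
```

The inclusive frame range gives 451 per-frame masks for this shot, matching the ≈450 masks per shot quoted above.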

The dataset focuses exclusively on solo "person" instances per shot. Actions are described using an open vocabulary, spanning approximately 200 unique verbs (Shi et al., 12 Jan 2026).

4. Dataset Statistics and Characteristics

Summarized attributes of LoomData-8.7k:

| Property | Value/Range | Notes |
|---|---|---|
| Videos | 1,456 | ActivityNet train split (after filtering) |
| Shots | 8,710 | Average 6.0 per video (σ ≈ 1.2) |
| Shot length | μ = 15.0 s, σ ≈ 9.0 | Median ≈ 13.2 s; range [1, 45] s, heavy-tailed |
| Masks per shot | ≈450 | Every frame masked at 30 FPS |
| Total duration | ≈41.3 h | 102.2 s/video on average |
| Caption length | μ = 41.3 words | Templated, timestamp-aligned narration |
| Objects per segment | 1 ("person") | One tracked identity per shot |
| Action verbs | ≈200 | Open-vocabulary, naturalistic action phrases |
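The headline figures are mutually consistent, which a quick back-of-the-envelope check confirms:

```python
videos, shots = 1456, 8710
mean_video_s, mean_shot_s, fps = 102.2, 15.0, 30

shots_per_video = shots / videos              # ≈ 5.98, i.e. the reported 6.0
masks_per_shot = mean_shot_s * fps            # 15 s x 30 FPS = 450 masks
total_hours = videos * mean_video_s / 3600    # ≈ 41.3 h of video in total
```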

Shot centers are nearly uniformly distributed within each video. Mask annotations are dense, while action descriptions are designed to be referentially precise and temporally grounded (e.g., referencing objects by color or clothing).

5. Benchmarking, Evaluation Metrics, and Usage

LoomData-8.7k is employed entirely as a training resource for VideoLoom, without its own dedicated held-out validation/test split. Evaluation occurs using a separate ActivityNet-based LoomBench benchmark. LoomData-8.7k enables the VideoLoom model to support:

  • Referring video object segmentation, with region similarity $J$ and contour accuracy $F$:
    • $J(P, G) = \frac{|P \cap G|}{|P \cup G|}$
    • $F(P, G) = \frac{2|b(P) \cap b(G)|}{|b(P)| + |b(G)|}$, where $b(\cdot)$ denotes the mask boundary
    • $J\&F = (J + F)/2$
  • Temporal grounding, using temporal Intersection over Union:
    • $\mathrm{tIoU}(\hat{h}, t^*) = \frac{\mathrm{length}(\hat{h} \cap t^*)}{\mathrm{length}(\hat{h} \cup t^*)}$
  • Recall metrics at rank $n$ for threshold $\tau$:
    • $R_n@\tau = \frac{1}{|Q|} \sum_{q} \mathbb{1}\left[\mathrm{tIoU}(\hat{h}_q^{(n)}, t_q^*) \geq \tau\right]$
  • For combined queries, Bidirectional Foreground $J\&F$:
    • $J\&F_{bi} = \frac{(J_p + F_p) \times (J_g + F_g)}{(J_p + F_p) + (J_g + F_g)}$
    • where $(J_p, F_p)$ are computed over the predicted span and $(J_g, F_g)$ over the ground-truth span.
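Under these definitions, $J$ and tIoU reduce to simple set and interval computations. A minimal sketch over binary masks and (start, end) spans (boundary extraction for $F$ is omitted):

```python
import numpy as np

def region_j(pred, gt):
    """Region similarity J: intersection-over-union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def tiou(pred_span, gt_span):
    """Temporal IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred_span[1], gt_span[1]) - max(pred_span[0], gt_span[0]))
    union = (pred_span[1] - pred_span[0]) + (gt_span[1] - gt_span[0]) - inter
    return inter / union if union else 1.0
```

For instance, a predicted span (30, 45) against a ground-truth span (35, 50) overlaps for 10 s over a 20 s union, giving tIoU = 0.5.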

LoomData-8.7k’s dense, timestamp-aligned structuring underpins state-of-the-art or competitive results for VideoLoom across spatial and temporal video understanding tasks, such as 63.1 J&F on ReVOS (referring object segmentation) and 48.3 Recall@1 on Charades-STA (temporal grounding) (Shi et al., 12 Jan 2026).

6. Significance, Limitations, and Prospective Uses

LoomData-8.7k constitutes a salient advance for the training of video LLMs, offering a uniquely rich, coherent suite for spatial-temporal reasoning. Its design, centered on long-form, real-world video with dense frame-level annotations and language-aligned temporal labels, addresses a pronounced gap in human-centric multimodal corpora.

While not benchmarked independently, its contribution to VideoLoom’s performance sets a precedent for future datasets targeting spatial-temporal video understanding. The sole focus on single-character segments may limit modeling of multi-agent interactions; however, this restriction ensures clarity and annotation consistency. The absence of a public validation/test split confines its utility to training scenarios; comparative benchmarking is deferred to associated datasets (e.g., LoomBench).

A plausible implication is that LoomData-8.7k, by virtue of its annotation density, duration, and detailed natural language grounding, enables not only improved segmentation and tracking models but also facilitates research into the compositional and hierarchical structure of human activity in video (Shi et al., 12 Jan 2026).
