
LoomBench: Spatial–Temporal Evaluation Suite

Updated 19 January 2026
  • LoomBench is a benchmark suite designed to evaluate joint spatial–temporal reasoning in Video LLMs using curated video QA pairs.
  • It systematically challenges models with 'When', 'Where', and combined queries, integrating temporal localization and spatial segmentation.
  • Baseline evaluations, including VideoLoom 8B, show notable improvements in tIoU and segmentation accuracy over modular approaches.

LoomBench is a purpose-built benchmark suite designed to comprehensively evaluate the spatial–temporal reasoning capabilities of Video Large Language Models (Video LLMs). It systematically probes event grounding in time, localization of actors in space, and complex joint "where–when" reasoning through a fine-grained set of temporally, spatially, and compositionally challenging question–answer pairs, advancing standards in multimodal intelligence and video understanding (Shi et al., 12 Jan 2026).

1. Motivation for Joint Spatial–Temporal Evaluation

Modern Video LLMs exhibit distinct strengths on either temporal reasoning (e.g., identifying when an event occurs) or spatial perception (e.g., determining the location of actors), but seldom demonstrate robust joint comprehension. LoomBench was conceived to challenge models not only with isolated “When?” and “Where?” queries but also with compositional questions requiring the integration of temporal grounding and spatial segmentation. This design reflects a need to benchmark progress toward unified spatial–temporal understanding, rather than proficiency in isolated modalities.

2. Scale, Source Material, and Evaluation Policy

LoomBench encompasses 130 short video clips sourced from the ActivityNet Validation set, each averaging 17.6 seconds and 4.2 shots per video. The evaluation suite comprises 1,484 meticulously curated question–answer pairs, distributed as follows:

  • 541 “When” questions: Purely temporal queries (e.g., “When does the person in dark clothing pick up the pink frisbee?”)
  • 487 “Where” questions: Spatial localization after temporal interval identification (e.g., “Where is the person when they pick up the pink frisbee?”)
  • 456 “Combined” questions: Compositional challenges demanding temporal identification followed by spatial segmentation (e.g., “Where is the person in dark clothing when they throw the pink frisbee and the dog leaps to catch it?”)

LoomBench is designed strictly for evaluation: no training or validation splits are provided; all examples are reserved for end-to-end testing.

| Category | Count | Example Query |
| --- | --- | --- |
| When | 541 | “When does the person in dark clothing pick up the pink frisbee?” |
| Where | 487 | “Where is the person when they pick up the pink frisbee?” |
| Combined | 456 | “Where is the person in dark clothing when they throw the pink frisbee and the dog leaps …” |
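As a quick sanity check, the per-category counts reported above sum to the stated total of 1,484 QA pairs:

```python
# Category counts as reported for LoomBench (evaluation-only; no train/val splits).
counts = {"when": 541, "where": 487, "combined": 456}
total = sum(counts.values())
print(total)  # 1484, the number of curated QA pairs
```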

3. Data Collection and Annotation Pipeline

The LoomBench annotation pipeline extends the approach introduced in LoomData-8.7k, employing the following process on ActivityNet Val shots:

  • Automatic Shot Partitioning: PySceneDetect with Kernel Temporal Segmentation (KTS) splits videos into contiguous shots.
  • Main-Character Detection: GroundingDINO identifies principal actors within each shot.
  • Instance Segmentation and Tracking: Segment Anything Model v2 (SAM2) tracks masks of detected characters through shots.
  • Action Description Generation: Gemini 2.5 links action descriptions to frame IDs.
  • QA Pair Synthesis: LLaMA 3.1, prompted with detailed descriptions, generates candidate question-answer pairs.

Quality assurance involves two rounds of manual verification. First, videos lacking mask tracklets or containing inaccurate masks are removed. Then, all generated QA items undergo human review to ensure correct phrasing and grounding consistency. A plausible implication is that this rigorous multi-stage process yields both semantic and visual fidelity in the evaluation set.
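The items that survive this pipeline can be pictured with a minimal record type. The field names below are illustrative assumptions for exposition, not the released schema:

```python
from dataclasses import dataclass


@dataclass
class LoomBenchItem:
    # Hypothetical field layout -- the actual release format may differ.
    video_id: str
    category: str     # "when", "where", or "combined"
    question: str
    gt_start: float   # ground-truth interval start, seconds
    gt_end: float     # ground-truth interval end, seconds
    mask_dir: str     # per-frame binary masks (tracklets) for spatial scoring


def needs_masks(item: LoomBenchItem) -> bool:
    """"Where" and "combined" items are scored against mask tracklets;
    pure "when" items only need the temporal interval."""
    return item.category in ("where", "combined")
```

A "combined" item, for instance, carries both an interval and a mask directory, while a "when" item is scored on the interval alone.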

4. Benchmark Tasks and Evaluation Protocols

LoomBench organizes its assessment into three linked subtasks:

a. Temporal Grounding (“When?”)

  • Input: Full video plus a natural-language query.
  • Output: A temporal interval $\hat{G} = [\hat t_{\rm start}, \hat t_{\rm end}]$.
  • Metrics:
    • Recall@1@IoU (R1@m): Fraction of questions where temporal Intersection over Union (tIoU) with ground truth exceeds threshold $m$.
    • Average Temporal IoU (tIoU).

Formally, if $G$ is the ground-truth interval and $\hat G$ the predicted one, then

$$\mathrm{tIoU}(G, \hat G) = \frac{|G \cap \hat G|}{|G \cup \hat G|}\,.$$
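The interval metrics above fit in a few lines; a minimal sketch, assuming intervals are `(start, end)` pairs in seconds:

```python
def tiou(pred, gt):
    """Temporal IoU between two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds, gts, m):
    """R1@m: fraction of questions whose top-1 prediction has tIoU > m."""
    hits = sum(1 for p, g in zip(preds, gts) if tiou(p, g) > m)
    return hits / len(gts)
```

For example, `tiou((0, 2), (1, 3))` overlaps for 1 s over a 3 s union, giving 1/3, which would miss an R1@0.5 threshold but clear R1@0.3.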

b. Referring Segmentation (“Where?”)

  • Input: Clipped segment $V_{[\hat t_{\rm start}, \hat t_{\rm end}]}$ and a query.
  • Output: Frame-by-frame binary masks $\hat M_t$ for each frame $t$.
  • Metric: $\mathcal{J}\&\mathcal{F}$, the mean of region similarity ($\mathcal{J}$) and contour accuracy ($\mathcal{F}$) across frames:

$$\mathcal{J}\&\mathcal{F} = \frac{1}{2T}\sum_{t=1}^{T} \left[\mathcal{J}(P_t, G_t) + \mathcal{F}(P_t, G_t)\right]$$

where $P_t$ and $G_t$ denote predicted and ground-truth masks, respectively.
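The per-frame average can be sketched as follows. Note this is a simplification: $\mathcal{J}$ is exactly mask IoU, but the official contour measure allows a small boundary tolerance band, which is omitted here in favor of exact boundary overlap:

```python
import numpy as np


def region_j(pred, gt):
    """Region similarity J: IoU of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0


def boundary(mask):
    """Foreground pixels with at least one background 4-neighbour."""
    m = mask.astype(bool)
    pad = np.pad(m, 1, constant_values=False)
    interior = pad[:-2, 1:-1] & pad[2:, 1:-1] & pad[1:-1, :-2] & pad[1:-1, 2:]
    return m & ~interior


def contour_f(pred, gt):
    """Simplified contour accuracy F: F-measure of exact boundary overlap
    (no tolerance band, unlike the standard DAVIS-style F)."""
    bp, bg = boundary(pred), boundary(gt)
    if not bp.any() and not bg.any():
        return 1.0
    prec = (bp & bg).sum() / bp.sum() if bp.any() else 0.0
    rec = (bp & bg).sum() / bg.sum() if bg.any() else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0


def j_and_f(preds, gts):
    """Mean of J and F over T frames, matching the formula above."""
    return sum(region_j(p, g) + contour_f(p, g)
               for p, g in zip(preds, gts)) / (2 * len(gts))
```

A prediction identical to the ground truth scores 1.0; disjoint masks score 0 on $\mathcal{J}$ and on boundary overlap.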

c. Compositional Grounding & Segmentation (“Combined”)

  • Input: Full video and a compositional query.
  • Output: Both a temporal interval $\hat G$ and a mask sequence $\hat M_{t \in \hat G}$.
  • Metrics: tIoU for the interval and bidirectional foreground $\mathcal{J}\&\mathcal{F}_{\rm bi\text{-}fore}$ for the masks, defined as:

$$\mathcal{J}\&\mathcal{F}_{\rm bi\text{-}fore} = \frac{(\mathcal{J}_p + \mathcal{F}_p)\,(\mathcal{J}_g + \mathcal{F}_g)}{(\mathcal{J}_p + \mathcal{F}_p) + (\mathcal{J}_g + \mathcal{F}_g)}$$

where $(\mathcal{J}_p, \mathcal{F}_p)$ are scores computed over the predicted mask span and $(\mathcal{J}_g, \mathcal{F}_g)$ over the ground-truth span.
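The bidirectional combination is a direct transcription of the formula above, a harmonic-mean-style reduction of the two span scores that reaches 1.0 only when both spans are scored perfectly:

```python
def jf_bi_fore(jp, fp, jg, fg):
    """Bidirectional foreground J&F: combines (J, F) computed on the
    predicted interval (jp, fp) with (J, F) computed on the
    ground-truth interval (jg, fg), as in the formula above."""
    a, b = jp + fp, jg + fg
    return (a * b) / (a + b) if (a + b) > 0 else 0.0
```

With perfect scores on both spans (`jf_bi_fore(1, 1, 1, 1)`), the result is $2 \cdot 2 / 4 = 1.0$; if either span scores zero, the product drives the whole metric to zero.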

5. Empirical Results and Baseline Comparisons

LoomBench provides a comparative platform for temporal-only (TimeSuite 7B), spatial-only (Sa2VA 8B), cascade baselines, and the unified VideoLoom 8B model. Evaluation on all question types demonstrates the limitations of modality-specific models and the superiority of joint spatial–temporal architectures.

| Method | When (R1@m, tIoU) | Where ($\mathcal{J}\&\mathcal{F}$) | Combined (tIoU, $\mathcal{J}\&\mathcal{F}_{\rm bi\text{-}fore}$) |
| --- | --- | --- | --- |
| TimeSuite 7B | 23.1, 27.6 | — | — |
| Sa2VA 8B | — | 86.1 | — |
| TimeSuite + Sa2VA | — | — | 25.4, 33.7 |
| VideoLoom 8B | 37.9, 39.7 | 87.2 | 41.6, 49.1 |

VideoLoom 8B attains a 12.1-point tIoU improvement and a 14.8-point rise in R1@m over TimeSuite on temporal queries. On spatial tasks it slightly surpasses Sa2VA (87.2 vs. 86.1 $\mathcal{J}\&\mathcal{F}$), and on Combined queries it gains +16.2 points in tIoU and +15.4 points in $\mathcal{J}\&\mathcal{F}_{\rm bi\text{-}fore}$ over the TimeSuite + Sa2VA cascade, indicating the value of non-cascaded, jointly trained spatial–temporal modeling.

6. Example Queries and Qualitative Evaluation

LoomBench includes prompt–response exemplars to illustrate evaluation depth:

  • When: “When does the man in overalls flip the wrench open and start tightening the bolt?” → “Between 12.6 s and 14.2 s.”
  • Where: “Where is the person when they press the red button?” → “The person is standing next to the control panel on the right side—see the highlighted mask around frames 34–36.”
  • Combined: “Where is the person in dark clothing when he throws the pink frisbee and the dog leaps to catch it?” → “During seconds 5.8–7.3 (frisbee throw), the person appears in the lower-left quadrant—see the segmentation masks over those frames.”

Even concise queries require precise temporal localization and pixel-level mask prediction over varying spans, underscoring the benchmark’s role as a multi-modal stress test capable of exposing model deficiencies in spatial–temporal integration.

7. Implications for Future Research in Video-Language Modeling

LoomBench establishes a unified protocol for evaluating joint spatial–temporal reasoning in Video LLMs, demonstrating the inadequacy of modular or cascade approaches compared to fully integrated models. Its granular annotation and rigorously verified QA pairs offer both a challenge and a measurement tool for innovation in multimodal video understanding. A plausible implication is that future benchmarks may adopt similar compositional and multi-step query structures to catalyze progress toward general video intelligence and compositional perception (Shi et al., 12 Jan 2026).
