LoomBench: Spatial–Temporal Evaluation Suite
- LoomBench is a benchmark suite designed to evaluate joint spatial–temporal reasoning in Video LLMs using curated video QA pairs.
- It systematically challenges models with 'When', 'Where', and combined queries, integrating temporal localization and spatial segmentation.
- Baseline evaluations show that the unified VideoLoom 8B model achieves notable improvements in tIoU and segmentation accuracy over modular and cascade approaches.
LoomBench is a purpose-built benchmark suite designed to comprehensively evaluate the spatial–temporal reasoning capabilities of Video Large Language Models (Video LLMs). It systematically probes event grounding in time, localization of actors in space, and complex joint “where–when” reasoning through a fine-grained set of temporally, spatially, and compositionally challenging question–answer pairs, advancing standards in multimodal intelligence and video understanding (Shi et al., 12 Jan 2026).
1. Motivation for Joint Spatial–Temporal Evaluation
Modern Video LLMs exhibit distinct strengths on either temporal reasoning (e.g., identifying when an event occurs) or spatial perception (e.g., determining the location of actors), but seldom demonstrate robust joint comprehension. LoomBench was conceived to challenge models not only with isolated “When?” and “Where?” queries but also with compositional questions requiring the integration of temporal grounding and spatial segmentation. This design reflects a need to benchmark progress toward unified spatial–temporal understanding, rather than proficiency in isolated modalities.
2. Scale, Source Material, and Evaluation Policy
LoomBench encompasses 130 short video clips sourced from the ActivityNet Validation set, each averaging 17.6 seconds and 4.2 shots per video. The evaluation suite comprises 1,484 meticulously curated question–answer pairs, distributed as follows:
- 541 “When” questions: Purely temporal queries (e.g., “When does the person in dark clothing pick up the pink frisbee?”)
- 487 “Where” questions: Spatial localization after temporal interval identification (e.g., “Where is the person when they pick up the pink frisbee?”)
- 456 “Combined” questions: Compositional challenges demanding temporal identification followed by spatial segmentation (e.g., “Where is the person in dark clothing when they throw the pink frisbee and the dog leaps to catch it?”)
LoomBench is designed strictly for evaluation: no training or validation splits are provided; all examples are reserved for end-to-end testing.
| Category | Count | Example Query |
|---|---|---|
| When | 541 | “When does the person in dark clothing pick up the pink frisbee?” |
| Where | 487 | “Where is the person when they pick up the pink frisbee?” |
| Combined | 456 | “Where is the person in dark clothing when they throw the pink frisbee and the dog leaps …” |
3. Data Collection and Annotation Pipeline
The LoomBench annotation pipeline extends the approach introduced in LoomData-8.7k, employing the following process on ActivityNet Val shots:
- Automatic Shot Partitioning: PySceneDetect with Kernel Temporal Segmentation (KTS) splits videos into contiguous shots.
- Main-Character Detection: GroundingDINO identifies principal actors within each shot.
- Instance Segmentation and Tracking: Segment Anything Model v2 (SAM2) tracks masks of detected characters through shots.
- Action Description Generation: Gemini 2.5 links action descriptions to frame IDs.
- QA Pair Synthesis: LLaMA 3.1, prompted with detailed descriptions, generates candidate question-answer pairs.
Quality assurance involves two rounds of manual verification. First, videos lacking mask tracklets or containing inaccurate masks are removed. Then, all generated QA items undergo human review to ensure correct phrasing and grounding consistency. A plausible implication is that this rigorous multi-stage process yields both semantic and visual fidelity in the evaluation set.
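The first verification round described above amounts to a filter over annotated clips. A minimal Python sketch of that drop rule follows; the record fields and function name are hypothetical, chosen only to illustrate the logic, not taken from the LoomBench release:

```python
from dataclasses import dataclass, field

@dataclass
class AnnotatedClip:
    """Hypothetical record for one candidate clip in the pipeline."""
    video_id: str
    tracklets: list = field(default_factory=list)  # SAM2 mask tracklets per shot
    masks_accurate: bool = True                    # outcome of the manual mask check
    qa_pairs: list = field(default_factory=list)   # generated QA candidates

def round_one_filter(clips):
    """Round 1: drop clips that lack mask tracklets or whose masks were judged inaccurate."""
    return [c for c in clips if c.tracklets and c.masks_accurate]
```

Round 2 (human review of QA phrasing and grounding) is inherently manual and has no code analogue.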
4. Benchmark Tasks and Evaluation Protocols
LoomBench organizes its assessment into three linked subtasks:
a. Temporal Grounding (“When?”)
- Input: Full video plus a natural-language query.
- Output: A temporal interval I_pred = [t_s, t_e].
- Metrics:
- Recall@1@IoU (R1@m): Fraction of questions where the temporal Intersection over Union (tIoU) with ground truth exceeds threshold m (e.g., m = 0.5).
- Average Temporal IoU (tIoU).
Formally, if I_gt is the ground-truth interval and I_pred the predicted interval, then tIoU = |I_gt ∩ I_pred| / |I_gt ∪ I_pred|.
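A minimal Python sketch of tIoU and R1@m on intervals given as (start, end) pairs in seconds (function names are ours, not from the benchmark toolkit):

```python
def tiou(gt, pred):
    """Temporal IoU between two intervals (start, end) in seconds."""
    inter = max(0.0, min(gt[1], pred[1]) - max(gt[0], pred[0]))
    union = (gt[1] - gt[0]) + (pred[1] - pred[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(gts, preds, m=0.5):
    """R1@m: fraction of questions whose top-1 prediction reaches tIoU >= m."""
    hits = sum(tiou(g, p) >= m for g, p in zip(gts, preds))
    return hits / len(gts)
```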
b. Referring Segmentation (“Where?”)
- Input: Clipped segment and a query.
- Output: Frame-by-frame binary masks M_t for each frame t in the segment.
- Metric: J&F, the mean of region similarity J and contour accuracy F across frames, with J = |M_pred ∩ M_gt| / |M_pred ∪ M_gt| and J&F = (J + F) / 2,
where M_pred and M_gt denote the predicted and ground-truth masks, respectively, and F is the boundary F-measure over mask contours.
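A pure-Python sketch of the region term J and the per-frame J&F combination, with masks given as same-shape nested lists of 0/1 (the contour term F, which needs boundary matching, is taken as given):

```python
def region_similarity(pred, gt):
    """J: intersection-over-union of two binary masks (same-shape nested 0/1 lists)."""
    inter = union = 0
    for prow, grow in zip(pred, gt):
        for p, g in zip(prow, grow):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Convention when both masks are empty: perfect agreement.
    return inter / union if union else 1.0

def j_and_f(j, f):
    """Per-frame J&F: mean of region similarity J and contour accuracy F."""
    return (j + f) / 2
```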
c. Compositional Grounding & Segmentation (“Combined”)
- Input: Full video and a compositional query.
- Output: Both a temporal interval [t_s, t_e] and a per-frame mask sequence {M_t}.
- Metrics: tIoU for the interval and Bidirectional Foreground J&F for the masks, defined as J&F_bid = (J&F_pred + J&F_gt) / 2,
where J&F_pred is the per-frame J&F averaged over the predicted mask span and J&F_gt the per-frame J&F averaged over the ground-truth span.
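Assuming the bidirectional score averages per-frame J&F over the predicted span and over the ground-truth span, then takes the mean of the two (an interpretation of the definition above; the frame-indexed dictionary and function names are ours), a sketch:

```python
def mean_over_span(jf_per_frame, span):
    """Average per-frame J&F over an inclusive frame span; frames with no
    score (e.g., no mask predicted or annotated) count as 0."""
    frames = range(span[0], span[1] + 1)
    return sum(jf_per_frame.get(t, 0.0) for t in frames) / len(frames)

def bidirectional_jf(jf_per_frame, pred_span, gt_span):
    """Bidirectional Foreground J&F: mean of the span-averages over the
    predicted and ground-truth temporal spans."""
    return (mean_over_span(jf_per_frame, pred_span)
            + mean_over_span(jf_per_frame, gt_span)) / 2
```

Averaging over both spans penalizes a model both for predicting masks outside the true interval and for missing frames inside it.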
5. Empirical Results and Baseline Comparisons
LoomBench provides a comparative platform for temporal-only (TimeSuite 7B), spatial-only (Sa2VA 8B), cascade baselines, and the unified VideoLoom 8B model. Evaluation on all question types demonstrates the limitations of modality-specific models and the superiority of joint spatial–temporal architectures.
| Method | When ([email protected], tIoU) | Where (J&F) | Combined (tIoU, J&F) |
|---|---|---|---|
| TimeSuite 7B | 23.1, 27.6 | – | – |
| Sa2VA 8B | – | 86.1 | – |
| TimeSuite+Sa2VA | – | – | 25.4, 33.7 |
| VideoLoom 8B | 37.9, 39.7 | 87.2 | 41.6, 49.1 |
VideoLoom 8B attains a 12.1-point tIoU improvement and a 14.8-point rise in [email protected] over TimeSuite on temporal queries. On spatial tasks, VideoLoom slightly surpasses Sa2VA (87.2 vs. 86.1 J&F), and on Combined queries it gains +16.2 points in tIoU and +15.4 in J&F over the cascade, indicating the value of non-cascaded, multimodal modeling.
6. Example Queries and Qualitative Evaluation
LoomBench includes prompt–response exemplars to illustrate evaluation depth:
- When: “When does the man in overalls flip the wrench open and start tightening the bolt?” → “Between 12.6 s and 14.2 s.”
- Where: “Where is the person when they press the red button?” → “The person is standing next to the control panel on the right side—see the highlighted mask around frames 34–36.”
- Combined: “Where is the person in dark clothing when he throws the pink frisbee and the dog leaps to catch it?” → “During seconds 5.8–7.3 (frisbee throw), the person appears in the lower‐left quadrant—see the segmentation masks over those frames.”
Even concise queries require precise temporal localization and pixel-level mask prediction over varying spans, evidencing the benchmark’s multi-modal stress-test function. This suggests the suite’s capacity to uncover model deficiencies in spatial–temporal integration.
7. Implications for Future Research in Video-Language Modeling
LoomBench establishes a unified protocol for evaluating joint spatial–temporal reasoning in Video LLMs, demonstrating the inadequacy of modular or cascade approaches compared to fully integrated models. Its granular annotation and rigorously verified QA pairs offer both a challenge and a measurement tool for innovation in multimodal video understanding. A plausible implication is that future benchmarks may adopt similar compositional and multi-step query structures to catalyze progress toward general video intelligence and compositional perception (Shi et al., 12 Jan 2026).