- The paper introduces VideoMolmo, a multimodal model that significantly improves spatio-temporal pointing accuracy by decomposing localization into pointing and mask generation phases.
- It employs advanced cross-attention and bidirectional point propagation (SAM2) to ensure temporal consistency and accurate object localization across video frames.
- Performance results on VPoS-Bench demonstrate that VideoMolmo outperforms previous models in precision, recall, and F1 scores, improving interpretation of complex video scenes.
Analysis of "VideoMolmo: Spatio-Temporal Grounding Meets Pointing"
The paper introduces VideoMolmo, a multimodal model aimed at improving spatio-temporal pointing accuracy in dynamic visual environments by leveraging textual descriptions. Developed as an extension of Molmo, VideoMolmo introduces a novel architecture to address the shortcomings of existing video-based approaches, particularly in complex reasoning and accurate object localization. The framework employs attention mechanisms for fine-grained temporal analysis, which enables more coherent grounding of visual elements across video sequences.
Architectural and Methodological Insights
VideoMolmo's architecture is centered around a spatio-temporal localization task, decomposed into two distinct phases: pointing and segmentation mask generation. This decomposition simplifies the task by delegating precise coordinate generation to the LLM and then using a mask-fusion module to produce coherent segmentations. Notably, a temporal module based on cross-attention allows each frame to attend to historical context, improving temporal consistency. Furthermore, SAM2 is employed to propagate points bidirectionally, yielding consistent localization across frames.
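The two-phase pipeline described above can be sketched conceptually: a pointer conditioned on past-frame context emits per-frame coordinates, and a separate mask generator (SAM2 in the paper) converts those points into masks with forward and backward propagation. The sketch below is a minimal toy illustration of that control flow only; all function names, the averaging stand-in for cross-attention, and the set-based "masks" are assumptions for exposition, not the paper's actual implementation or API.

```python
# Toy sketch of VideoMolmo-style decomposition: phase 1 points, phase 2 masks.
# Everything here is illustrative; real cross-attention and SAM2 are far richer.
from typing import List, Set, Tuple

Point = Tuple[float, float]

def point_with_temporal_context(frame_feat: List[float],
                                memory: List[List[float]]) -> Point:
    """Toy pointer: blend the current frame's features with past frames
    (a crude stand-in for the cross-attention temporal module) and emit
    a single (x, y) coordinate."""
    context = frame_feat[:]
    for past in memory:  # stand-in for attending over historical context
        context = [0.5 * c + 0.5 * p for c, p in zip(context, past)]
    return (context[0], context[1])

def masks_from_points(points: List[Point]) -> List[Set[Tuple[int, int]]]:
    """Stand-in for SAM2: each point seeds a mask; bidirectional
    propagation is mimicked by merging each mask with its neighbors' seeds."""
    seeds = [{(round(x), round(y))} for x, y in points]
    masks = []
    for i, seed in enumerate(seeds):
        mask = set(seed)
        if i > 0:
            mask |= seeds[i - 1]          # backward propagation
        if i < len(seeds) - 1:
            mask |= seeds[i + 1]          # forward propagation
        masks.append(mask)
    return masks

def ground_video(frames: List[List[float]]) -> List[Set[Tuple[int, int]]]:
    """Phase 1 (pointing with temporal context), then phase 2 (masks)."""
    memory: List[List[float]] = []
    points: List[Point] = []
    for feat in frames:
        points.append(point_with_temporal_context(feat, memory))
        memory.append(feat)
    return masks_from_points(points)
```

The key design point the sketch mirrors is the separation of concerns: coordinate prediction stays with the language model, while dense mask quality and temporal smoothing are delegated to the segmentation backend.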
Critically, VideoMolmo is trained on a newly curated dataset of varied video-caption pairs, supporting the model's capacity for generalized reasoning. Complementing this dataset, the authors release VPoS-Bench, a challenging benchmark for quantitative evaluation across diverse real-world scenarios.
The empirical analysis in the paper underscores VideoMolmo's strengths across multiple dimensions. It significantly improves spatio-temporal pointing accuracy relative to its predecessors, as demonstrated in the VPoS-Bench evaluations. Specifically, VideoMolmo outperforms baselines such as Molmo+SAM2, VideoLISA, and VideoGLaMM in precision, recall, and F1 scores, highlighting its capabilities in complex video environments. It also shows notable gains on tasks such as object counting, referring video segmentation, and reasoning segmentation, practical applications that underpin the value of its design.
Implications and Speculation on Future AI Developments
The implications of VideoMolmo's enhanced grounding capabilities are manifold. Practically, it can be integrated into domains requiring fine-grained analysis of dynamic scenes, such as autonomous driving, robotics, and assistive technology. Theoretically, it aligns with the broader trend toward models that apply complex reasoning to differentiate and interpret dense visual data.
Looking ahead, future work may focus on handling scenarios with rapid object movements, an acknowledged limitation of the current framework. Further enhancements could explore multi-point predictions per object, potentially addressing observed shortcomings in mask quality. These research directions promise to extend VideoMolmo's applicability and robustness as an AI-driven video comprehension tool.
In conclusion, VideoMolmo represents a substantial advance in fine-grained video understanding, combining temporal analysis with real-world utility. Its success invites further exploration, particularly in optimizing temporal dynamics and grounding precision across varied applications.