Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding

Published 25 Nov 2023 in cs.CV | (2311.15075v1)

Abstract: Large-scale image-language pretrained models, e.g., CLIP, have demonstrated remarkable proficiency in acquiring general multi-modal knowledge through web-scale image-text data. Despite the impressive performance of image-language models on various image tasks, how to effectively expand them on general video understanding remains an area of ongoing exploration. In this paper, we investigate the image-to-video transferring from the perspective of the model and the data, unveiling two key obstacles impeding the adaptation of image-language models: non-generalizable temporal modeling and partially misaligned video-text data. To address these challenges, we propose Spatial-Temporal Auxiliary Network with Mutual-guided alignment module (Mug-STAN), a simple yet effective framework extending image-text model to diverse video tasks and video-text data. Specifically, STAN adopts a branch structure with decomposed spatial-temporal modules to enable generalizable temporal modeling, while Mug suppresses misalignment by introducing token-wise feature aggregation of either modality from the other. Extensive experimental results verify Mug-STAN significantly improves adaptation of language-image pretrained models such as CLIP and CoCa at both video-text post-pretraining and finetuning stages. With our solution, state-of-the-art zero-shot and finetuning results on various downstream datasets, including MSR-VTT, DiDeMo, LSMDC, Kinetics-400, Something-Something-2, HMDB-51, UCF-101, and AVA, are achieved. Moreover, by integrating pretrained Mug-STAN with the emerging multimodal dialogue model, we can realize zero-shot video chatting. Codes are available at https://github.com/farewellthree/STAN

Summary

The paper presents Mug-STAN, which integrates spatial-temporal learning and mutual-guided alignment to overcome video-text misalignment.
It employs a branch structure (STAN) that reuses pretrained visual layers for efficient temporal modeling across multiple feature levels.
Empirical results demonstrate state-of-the-art performance in text-video retrieval and action recognition on diverse video benchmarks.

Overview of Mug-STAN: Adaptation of Image-Language Models for General Video Understanding

Large-scale image-language pretrained models, notably CLIP, learn general multi-modal knowledge from web-scale image-text data. Despite their success on image-centric tasks, extending such models to video understanding remains an open problem. The paper "Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding" addresses two principal barriers to that extension: temporal modeling that does not generalize across tasks, and partial misalignment between video and text data.

Methodology and Contributions

The paper introduces the Spatial-Temporal Auxiliary Network with Mutual-guided alignment module (Mug-STAN), a framework for adapting image-language models to video understanding. Its two components, STAN and Mug, address temporal modeling and video-text misalignment, respectively.

1. Spatial-Temporal Auxiliary Network (STAN):

STAN functions as a branch alongside the pretrained visual encoder, facilitating temporal learning by integrating spatial-temporal contexts at multiple levels. Unlike posterior structures, which stack temporal modules on top of the encoder, and intermediate structures, which insert them inside it, STAN's branch structure enables:

Multi-Level Feature Utilization: By leveraging features at different abstraction levels from the pretrained model, STAN captures both high-level semantic alignments and low-level spatial-temporal patterns.
Parameter-Efficient Temporal Modeling: Using a decomposed spatial-temporal design, STAN reuses the structure of the pretrained visual layers, which supports temporal modeling without disrupting the pretrained knowledge.
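
As a rough illustration of the branch idea described above, the following is a minimal NumPy sketch, not the authors' implementation. It assumes features shaped (frames, patches, dim), uses parameter-free self-attention in place of learned layers, and the names self_attention, stan_layer, and stan_branch are hypothetical. It only shows the two structural points: features from several backbone levels feed a side branch, and each branch layer applies a spatial step followed by a temporal step.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    """Parameter-free self-attention over the second-to-last axis.

    Assumption: real layers would use learned query/key/value projections.
    """
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ x


def stan_layer(x):
    """One hypothetical branch layer on features shaped (T frames, P patches, D)."""
    # Spatial step: every frame attends over its own patches.
    x = x + self_attention(x)
    # Temporal step: every patch position attends over the frames.
    xt = np.swapaxes(x, 0, 1)          # (P, T, D)
    xt = xt + self_attention(xt)
    return np.swapaxes(xt, 0, 1)       # back to (T, P, D)


def stan_branch(level_features):
    """Fuse features taken from several backbone levels in a side branch.

    The backbone outputs are only read, never modified, which mirrors the
    claim that pretrained knowledge is left undisturbed.
    """
    x = np.zeros_like(level_features[0])
    for feat in level_features:
        x = stan_layer(x + feat)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three backbone levels, 4 frames, 9 patches, 8-dim features (made-up sizes).
    levels = [rng.normal(size=(4, 9, 8)) for _ in range(3)]
    print(stan_branch(levels).shape)
```

Keeping the spatial and temporal steps separate means each attention call works over a short axis (patches or frames) rather than over all frame-patch pairs at once.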

2. Mutual-Guided Alignment (Mug):

Mug targets the prevalent partial misalignment issues in video-text datasets by:

Token-Frame Interaction Modeling: It performs token-wise interaction between frames and text, dynamically identifying and aligning the most relevant parts of the two modalities.
Feature Aggregation through Mutual Guidance: The cross-modal enhancement allows more accurate representation by amplifying corresponding segments and suppressing irrelevant noise, thus improving overall alignment.
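
The two points above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the paper's exact formulation: it assumes per-frame and per-token embeddings are already available, scores each frame by its best-matching token (and each token by its best-matching frame), and turns those scores into softmax weights. The function name mutual_guided_similarity and the temperature parameter are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def mutual_guided_similarity(frames, tokens, temperature=0.1):
    """Illustrative mutual-guided aggregation.

    frames: (T, D) per-frame embeddings; tokens: (N, D) per-token embeddings.
    Returns (score, frame_weights, token_weights).
    """
    frames = l2_normalize(frames)
    tokens = l2_normalize(tokens)
    sim = frames @ tokens.T                                # (T, N) frame-token similarity
    # Text-guided frame weights: a frame counts if some token matches it well.
    frame_w = softmax(sim.max(axis=1) / temperature)       # (T,)
    # Video-guided token weights: a token counts if some frame matches it well.
    token_w = softmax(sim.max(axis=0) / temperature)       # (N,)
    video_emb = frame_w @ frames                           # (D,)
    text_emb = token_w @ tokens                            # (D,)
    score = float(l2_normalize(video_emb) @ l2_normalize(text_emb))
    return score, frame_w, token_w


if __name__ == "__main__":
    # Toy case: only the first frame and the first token describe the same thing.
    frames = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
    tokens = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    score, frame_w, token_w = mutual_guided_similarity(frames, tokens)
    print(frame_w.round(3), token_w.round(3), round(score, 3))
```

In the toy case, the unmatched frames and the unmatched token receive weights near zero, so the aggregated embeddings are dominated by the aligned pair, whereas plain mean pooling would dilute them with unrelated content.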

Empirical Evaluation

The efficacy of Mug-STAN is demonstrated through extensive experiments across multiple video-related tasks including text-video retrieval, action recognition, and temporal action localization. Notable results include:

Superior Performance in Zero-Shot and Finetuning Settings: Mug-STAN achieves state-of-the-art results on datasets such as MSR-VTT, DiDeMo, LSMDC, Kinetics-400, and Something-Something-v2. Integrating pretrained Mug-STAN into a multimodal dialogue model further enables zero-shot video chatting.
Improved Generalization: Compared with existing models, Mug-STAN generalizes better across diverse tasks, which the paper attributes to its temporal modeling and its reduction of cross-modal misalignment.

Future Directions

The paper points to several directions for future research:

Application to Diverse V-L Pretrained Models: The flexibility and robust performance of Mug-STAN suggest potential adaptation to various V-L pretrained architectures beyond CLIP and CoCa.
Post-Pretraining on Diverse Datasets: The framework shows promise in post-pretraining settings using datasets with varying noise levels, such as WebVid10M and HowTo100M.
Integration with Multimodal Architectures: Leveraging STAN’s capabilities in video temporal modeling could facilitate enhanced integration in larger multimodal LLM systems.

In summary, Mug-STAN addresses the two core challenges in extending image-language models to video tasks: STAN supplies generalizable temporal modeling through a branch structure, and Mug reduces video-text misalignment through token-wise cross-modal aggregation. Together they improve the adaptation of models such as CLIP and CoCa across retrieval, recognition, and localization benchmarks.