ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning

Published 27 Jun 2022 in cs.CV (arXiv:2206.13559v3)

Abstract: Capitalizing on large pre-trained models for various downstream tasks of interest has recently emerged as a promising approach. Due to ever-growing model sizes, the standard full fine-tuning based task adaptation strategy becomes prohibitively costly in terms of model training and storage. This has led to a new research direction: parameter-efficient transfer learning. However, existing attempts typically focus on downstream tasks from the same modality as the pre-trained model (e.g., image understanding). This is limiting because in some modalities (e.g., video understanding), a strong pre-trained model with sufficient knowledge is scarce or unavailable. In this work, we investigate such a novel cross-modality transfer learning setting, namely parameter-efficient image-to-video transfer learning. To solve this problem, we propose a new Spatio-Temporal Adapter (ST-Adapter) for parameter-efficient fine-tuning per video task. With a built-in spatio-temporal reasoning capability in a compact design, ST-Adapter enables a pre-trained image model without temporal knowledge to reason about dynamic video content at a small (~8%) per-task parameter cost, requiring approximately 20 times fewer updated parameters than previous work. Extensive experiments on video action recognition tasks show that our ST-Adapter can match or even outperform the strong full fine-tuning strategy and state-of-the-art video models, while enjoying the advantage of parameter efficiency. The code and model are available at https://github.com/linziyi96/st-adapter


Summary

  • The paper introduces ST-Adapter, a compact module that adapts pre-trained image models to video tasks while updating only about 8% of parameters per task.
  • It integrates a depth-wise 3D convolution into a bottleneck structure, providing effective spatio-temporal reasoning with minimal computational overhead.
  • Experiments show that ST-Adapter achieves on-par or superior performance compared to full fine-tuning, making it ideal for resource-limited applications.
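The parameter-efficiency claim above comes down to freezing the pre-trained backbone and training only the inserted adapter weights. A minimal sketch of that setup, where the module names and layer sizes are purely illustrative stand-ins (not the paper's actual architecture):

```python
import torch.nn as nn

# Illustrative stand-ins: a two-layer "backbone" for the frozen image
# model and a single linear layer for the trainable adapter.
backbone = nn.Sequential(nn.Linear(768, 768), nn.Linear(768, 768))
adapter = nn.Linear(768, 64)

# Freeze every backbone parameter; only the adapter receives gradients,
# so each downstream task stores just the adapter weights.
for p in backbone.parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in adapter.parameters())
total = trainable + sum(p.numel() for p in backbone.parameters())
print(f"trainable fraction: {trainable / total:.1%}")
```

In practice the optimizer is then built only over `adapter.parameters()`, so both training memory and per-task checkpoint size scale with the adapter, not the backbone.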

Overview of "ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning"

The paper "ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning" explores an approach to improving video understanding by efficiently adapting large pre-trained image models to video tasks. The adaptability of large pre-trained models is increasingly well recognized, yet the cost of full fine-tuning for each task poses significant challenges, particularly in cross-modality scenarios such as transferring knowledge from image to video tasks. This paper introduces the Spatio-Temporal Adapter (ST-Adapter), designed to enable efficient fine-tuning that balances parameter cost against performance.

Key Contributions

  1. Problem Addressing in Cross-Modality Transfer: The paper addresses the challenge of adapting large image-based models to video tasks without incurring the high computational cost associated with full-model fine-tuning. The proposed ST-Adapter fills the gap by enabling efficient image-to-video transfer learning, facilitating the use of strong pre-trained image models for video tasks, notably action recognition.
  2. Spatio-Temporal Adapter Architecture: The ST-Adapter is introduced as a compact module that embeds spatio-temporal reasoning ability into existing large image models. It integrates a depth-wise 3D convolution into a bottleneck structure, enabling effective temporal modeling with minimal additional parameters. This novel approach ensures that only a small fraction of the model's parameters need to be updated for each downstream video task, achieving significant reduction in parameter costs (~8% per task).
  3. Experimental Validation and Benchmarking: The paper presents comprehensive experimental validation across multiple video action recognition benchmarks, demonstrating that the ST-Adapter matches or exceeds the performance of both full fine-tuning and state-of-the-art video models. Notably, the ST-Adapter outperforms other parameter-efficient alternatives as well as fully fine-tuned models, while incurring lower parameter and training costs.
  4. Implications for Real-World Applications: The proposed method is particularly relevant for practical applications where computational resources are limited, offering a scalable and resource-efficient alternative to full model training. The ST-Adapter's design is straightforward, leveraging common operators, which facilitates easy implementation and scalable deployment across various platforms.
  5. Future Directions in AI: This research underscores the importance of parameter-efficient transfer learning as foundational models grow in size and complexity. It paves the way for future explorations into cross-modality learning, highlighting the potential to leverage existing powerful models in modalities where equivalent pre-trained models may not be available.
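The bottleneck design described in contribution 2 can be sketched in PyTorch as follows. The dimensions, kernel shape, and activation placement here are illustrative assumptions, not the authors' exact implementation (see their repository for that):

```python
import torch
import torch.nn as nn

class STAdapter(nn.Module):
    """Sketch of a spatio-temporal adapter: a bottleneck with a
    depth-wise 3D convolution between down- and up-projections.
    Hyperparameters (bottleneck width, kernel size) are illustrative."""

    def __init__(self, dim=768, bottleneck=384, kernel=(3, 1, 1)):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # project D -> d
        self.conv = nn.Conv3d(
            bottleneck, bottleneck, kernel,
            padding=tuple(k // 2 for k in kernel),
            groups=bottleneck,                   # depth-wise: one filter per channel
        )
        self.up = nn.Linear(bottleneck, dim)     # project d -> D
        self.act = nn.GELU()

    def forward(self, x, t, h, w):
        # x: (B, T*H*W, D) spatio-temporal tokens from a frozen image backbone
        b, n, _ = x.shape
        residual = x
        x = self.down(x)                                    # (B, N, d)
        x = x.view(b, t, h, w, -1).permute(0, 4, 1, 2, 3)   # (B, d, T, H, W)
        x = self.conv(x)                                    # temporal mixing
        x = x.permute(0, 2, 3, 4, 1).reshape(b, n, -1)      # back to tokens
        x = self.up(self.act(x))
        return x + residual   # residual path keeps the module near-identity at init
```

Because the 3D convolution is depth-wise and operates in the reduced `bottleneck` dimension, the added parameter and compute cost stays small relative to the frozen backbone.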

Conclusion

The ST-Adapter presents a significant advancement in the domain of parameter-efficient transfer learning by effectively enabling cross-modality knowledge transfer from image to video understanding tasks. This work contributes to optimizing computational resources while maintaining high performance levels, signifying a promising direction for the deployment and scalability of AI models in multimedia and action recognition applications.
