Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning
Abstract: Large pre-trained vision models achieve impressive success in computer vision. However, fully fine-tuning large models for downstream tasks, particularly in video understanding, can be prohibitively expensive in computation. Recent studies have therefore turned to efficient image-to-video transfer learning. Nevertheless, existing efficient fine-tuning methods pay little attention to training memory usage and to transferring larger models to the video domain. In this paper, we present a novel Spatial-Temporal Side Network, named Side4Video, for memory-efficient fine-tuning of large image models for video understanding. Specifically, we introduce a lightweight spatial-temporal side network attached to the frozen vision model, which avoids backpropagation through the heavy pre-trained model while utilizing multi-level spatial features from the original image model. This extremely memory-efficient architecture reduces memory usage by 75% compared with previous adapter-based methods. In this way, we can transfer a huge ViT-E (4.4B parameters), 14x larger than ViT-L (304M), to video understanding tasks. Our approach achieves remarkable performance on various video datasets across unimodal and cross-modal tasks (i.e., action recognition and text-video retrieval), especially on Something-Something V1&V2 (67.3% & 74.6%), Kinetics-400 (88.6%), MSR-VTT (52.3%), MSVD (56.1%) and VATEX (68.8%). We release our code at https://github.com/HJYao00/Side4Video.
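The memory saving described above comes from a structural choice: the trainable side branch sits next to the frozen backbone and reads its intermediate features through detached connections, so backpropagation never enters the large pre-trained model and its activations need not be kept for the backward pass. The following PyTorch sketch illustrates this idea under toy dimensions; all module names and sizes here are hypothetical illustrations, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class TinyBackboneBlock(nn.Module):
    """Stand-in for one block of the frozen pre-trained vision model."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return x + self.proj(x)

class SideBlock(nn.Module):
    """Lightweight side block that fuses one level of backbone features."""
    def __init__(self, dim, side_dim):
        super().__init__()
        self.down = nn.Linear(dim, side_dim)  # project backbone feature down
        self.mix = nn.Linear(side_dim, side_dim)

    def forward(self, side_x, backbone_feat):
        # detach(): gradients never flow back into the frozen backbone,
        # so no backbone activations are stored for the backward pass
        return self.mix(side_x + self.down(backbone_feat.detach()))

class SideNetwork(nn.Module):
    def __init__(self, dim=64, side_dim=16, depth=4, num_classes=10):
        super().__init__()
        self.backbone = nn.ModuleList(TinyBackboneBlock(dim) for _ in range(depth))
        for p in self.backbone.parameters():
            p.requires_grad_(False)  # the pre-trained model stays frozen
        self.side = nn.ModuleList(SideBlock(dim, side_dim) for _ in range(depth))
        self.inp = nn.Linear(dim, side_dim)
        self.head = nn.Linear(side_dim, num_classes)

    def forward(self, x):
        side_x = self.inp(x)
        feat = x
        for bb, sb in zip(self.backbone, self.side):
            feat = bb(feat)            # multi-level spatial features
            side_x = sb(side_x, feat)  # fused into the side branch
        return self.head(side_x.mean(dim=1))

model = SideNetwork()
out = model(torch.randn(2, 8, 64))  # (batch, tokens, dim)
out.sum().backward()
# only side-network parameters receive gradients
assert all(p.grad is None for p in model.backbone.parameters())
assert all(p.grad is not None for p in model.side.parameters())
```

Because the side branch is much narrower than the backbone and only its own activations are cached, training memory scales with the small side network rather than with the large frozen model.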