Spatio-temporal Prompting Network for Robust Video Feature Extraction
Abstract: Frame quality deterioration is one of the main challenges in video understanding. To compensate for the information lost in deteriorated frames, recent approaches use transformer-based integration modules to gather spatio-temporal information. However, these integration modules are heavy and complex. Furthermore, each is tailored to its target task, which makes it difficult to generalise across tasks. In this paper, we present a neat and unified framework, called Spatio-Temporal Prompting Network (STPN). It efficiently extracts robust and accurate video features by dynamically adjusting the input features of the backbone network. Specifically, STPN predicts several video prompts containing spatio-temporal information from neighbouring frames. These video prompts are then prepended to the patch embeddings of the current frame to form the updated input for video feature extraction. Moreover, STPN generalises easily to various video tasks because it contains no task-specific modules. Without bells and whistles, STPN achieves state-of-the-art performance on three widely used datasets for different video understanding tasks, i.e., ImageNetVID for video object detection, YouTubeVIS for video instance segmentation, and GOT-10k for visual object tracking. Code is available at https://github.com/guanxiongsun/vfe.pytorch.
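The prompting step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify how the prompts are predicted, so the predictor here (mean pooling over neighbouring frames followed by one linear projection) is an assumption, as are the names `predict_video_prompts`, `prepend_prompts`, and all shapes. Only the final step, prepending the prompts to the current frame's patch embeddings, follows the abstract directly.

```python
import numpy as np


def predict_video_prompts(neighbour_embeddings, weight, num_prompts):
    """Hypothetical prompt predictor (an assumption, not the paper's design).

    neighbour_embeddings: (T, N, D) patch embeddings of T neighbouring frames.
    weight: (D, num_prompts * D) projection matrix.
    Returns an array of shape (num_prompts, D).
    """
    # Summarise all neighbouring frames into one D-dimensional vector.
    pooled = neighbour_embeddings.mean(axis=(0, 1))
    # Project the summary to num_prompts prompt tokens of width D.
    return (pooled @ weight).reshape(num_prompts, -1)


def prepend_prompts(prompts, patch_embeddings):
    """Prepend prompt tokens (P, D) to the current frame's patches (N, D)."""
    return np.concatenate([prompts, patch_embeddings], axis=0)


# Illustrative sizes only: 4 neighbouring frames, 196 patches, width 8, 3 prompts.
rng = np.random.default_rng(0)
T, N, D, P = 4, 196, 8, 3
neighbours = rng.standard_normal((T, N, D))
current = rng.standard_normal((N, D))
weight = rng.standard_normal((D, P * D))

prompts = predict_video_prompts(neighbours, weight, P)
tokens = prepend_prompts(prompts, current)  # updated backbone input, (P + N, D)
```

In this sketch the backbone would consume `tokens` in place of `current`, so the sequence grows by only `P` tokens and no task-specific module is involved.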