ZJU ReLER Submission for EPIC-KITCHEN Challenge 2023: Semi-Supervised Video Object Segmentation
Abstract: The Associating Objects with Transformers (AOT) framework has exhibited exceptional performance in a wide range of complex scenarios for video object segmentation. In this study, we introduce MSDeAOT, a variant of the AOT series that incorporates transformers at multiple feature scales. Leveraging the hierarchical Gated Propagation Module (GPM), MSDeAOT efficiently propagates object masks from previous frames to the current frame using a feature scale with a stride of 16. Additionally, we employ GPM in a more refined feature scale with a stride of 8, leading to improved accuracy in detecting and tracking small objects. Through the implementation of test-time augmentations and model ensemble techniques, we achieve the top-ranking position in the EPIC-KITCHEN VISOR Semi-supervised Video Object Segmentation Challenge.
- Global contrast based salient region detection. IEEE transactions on pattern analysis and machine intelligence, 37(3):569–582, 2014.
- Segment and track anything. arXiv preprint arXiv:2305.06558, 2023.
- Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision (IJCV), 130:33–55, 2022.
- Epic-kitchens visor benchmark: Video segmentations and object relations. In Proceedings of the Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks, 2022.
- The pascal visual object classes (voc) challenge. International journal of computer vision, 88:303–338, 2010.
- Semantic contours from inverse detectors. In 2011 international conference on computer vision, pages 991–998. IEEE, 2011.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Unified mask embedding and correspondence learning for self-supervised video segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18706–18716, 2023.
- Locality-aware inter-and intra-video reconstruction for self-supervised correspondence learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8719–8730, 2022.
- Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
- Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
- Fast video object segmentation by reference-guided mask propagation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7376–7385, 2018.
- The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.
- Hierarchical image saliency detection on extended cssd. IEEE transactions on pattern analysis and machine intelligence, 38(4):717–729, 2015.
- Pixel-level matching for video object segmentation using convolutional neural networks. In Proceedings of the IEEE international conference on computer vision, pages 2167–2176, 2017.
- Breaking the” object” in video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22836–22845, 2023.
- Feelvos: Fast end-to-end embedding learning for video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9481–9490, 2019.
- Youtube-vos: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327, 2018.
- Collaborative video object segmentation by foreground-background integration. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V, pages 332–348. Springer, 2020.
- Associating objects with transformers for video object segmentation. Advances in Neural Information Processing Systems, 34:2491–2502, 2021.
- Collaborative video object segmentation by multi-scale foreground-background integration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):4701–4712, 2021.
- Decoupling features in hierarchical propagation for video object segmentation. In Advances in Neural Information Processing Systems.
- A survey on deep learning technique for video segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):7099–7122, 2022.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.