ZJU ReLER Submission for EPIC-KITCHEN Challenge 2023: TREK-150 Single Object Tracking
Abstract: The Associating Objects with Transformers (AOT) framework has exhibited exceptional performance in a wide range of complex scenarios for video object tracking and segmentation. In this study, we convert the bounding boxes in reference frames to masks with the help of the Segment Anything Model (SAM) and Alpha-Refine, and then propagate the masks to the current frame, transforming the task from video object tracking (VOT) to video object segmentation (VOS). Furthermore, we introduce MSDeAOT, a variant of the AOT series that incorporates transformers at multiple feature scales. MSDeAOT efficiently propagates object masks from previous frames to the current frame using two feature scales of 16 and 8. As a testament to the effectiveness of our design, we achieved 1st place in the EPIC-KITCHENS TREK-150 Object Tracking Challenge.
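To make the box-to-mask and mask-to-box bookkeeping concrete, here is a minimal sketch of the two conversions that a VOT-as-VOS pipeline of this kind would plausibly need. It is not the authors' code. The `(x, y, w, h)` box convention, the helper names `box_to_rect_mask` and `mask_to_box`, and the step of mapping a propagated mask back to a box for box-based evaluation are all assumptions made for illustration. The rectangular fill is only a crude placeholder for the SAM and Alpha-Refine step, which would produce a proper object mask.

```python
from typing import List, Optional, Tuple

# Assumed conventions (not taken from the paper):
Box = Tuple[int, int, int, int]  # (x, y, w, h): top-left corner plus size
Mask = List[List[int]]           # binary mask, mask[row][col] in {0, 1}


def box_to_rect_mask(box: Box, height: int, width: int) -> Mask:
    """Fill the box with ones. Placeholder for the SAM / Alpha-Refine step."""
    x, y, w, h = box
    return [
        [1 if (x <= c < x + w and y <= r < y + h) else 0 for c in range(width)]
        for r in range(height)
    ]


def mask_to_box(mask: Mask) -> Optional[Box]:
    """Return the tightest (x, y, w, h) box around the foreground, or None if empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None  # no foreground pixels, e.g. the object left the view
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    x0, x1 = cols[0], cols[-1]
    y0, y1 = rows[0], rows[-1]
    return (x0, y0, x1 - x0 + 1, y1 - y0 + 1)


# Round trip on a toy 5x8 frame: reference box -> mask -> box.
init_box = (2, 1, 3, 2)
mask = box_to_rect_mask(init_box, height=5, width=8)
recovered = mask_to_box(mask)
```

In a real pipeline the mask passed to `mask_to_box` would be the one propagated to the current frame by the segmentation model, not the rectangle built here. The `None` return leaves the handling of an empty prediction to the caller.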