A Good Foundation is Worth Many Labels: Label-Efficient Panoptic Segmentation
Abstract: A key challenge for the widespread application of learning-based models in robotic perception is to significantly reduce the required amount of annotated training data while achieving accurate predictions. This is essential not only to decrease operating costs but also to shorten deployment time. In this work, we address this challenge for PAnoptic SegmenTation with fEw Labels (PASTEL) by exploiting the groundwork laid by visual foundation models. We leverage descriptive image features from such a model to train two lightweight network heads for semantic segmentation and object boundary detection, using very few annotated training samples. We then merge their predictions via a novel fusion module that yields panoptic maps based on normalized cut. To further enhance performance, we utilize self-training on unlabeled images selected by a feature-driven similarity scheme. We underline the relevance of our approach by applying PASTEL to important robot perception use cases from autonomous driving and agricultural robotics. In extensive experiments, we demonstrate that PASTEL significantly outperforms previous methods for label-efficient segmentation, even when using fewer annotations. The code of our work is publicly available at http://pastel.cs.uni-freiburg.de.
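The pipeline the abstract describes lends itself to a compact sketch. The following minimal PyTorch example illustrates the two-head design on top of a frozen DINOv2 backbone; the torch.hub entry point is the public DINOv2 release, while the head widths, input resolution, and class count are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

# Frozen DINOv2 backbone; only the two lightweight heads are trained.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

EMBED_DIM = 768      # ViT-B/14 token width
NUM_CLASSES = 19     # e.g., the Cityscapes label set (an assumption)

# Lightweight per-patch heads; sizes are illustrative, not the paper's.
semantic_head = nn.Sequential(
    nn.Linear(EMBED_DIM, 256), nn.GELU(), nn.Linear(256, NUM_CLASSES))
boundary_head = nn.Sequential(
    nn.Linear(EMBED_DIM, 256), nn.GELU(), nn.Linear(256, 1))

x = torch.randn(1, 3, 518, 518)  # H and W must be multiples of the 14-px patch
with torch.no_grad():
    tokens = backbone.forward_features(x)["x_norm_patchtokens"]  # (1, 37*37, 768)
sem_logits = semantic_head(tokens)   # (1, N, NUM_CLASSES): per-patch semantics
bnd_logits = boundary_head(tokens)   # (1, N, 1): per-patch boundary score
```

The feature-driven similarity scheme for choosing self-training images can be read as ranking unlabeled images by the cosine similarity of their global embeddings (e.g., the backbone's CLS tokens) to the labeled set and keeping the top-k. The function below is one plausible reading of that scheme, not the paper's exact criterion.

```python
import torch
import torch.nn.functional as F

def select_for_self_training(labeled: torch.Tensor,
                             unlabeled: torch.Tensor,
                             k: int) -> torch.Tensor:
    """Pick the k unlabeled images whose global features best match the labeled set.

    labeled: (L, D) and unlabeled: (U, D) image-level embeddings,
    e.g., CLS tokens from the frozen backbone (an assumed choice).
    """
    sim = F.normalize(unlabeled, dim=1) @ F.normalize(labeled, dim=1).T  # (U, L)
    score = sim.max(dim=1).values   # similarity to the closest labeled image
    return score.topk(k).indices    # indices of the k best candidates
```

Finally, the fusion module merges the two heads' predictions into panoptic maps via normalized cut (Shi and Malik, 2000). A textbook bipartition step is sketched below, given some symmetric affinity matrix W over image regions; how W is actually constructed from the semantic and boundary predictions is the paper's contribution and is not reproduced here.

```python
import numpy as np
from scipy.linalg import eigh

def ncut_bipartition(W: np.ndarray) -> np.ndarray:
    """Split N regions into two groups via the generalized Fiedler vector."""
    d = W.sum(axis=1)
    D = np.diag(d)                  # degree matrix (assumes positive degrees)
    L = D - W                       # unnormalized graph Laplacian
    # Solve L y = lambda D y; eigh returns eigenvalues in ascending order,
    # so column 1 holds the second-smallest (Fiedler) generalized eigenvector.
    _, vecs = eigh(L, D)
    return vecs[:, 1] > 0           # boolean partition of the N regions
```

Applying such a bipartition recursively, guided by the boundary head, would split the scene into instance segments that the semantic head then labels; this recursive use is an inference from the abstract, not a confirmed detail.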