MWSIS: Multimodal Weakly Supervised Instance Segmentation with 2D Box Annotations for Autonomous Driving

Published 12 Dec 2023 in cs.CV | (2312.06988v4)

Abstract: Instance segmentation is a fundamental research in computer vision, especially in autonomous driving. However, manual mask annotation for instance segmentation is quite time-consuming and costly. To address this problem, some prior works attempt to apply weakly supervised manner by exploring 2D or 3D boxes. However, no one has ever successfully segmented 2D and 3D instances simultaneously by only using 2D box annotations, which could further reduce the annotation cost by an order of magnitude. Thus, we propose a novel framework called Multimodal Weakly Supervised Instance Segmentation (MWSIS), which incorporates various fine-grained label generation and correction modules for both 2D and 3D modalities to improve the quality of pseudo labels, along with a new multimodal cross-supervision approach, named Consistency Sparse Cross-modal Supervision (CSCS), to reduce the inconsistency of multimodal predictions by response distillation. Particularly, transferring the 3D backbone to downstream tasks not only improves the performance of the 3D detectors, but also outperforms fully supervised instance segmentation with only 5% fully supervised annotations. On the Waymo dataset, the proposed framework demonstrates significant improvements over the baseline, especially achieving 2.59% mAP and 12.75% mAP increases for 2D and 3D instance segmentation tasks, respectively. The code is available at https://github.com/jiangxb98/mwsis-plugin.

Summary

  • The paper introduces MWSIS, a framework that uses 2D box annotations to guide both 2D and 3D instance segmentation.
  • It employs fine-grained label correction modules and a cross-modal supervision method to enhance weakly supervised training.
  • Evaluations on the Waymo dataset show gains over the baseline of 2.59% mAP for 2D and 12.75% mAP for 3D instance segmentation, using only 2D box annotations.

Introduction

Autonomous driving relies heavily on visual interpretation of the surrounding environment. Instance segmentation is a critical task in this domain: it separates individual objects, such as cars, pedestrians, and cyclists, from the background. Its main burden is the need for precisely annotated data, which is costly and laborious to produce. In practice, manual mask annotation at pixel-level precision is a daunting task.

To address this, many studies adopt weakly supervised methods that rely on simpler annotations, such as bounding boxes, which are less precise but easier to obtain. Weak supervision, however, often yields lower-quality training signals and therefore poorer model performance.

Multimodal Weak Supervision

To reduce the cost and labor of manual annotation, the paper introduces Multimodal Weakly Supervised Instance Segmentation (MWSIS). The framework uses only 2D box annotations to guide the training of a 2D and a 3D instance segmentor simultaneously. MWSIS incorporates fine-grained label generation and correction modules for each modality, along with a new multimodal cross-supervision approach that reduces the inconsistency between the 2D and 3D predictions.
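One way to picture how 2D boxes can supervise a 3D segmentor is to assign each LiDAR point, once projected into the image, to a box that contains it. The sketch below is illustrative only and is not taken from the MWSIS code: the function name `points_in_boxes`, the assumption that points are already projected to pixel coordinates, and the smallest-box tie-break for overlapping boxes are all hypothetical choices.

```python
from typing import List, Sequence, Tuple

# A 2D box as (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def points_in_boxes(pixels: Sequence[Tuple[float, float]],
                    boxes: Sequence[Box]) -> List[int]:
    """For each projected point (u, v), return the index of the smallest
    box that contains it, or -1 if no box contains it.

    Preferring the smallest box is an assumed heuristic for overlaps,
    not a rule stated in the paper.
    """
    labels = []
    for u, v in pixels:
        best, best_area = -1, float("inf")
        for i, (x0, y0, x1, y1) in enumerate(boxes):
            if x0 <= u <= x1 and y0 <= v <= y1:
                area = (x1 - x0) * (y1 - y0)
                if area < best_area:
                    best, best_area = i, area
        labels.append(best)
    return labels
```

Points labeled -1 can be treated as background. Points inside a box are only candidates for that instance, because a box also covers background behind the object, which is why label generation and correction modules are needed.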

Fine-Grained Label Correction Modules

The MWSIS framework consists of several key components that facilitate the training of more accurate models under weak supervision:

  • Instance-based Pseudo Mask Generation (IPG) Module: uses predictions for self-supervised correction in 2D pseudo label generation.
  • Spatial-based Pseudo Label Generation (SPG) Module: exploits spatial prior information from the point cloud to generate better 3D pseudo labels.
  • Point-based Voting Label Correction (PVC) Module: employs historical predictions for further refining the generated pseudo labels.
  • Ring Segment-based Label Correction (RSC) Module: utilizes the depth information of the point cloud to refine predictions.
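The idea of using historical predictions to refine pseudo labels can be sketched as a per-point vote over the predictions saved from earlier training epochs. This is a minimal illustration under stated assumptions, not the paper's PVC implementation: the function name `vote_correct`, the majority-vote rule, and the `min_agreement` threshold are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def vote_correct(pseudo: Sequence[int],
                 history: Sequence[Sequence[int]],
                 min_agreement: float = 0.7) -> List[int]:
    """Replace a point's pseudo label with the label most often predicted
    across past epochs, but only when that label's share of the votes
    reaches min_agreement (an assumed threshold).

    pseudo:  current pseudo label for each point
    history: one list of per-point predicted labels for each past epoch
    """
    corrected = list(pseudo)
    if not history:
        return corrected
    for p in range(len(pseudo)):
        votes = Counter(epoch[p] for epoch in history)
        label, count = votes.most_common(1)[0]
        if count / len(history) >= min_agreement:
            corrected[p] = label
    return corrected
```

A higher `min_agreement` makes the correction more conservative: labels change only where the model has been consistent over time.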

Cross-modal Supervision

The cross-modal supervision method, Consistency Sparse Cross-modal Supervision (CSCS), exploits the complementary properties of the image and point cloud modalities to improve segmentor performance. CSCS reduces the inconsistency of multimodal predictions through response distillation between the 2D and 3D predictions.
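A consistency term between the two modalities can be sketched as a divergence between the foreground probabilities that the 2D and 3D segmentors predict at corresponding locations. The example below is an assumed stand-in, not the CSCS loss from the paper: the symmetric KL divergence on Bernoulli probabilities, the `valid` mask marking points that have a pixel correspondence, and both function names are hypothetical.

```python
import math
from typing import Sequence


def bernoulli_kl(p: float, q: float, eps: float = 1e-7) -> float:
    """KL divergence KL(Bernoulli(p) || Bernoulli(q)), with clamping
    to avoid log(0)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def sparse_consistency_loss(prob_2d: Sequence[float],
                            prob_3d: Sequence[float],
                            valid: Sequence[bool]) -> float:
    """Mean symmetric KL between 2D and 3D foreground probabilities,
    computed only at points flagged valid (the "sparse" part: only LiDAR
    points that project onto the image contribute)."""
    terms = [0.5 * (bernoulli_kl(a, b) + bernoulli_kl(b, a))
             for a, b, ok in zip(prob_2d, prob_3d, valid) if ok]
    return sum(terms) / len(terms) if terms else 0.0
```

The loss is zero when both branches agree at every valid point and grows as their predictions diverge, so minimizing it pushes the two segmentors toward consistent outputs.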

Evaluation and Results

MWSIS was evaluated on the Waymo Open Dataset, where it improves over the baseline by 2.59% mAP on 2D instance segmentation and 12.75% mAP on 3D instance segmentation. The code is publicly available at https://github.com/jiangxb98/mwsis-plugin.

Impact and Advancements

MWSIS is competitive with fully supervised methods while requiring far less annotation. The authors report that transferring the 3D backbone to downstream tasks improves the performance of 3D detectors and, with only 5% of the fully supervised annotations, outperforms fully supervised instance segmentation. This suggests the framework can both reduce annotation load and serve as a pre-training method for downstream tasks.
