MWSIS: Multimodal Weakly Supervised Instance Segmentation with 2D Box Annotations for Autonomous Driving

Published 12 Dec 2023 in cs.CV | (2312.06988v4)

Abstract: Instance segmentation is a fundamental research topic in computer vision, especially in autonomous driving. However, manual mask annotation for instance segmentation is quite time-consuming and costly. To address this problem, some prior works apply weak supervision using 2D or 3D boxes. However, no one has yet successfully segmented 2D and 3D instances simultaneously using only 2D box annotations, which could further reduce the annotation cost by an order of magnitude. Thus, we propose a novel framework called Multimodal Weakly Supervised Instance Segmentation (MWSIS), which incorporates various fine-grained label generation and correction modules for both 2D and 3D modalities to improve the quality of pseudo labels, along with a new multimodal cross-supervision approach, named Consistency Sparse Cross-modal Supervision (CSCS), to reduce the inconsistency of multimodal predictions by response distillation. In particular, transferring the 3D backbone to downstream tasks not only improves the performance of the 3D detectors, but also outperforms fully supervised instance segmentation with only 5% fully supervised annotations. On the Waymo dataset, the proposed framework demonstrates significant improvements over the baseline, achieving 2.59% mAP and 12.75% mAP increases for 2D and 3D instance segmentation tasks, respectively. The code is available at https://github.com/jiangxb98/mwsis-plugin.

Summary

The paper introduces MWSIS, a framework that uses 2D box annotations to guide both 2D and 3D instance segmentation.
It employs fine-grained label correction modules and a cross-modal supervision method to enhance weakly supervised training.
Evaluations on the Waymo dataset show gains of 2.59% mAP for 2D and 12.75% mAP for 3D instance segmentation over the baseline, reducing annotation costs while approaching fully supervised performance.

Introduction

Autonomous driving relies heavily on visually interpreting the surrounding environment. Within this domain, instance segmentation is a critical task: it separates individual objects, such as cars, pedestrians, and cyclists, from the background and from one another. The main obstacle is the need for precisely annotated data, which is costly and laborious to produce. In practice, annotating instance masks by hand at pixel-level precision is prohibitively slow at scale.

To address this challenge, many studies have adopted weakly supervised methods that rely on simpler annotations, such as bounding boxes, which are less precise but much easier to obtain. Weak supervision, however, often yields lower-quality training signals and therefore weaker model performance.

Multimodal Weak Supervision

To reduce the cost and labor of manual annotation, the paper introduces a framework called Multimodal Weakly Supervised Instance Segmentation (MWSIS). The framework uses only 2D box annotations to guide the training of both a 2D and a 3D instance segmentor at the same time. MWSIS incorporates a suite of fine-grained label generation and correction modules for each modality, and adds a multimodal cross-supervision approach that reduces disagreement between the 2D (image) and 3D (point cloud) predictions.
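As a rough illustration of how a 2D box can act as a weak label for 3D points, the sketch below assigns each LiDAR point, already projected into the image, to the box that contains it. This is not the authors' code: the function name `assign_points_to_boxes`, the `(x_min, y_min, x_max, y_max)` box layout, and the `-1`/`-2` sentinel values are assumptions made for the example, and the projection itself (camera calibration) is assumed to have happened beforehand.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
Pixel = Tuple[float, float]              # (u, v) image position of a projected point


def assign_points_to_boxes(pixels: List[Pixel], boxes: List[Box]) -> List[int]:
    """Return, for each projected point, the index of the 2D box containing it.

    -1 marks background (inside no box). -2 marks an ambiguous point that
    falls inside several overlapping boxes and needs later correction.
    """
    labels: List[int] = []
    for u, v in pixels:
        hits = [
            i
            for i, (x0, y0, x1, y1) in enumerate(boxes)
            if x0 <= u <= x1 and y0 <= v <= y1
        ]
        if not hits:
            labels.append(-1)
        elif len(hits) == 1:
            labels.append(hits[0])
        else:
            labels.append(-2)
    return labels
```

Labels produced this way are coarse: a box also covers background points behind or beside the object, which is the kind of noise that label generation and correction modules are meant to reduce.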

Fine-Grained Label Correction Modules

The MWSIS framework consists of several key components that facilitate the training of more accurate models under weak supervision:

Instance-based Pseudo Mask Generation (IPG) Module: uses predictions for self-supervised correction in 2D pseudo label generation.
Spatial-based Pseudo Label Generation (SPG) Module: exploits spatial prior information from the point cloud to generate better 3D pseudo labels.
Point-based Voting Label Correction (PVC) Module: employs historical predictions for further refining the generated pseudo labels.
Ring Segment-based Label Correction (RSC) Module: utilizes the depth information of the point cloud to refine predictions.
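To make the idea of refining pseudo labels from historical predictions more concrete, here is a minimal sketch in the spirit of the PVC module. The summary above only states that historical predictions are used; the majority-vote rule, the `min_agreement` threshold, and the function name `vote_labels` are assumptions for illustration, not the paper's actual procedure.

```python
from collections import Counter
from typing import List, Sequence


def vote_labels(
    history: Sequence[Sequence[int]],
    current: Sequence[int],
    min_agreement: float = 0.6,
) -> List[int]:
    """Refine per-point pseudo labels by voting over earlier predictions.

    history: one list of per-point labels for each earlier epoch
             (hypothetical storage format).
    current: the pseudo label currently assigned to each point.
    A label is overwritten only when one class receives at least
    `min_agreement` of the historical votes; otherwise it is kept.
    """
    refined = list(current)
    if not history:
        return refined
    for idx in range(len(current)):
        votes = Counter(epoch[idx] for epoch in history)
        label, count = votes.most_common(1)[0]
        if count / len(history) >= min_agreement:
            refined[idx] = label
    return refined
```

For example, with three stored epochs, a point predicted as class 1 every time is relabeled to 1, while a point whose predictions disagree across epochs keeps its current pseudo label.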

In addition to these modules, MWSIS introduces a cross-modal supervision method named Consistency Sparse Cross-modal Supervision (CSCS), which exploits the complementary properties of the image and point cloud modalities to improve both segmentors. CSCS reduces the inconsistency between the 2D and 3D predictions through response distillation, that is, by training each model's output predictions to agree with the other's.
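One way to picture a response-distillation consistency term is a symmetric KL divergence between the class distributions that the 2D and 3D models predict for the same physical location. The sketch below is an assumption-laden illustration, not the CSCS loss from the paper: the symmetric KL form, the function names, and the pairing of each LiDAR point with the pixel it projects onto are all choices made for this example.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one list of class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-8) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_modal_consistency(
    logits_2d: Sequence[Sequence[float]],
    logits_3d: Sequence[Sequence[float]],
) -> float:
    """Mean symmetric KL over matched (pixel, point) prediction pairs.

    logits_2d[i] is the image model's output at the pixel that point i
    projects onto; logits_3d[i] is the point cloud model's output for
    point i. Only pixels hit by a point contribute, so the signal is sparse.
    """
    if not logits_2d:
        return 0.0
    total = 0.0
    for l2, l3 in zip(logits_2d, logits_3d):
        p2, p3 = softmax(l2), softmax(l3)
        total += 0.5 * (kl_divergence(p2, p3) + kl_divergence(p3, p2))
    return total / len(logits_2d)
```

The term is zero when both modalities predict identical distributions and grows as they diverge, so minimizing it pushes the two segmentors toward consistent outputs.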

Evaluation and Results

The MWSIS framework was evaluated on the Waymo Open Dataset, where it improved over the baseline by 2.59% mAP on 2D instance segmentation and 12.75% mAP on 3D instance segmentation. The code is publicly available at https://github.com/jiangxb98/mwsis-plugin, which supports reproducibility and further research.

Impact and Advancements

MWSIS narrows the gap between weakly and fully supervised instance segmentation while requiring only 2D box annotations. The authors also report that transferring the trained 3D backbone to downstream tasks improves 3D detectors and, with only 5% of the fully supervised annotations, outperforms fully supervised instance segmentation. These results point to two uses for the framework: reducing annotation load, and serving as a pre-training method for downstream tasks.
