Mask Scoring R-CNN

Published 1 Mar 2019 in cs.CV | (1903.00241v1)

Abstract: Letting a deep network be aware of the quality of its own predictions is an interesting yet important problem. In the task of instance segmentation, the confidence of instance classification is used as mask quality score in most instance segmentation frameworks. However, the mask quality, quantified as the IoU between the instance mask and its ground truth, is usually not well correlated with classification score. In this paper, we study this problem and propose Mask Scoring R-CNN which contains a network block to learn the quality of the predicted instance masks. The proposed network block takes the instance feature and the corresponding predicted mask together to regress the mask IoU. The mask scoring strategy calibrates the misalignment between mask quality and mask score, and improves instance segmentation performance by prioritizing more accurate mask predictions during COCO AP evaluation. By extensive evaluations on the COCO dataset, Mask Scoring R-CNN brings consistent and noticeable gain with different models, and outperforms the state-of-the-art Mask R-CNN. We hope our simple and effective approach will provide a new direction for improving instance segmentation. The source code of our method is available at https://github.com/zjhuang22/maskscoring_rcnn.

Citations (861)

Summary

The paper introduces a MaskIoU head that aligns each mask's score with its actual quality, measured as IoU with the ground truth.
The head fuses RoI features with the predicted mask to regress the mask IoU, which is then used to score the mask.
Experiments on COCO show a consistent gain of about 1.5 AP points across several backbones.

Mask Scoring R-CNN: A Comprehensive Overview

Instance segmentation, a core task in computer vision, aims to detect every object instance in an image and predict a pixel-level mask for each. Mask R-CNN, the prevailing framework in this domain, uses the classification confidence as the score of each instance mask, but that confidence is often poorly correlated with the mask's actual quality. The paper "Mask Scoring R-CNN" addresses this misalignment by adding a MaskIoU head that scores masks more accurately.
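To make "mask quality" concrete, here is a minimal sketch of the IoU between a predicted binary mask and its ground truth. The function name `mask_iou`, the toy 4x4 masks, and the shared confidence value are illustrative assumptions, not taken from the paper's code.

```python
def mask_iou(pred, gt):
    """IoU between two binary masks given as equal-sized 2D lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks have no defined overlap; this sketch returns 0.0 for them.
    return inter / union if union else 0.0


# Toy ground truth: a 3x2 block of foreground pixels.
gt = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
# A tight prediction that misses one pixel.
tight = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
# A loose prediction that covers the object plus a lot of background.
loose = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]

cls_score = 0.95  # hypothetical classification confidence shared by both predictions
tight_iou = mask_iou(tight, gt)  # 5 / 6
loose_iou = mask_iou(loose, gt)  # 6 / 12
```

Both predictions would receive the same score if only the classification confidence were used, even though their mask IoUs differ substantially. This is the gap the MaskIoU head is meant to close.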

Technical Contributions

The paper presents several technical contributions:

Introduction of Mask Scoring R-CNN: The authors propose an extension of the Mask R-CNN framework, termed Mask Scoring R-CNN (MS R-CNN). It adds a MaskIoU head that scores each instance mask by predicting the Intersection-over-Union (IoU) between the predicted mask and its ground truth.
MaskIoU Head Design: The MaskIoU head learns to regress the IoU directly from the instance feature and the predicted mask. It takes the RoI feature together with the predicted mask and passes them through a series of convolutional and fully connected layers to predict the MaskIoU.
Incremental Performance Gains: Mask Scoring R-CNN consistently outperforms Mask R-CNN. In extensive experiments on the COCO dataset, the paper reports an AP improvement of about 1.5 points across different backbones, including ResNet-18 FPN, ResNet-50 FPN, and ResNet-101 FPN.
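The effect of the extra head on evaluation can be sketched as a re-ranking step. The sketch below assumes the final mask score is the product of the classification score and the predicted MaskIoU, which is how the paper describes it to the best of our reading; the `Detection` class, the helper names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    name: str
    cls_score: float      # classification confidence from the detection branch
    pred_mask_iou: float  # hypothetical output of the MaskIoU head


def mask_score(det):
    # Assumed scoring rule: classification score times predicted MaskIoU.
    return det.cls_score * det.pred_mask_iou


def rank(dets, key):
    """Return detection names sorted from highest to lowest score."""
    return [d.name for d in sorted(dets, key=key, reverse=True)]


dets = [
    Detection("a", cls_score=0.98, pred_mask_iou=0.55),  # confident class, poor mask
    Detection("b", cls_score=0.90, pred_mask_iou=0.92),  # good mask
    Detection("c", cls_score=0.85, pred_mask_iou=0.80),
]

by_cls = rank(dets, key=lambda d: d.cls_score)  # ranking used by plain Mask R-CNN
by_mask = rank(dets, key=mask_score)            # ranking after mask rescoring
```

With these made-up numbers, detection "a" ranks first by classification score but last after rescoring. Because COCO AP depends on how predictions are ordered by score, moving accurate masks ahead of inaccurate ones is what yields the AP gain.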

Experimental Validation

The experimental results substantiate the efficacy of MS R-CNN:

Robustness Across Backbones: MS R-CNN improves AP consistently regardless of the backbone network used. The gain holds for ResNet-101 FPN as well as for the smaller backbones.
Framework Versatility: The MaskIoU head is not tied to a single configuration. It also improves Mask R-CNN variants built on different detection frameworks, namely Faster R-CNN, FPN, and DCN+FPN.
COCO Benchmark Performance: On COCO 2017 test-dev, MS R-CNN outperforms existing instance segmentation frameworks. In particular, with ResNet-101 DCN+FPN it attains an AP of 39.6%, compared to 38.4% for the baseline Mask R-CNN, a gain of 1.2 points.

Architectural Considerations

Adding the MaskIoU head involves two main design choices:

Input Fusion: Various methods for fusing the predicted mask and RoI features were explored. The recommended design involves concatenating the score map of the target class with the RoI feature.
Training Targets: The head is trained to regress the MaskIoU of the target class only. The paper finds that regressing the IoU solely for the relevant category performs best among the training-target options it considered.
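The recommended input fusion can be sketched with plain nested lists to show how the shapes line up. The sizes used here (a 256-channel 14x14 RoI feature, 28x28 mask score maps, and a 2x2 max pooling step to match resolutions) are assumptions based on our reading of the paper, and the function names are hypothetical.

```python
def max_pool_2x2(grid):
    """2x2 max pooling with stride 2 over a 2D list with even height and width."""
    return [
        [
            max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
            for c in range(0, len(grid[0]), 2)
        ]
        for r in range(0, len(grid), 2)
    ]


def fuse(roi_feature, class_masks, target_class):
    """Append the target class's pooled score map to the RoI feature channels."""
    score_map = class_masks[target_class]  # keep only the target class
    pooled = max_pool_2x2(score_map)       # bring the mask to the feature resolution
    assert len(pooled) == len(roi_feature[0])
    assert len(pooled[0]) == len(roi_feature[0][0])
    return roi_feature + [pooled]          # channel-wise concatenation (channels first)


# Assumed sizes; only 3 classes are used to keep the demo small.
CHANNELS, FEAT, MASK, NUM_CLASSES = 256, 14, 28, 3

roi_feature = [[[0.0] * FEAT for _ in range(FEAT)] for _ in range(CHANNELS)]
class_masks = [[[0.1 * k] * MASK for _ in range(MASK)] for k in range(NUM_CLASSES)]
class_masks[2][5][7] = 0.9  # one confident foreground pixel for class 2

fused = fuse(roi_feature, class_masks, target_class=2)
shape = (len(fused), len(fused[0]), len(fused[0][0]))  # channels, height, width
```

The fused tensor has one more channel than the RoI feature, and that extra channel carries only the score map of the target class. In a real implementation this tensor would then go through the convolutional and fully connected layers of the MaskIoU head.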

Implications and Future Directions

The Mask Scoring R-CNN framework scores instance masks so that the score tracks actual mask quality. The practical benefit is immediate: more reliable mask scores improve measured instance segmentation accuracy, which matters for applications such as autonomous driving, video surveillance, and medical imaging.

Additionally, the consistent gains of MS R-CNN across backbone networks and its easy integration into different instance segmentation frameworks suggest that it could become a standard component in future work. Future directions might include further optimizing the MaskIoU head, for example with more sophisticated mechanisms for predicting IoU, or extending the approach to other computer vision tasks such as semantic segmentation and panoptic segmentation.

In conclusion, "Mask Scoring R-CNN" offers a concrete advance in instance segmentation by scoring mask predictions according to their quality. The work is a valuable addition that makes the confidence scores of instance segmentation models more reliable.