Attention-based Dual Supervised Decoder for RGBD Semantic Segmentation

Published 5 Jan 2022 in cs.CV and eess.IV | (2201.01427v2)

Abstract: Encoder-decoder models have been widely used in RGBD semantic segmentation, and most of them are designed via a two-stream network. In general, jointly reasoning the color and geometric information from RGBD is beneficial for semantic segmentation. However, most existing approaches fail to comprehensively utilize multimodal information in both the encoder and decoder. In this paper, we propose a novel attention-based dual supervised decoder for RGBD semantic segmentation. In the encoder, we design a simple yet effective attention-based multimodal fusion module to extract and fuse deeply multi-level paired complementary information. To learn more robust deep representations and rich multi-modal information, we introduce a dual-branch decoder to effectively leverage the correlations and complementary cues of different tasks. Extensive experiments on NYUDv2 and SUN-RGBD datasets demonstrate that our method achieves superior performance against the state-of-the-art methods.

Citations (13)

Summary

  • The paper introduces a novel ADSD model that fuses RGB and depth data using attention mechanisms for semantic segmentation.
  • The dual-branch decoder, incorporating ASPP, addresses data imbalance and enhances multi-task supervision for robust feature learning.
  • Experimental results on NYUDv2 and SUN-RGBD demonstrate superior performance, reaching 52.5% mIoU on NYUDv2.

Attention-based Dual Supervised Decoder for RGBD Semantic Segmentation

Introduction

The field of RGBD semantic segmentation involves the challenging task of using both RGB color and depth data to classify each pixel of an image into a semantic category. This task is of particular importance due to its applications across various domains such as augmented reality (AR), virtual reality (VR), and autonomous systems. The complexity arises from processing joint reasoning from both modalities—RGB and depth (D)—and effectively utilizing them in semantic segmentation tasks.

Overview of the Proposed Architecture

The paper proposes an architecture termed the Attention-based Dual Supervised Decoder (ADSD), designed specifically for RGBD semantic segmentation. The architecture comprises a two-stream encoder and a dual-branch decoder, which together enable robust multi-modal fusion and task supervision (Figure 1).

Figure 1: Overview of the proposed ADSD architecture detailing the two-stream encoder and dual-branch decoder.

Encoder Design and Multi-modal Fusion

The encoder employs a two-stream design to separately process RGB and depth information, using ResNet-50 as the backbone. The innovative aspect here is the Attention-based Multi-modal Fusion (AMF) module, which is engineered to effectively extract and fuse multi-level complementary information from both modalities. The AMF uses channel attention mechanisms to emphasize important channels, allowing the network to prioritize useful feature maps and thereby enhancing semantic segmentation performance (Figure 2).

Figure 2: Detailed structure of channel attention and spatial attention used in AMF.
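To make the channel-attention idea concrete, here is a minimal NumPy sketch of squeeze-and-excitation-style channel reweighting and a hypothetical fusion step; the function names (`channel_attention`, `amf_fuse`), the bottleneck weights `w1`/`w2`, and the add-based fusion are illustrative assumptions, not the authors' exact AMF module:

```python
import numpy as np

def channel_attention(feat, w1, w2):
    # feat: (C, H, W) feature map; w1: (C//r, C) and w2: (C, C//r) form a
    # bottleneck MLP with reduction ratio r.
    squeeze = feat.mean(axis=(1, 2))                 # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeeze, 0.0)           # ReLU bottleneck
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))      # sigmoid gate in (0, 1)
    return feat * gate[:, None, None]                # reweight each channel

def amf_fuse(rgb_feat, depth_feat, w1, w2):
    # Hypothetical fusion: gate the depth features by channel importance,
    # then combine with the RGB stream by addition.
    return rgb_feat + channel_attention(depth_feat, w1, w2)
```

Because the gate lies in (0, 1), attention can only suppress uninformative channels relative to the input, which matches the stated goal of prioritizing useful feature maps.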

Dual-branch Decoder

The dual-branch decoder is engineered to address the imbalance between encoder and decoder typically observed in semantic segmentation models. It comprises a primary branch for semantic segmentation, aided by a secondary branch that can be tasked with normal estimation, depth estimation, or an additional semantic task to provide auxiliary supervision. To improve multi-scale context processing and segmentation accuracy, Atrous Spatial Pyramid Pooling (ASPP) is integrated into the primary branch (Figure 3).

Figure 3: Diagram of the proposed dual-branch decoder showing integration of ASPP for enhanced semantic segmentation.
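The core of ASPP is applying parallel atrous (dilated) convolutions at different rates to capture multi-scale context. The following NumPy sketch shows a single-channel dilated convolution and a simplified parallel combination; real ASPP concatenates branch outputs and adds a 1x1 convolution plus image-level pooling, which this illustration (a stated simplification, summing instead) omits:

```python
import numpy as np

def dilated_conv2d(x, kernel, rate):
    # x: (H, W) single-channel map; kernel: (k, k); zero padding keeps the
    # output the same size as the input. Dilation rate r spaces the kernel
    # taps r pixels apart, enlarging the receptive field without more weights.
    k = kernel.shape[0]
    pad = rate * (k // 2)
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for i in range(k):
        for j in range(k):
            out += kernel[i, j] * xp[i * rate : i * rate + x.shape[0],
                                     j * rate : j * rate + x.shape[1]]
    return out

def aspp(x, kernels, rates):
    # Parallel atrous branches at different rates, combined by summation
    # here for simplicity (standard ASPP concatenates the branches).
    return sum(dilated_conv2d(x, k, r) for k, r in zip(kernels, rates))
```

Raising the rate widens the receptive field at constant parameter count, which is why ASPP helps the primary branch aggregate multi-scale context.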

Experimental Results

The model was evaluated on two benchmark datasets, NYUDv2 and SUN-RGBD, demonstrating superior performance over existing state-of-the-art techniques. On NYUDv2, ADSD achieved a mean Intersection over Union (mIoU) of 52.5%, an improvement credited to the multi-level and multi-task learning strategies adopted in the decoder. The dual-branch decoder also facilitated convergence and reduced the training difficulty caused by data imbalance, as illustrated by the rapid reduction in loss values during training (Figure 4).

Figure 4: Statistics of loss values during training highlighting the convergence improvement with ADSD.
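For reference, the reported mIoU metric averages per-class intersection-over-union between predicted and ground-truth label maps. A minimal sketch (omitting the ignore-index handling that segmentation benchmarks typically use):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    # pred, gt: integer label maps of identical shape.
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```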

Conclusion

The ADSD model presents a significant advancement in leveraging multi-modal information for RGBD semantic segmentation. The use of attention mechanisms and a dual-branched approach enables robust feature fusion and effective task supervision, leading to enhanced segmentation performance. Future research could explore extending these techniques to other vision applications, focusing on network efficiency and broader applicability.
