- The paper introduces a novel ADSD model that fuses RGB and depth data using attention mechanisms for semantic segmentation.
- The dual-branch decoder, incorporating ASPP, addresses data imbalance and enhances multi-task supervision for robust feature learning.
- Experimental results on NYUDv2 and SUN-RGBD demonstrate superior performance, including 52.5% mIoU on NYUDv2.
Attention-based Dual Supervised Decoder for RGBD Semantic Segmentation
Introduction
RGBD semantic segmentation is the task of classifying every pixel of an image into a semantic category using both RGB color and depth data. The task matters because of its applications across domains such as augmented reality (AR), virtual reality (VR), and autonomous systems. Its difficulty lies in jointly reasoning over the two modalities, RGB and depth (D), and fusing them effectively for segmentation.
Overview of the Proposed Architecture
The paper proposes an architecture termed the Attention-based Dual Supervised Decoder (ADSD), designed specifically for RGBD semantic segmentation. The architecture combines a two-stream encoder with an attention-based, dual-branch supervised decoder, enabling robust multi-modal fusion and multi-task supervision (Figure 1).
Figure 1: Overview of the proposed ADSD architecture detailing the two-stream encoder and dual-branch decoder.
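The following PyTorch sketch illustrates the overall data flow described above: two encoder streams, a fusion step, and a decoder with a segmentation head plus an auxiliary supervision head. The simple convolutional stand-ins, channel sizes, and the `ADSDSketch` name are illustrative placeholders under these assumptions, not the paper's implementation.

```python
# Minimal sketch of the ADSD data flow (assumed structure, not the authors' code):
# two-stream encoder -> modality fusion -> dual-branch decoder (segmentation + auxiliary task).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ADSDSketch(nn.Module):
    def __init__(self, num_classes=40, channels=64):
        super().__init__()
        # Two-stream encoder: one branch per modality (simple conv stand-ins for ResNet-50 stages).
        self.rgb_encoder = nn.Sequential(nn.Conv2d(3, channels, 3, stride=4, padding=1), nn.ReLU())
        self.depth_encoder = nn.Sequential(nn.Conv2d(1, channels, 3, stride=4, padding=1), nn.ReLU())
        # Fusion of the two modalities (attention-based in the paper; a plain 1x1 conv here).
        self.fuse = nn.Conv2d(2 * channels, channels, 1)
        # Dual-branch decoder: primary segmentation head + auxiliary supervision head.
        self.seg_head = nn.Conv2d(channels, num_classes, 1)
        self.aux_head = nn.Conv2d(channels, 3, 1)  # e.g. surface-normal estimation

    def forward(self, rgb, depth):
        f_rgb = self.rgb_encoder(rgb)
        f_depth = self.depth_encoder(depth)
        fused = self.fuse(torch.cat([f_rgb, f_depth], dim=1))
        seg = F.interpolate(self.seg_head(fused), size=rgb.shape[2:], mode="bilinear", align_corners=False)
        aux = F.interpolate(self.aux_head(fused), size=rgb.shape[2:], mode="bilinear", align_corners=False)
        return seg, aux  # both outputs receive supervision during training

rgb = torch.randn(2, 3, 480, 640)
depth = torch.randn(2, 1, 480, 640)
seg, aux = ADSDSketch()(rgb, depth)
print(seg.shape, aux.shape)
```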
Encoder Design and Multi-modal Fusion
The encoder uses a two-stream design to process RGB and depth information separately, with ResNet-50 as the backbone of each stream. Its key component is the Attention-based Multi-modal Fusion (AMF) module, which extracts and fuses multi-level complementary information from the two modalities. AMF applies channel attention to emphasize informative channels and spatial attention to highlight informative regions (Figure 2), letting the network prioritize the most useful feature maps and thereby improving segmentation performance.

Figure 2: Detailed structure of channel attention and spatial attention used in AMF.
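As a rough illustration of how channel and spatial attention can be combined for multi-modal fusion, the sketch below uses a squeeze-and-excitation-style channel attention and a CBAM-style spatial attention. The paper's exact AMF layer composition may differ, so treat the module names, reduction ratio, and kernel size as assumptions.

```python
# Hypothetical attention-based fusion in the spirit of AMF: channel attention
# reweights channels of each modality, spatial attention highlights informative
# locations of the fused map. Not the paper's exact design.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.mlp(x)  # per-channel reweighting

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Pool across channels, then predict a per-pixel attention map.
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class AttentionFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca_rgb = ChannelAttention(channels)
        self.ca_depth = ChannelAttention(channels)
        self.proj = nn.Conv2d(2 * channels, channels, 1)
        self.sa = SpatialAttention()

    def forward(self, f_rgb, f_depth):
        fused = torch.cat([self.ca_rgb(f_rgb), self.ca_depth(f_depth)], dim=1)
        return self.sa(self.proj(fused))

f_rgb = torch.randn(1, 256, 60, 80)
f_depth = torch.randn(1, 256, 60, 80)
print(AttentionFusion(256)(f_rgb, f_depth).shape)
```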
Dual-branch Decoder
The dual-branch decoder is designed to address the imbalance between encoder and decoder that is typical of semantic segmentation models. It comprises a primary branch for semantic segmentation and a secondary branch that provides additional supervision and can be tasked with surface-normal estimation, depth estimation, or further semantic objectives. To improve multi-scale context aggregation and segmentation accuracy, Atrous Spatial Pyramid Pooling (ASPP) is integrated into the primary branch (Figure 3).
Figure 3: Diagram of the proposed dual-branch decoder showing integration of ASPP for enhanced semantic segmentation.
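A minimal ASPP module in the DeepLab style is sketched below: parallel atrous convolutions with different dilation rates plus global pooling capture multi-scale context. The dilation rates and channel counts are common defaults, not values taken from the paper.

```python
# Generic ASPP sketch (assumed defaults, not the ADSD configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        # One 1x1 branch plus one atrous 3x3 branch per dilation rate.
        self.branches = nn.ModuleList([nn.Conv2d(in_ch, out_ch, 1)])
        for r in rates:
            self.branches.append(nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r))
        # Image-level context via global average pooling.
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        feats = [b(x) for b in self.branches]
        pooled = F.interpolate(self.pool(x), size=x.shape[2:], mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))

x = torch.randn(1, 512, 60, 80)
print(ASPP(512)(x).shape)  # torch.Size([1, 256, 60, 80])
```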
Experimental Results
The model was evaluated on two benchmark datasets, NYUDv2 and SUN-RGBD, where it outperformed existing state-of-the-art techniques. On NYUDv2, ADSD achieved a mean Intersection over Union (mIoU) of 52.5%, an improvement attributed to the multi-level fusion and multi-task supervision strategies adopted in the decoder. The dual-branch decoder also eased training: it mitigated the difficulty caused by data imbalance and accelerated convergence, reflected in the rapid reduction of loss values during training (Figure 4).
Figure 4: Statistics of loss values during training highlighting the convergence improvement with ADSD.
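For reference, the mIoU metric reported above is conventionally computed from a per-class confusion matrix. The utility below is a generic implementation of that standard computation, not the authors' evaluation code; the 40-class setting and ignore label are illustrative defaults.

```python
# Standard per-class IoU / mean IoU from a confusion matrix (generic utility).
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """pred, target: integer label arrays of the same shape."""
    mask = target != ignore_index
    pred, target = pred[mask], target[mask]
    # Confusion matrix: rows = ground truth, columns = prediction.
    cm = np.bincount(num_classes * target + pred, minlength=num_classes ** 2)
    cm = cm.reshape(num_classes, num_classes)
    intersection = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection
    iou = intersection / np.maximum(union, 1)
    # Average only over classes that actually appear.
    return np.nanmean(np.where(union > 0, iou, np.nan))

pred = np.random.randint(0, 40, size=(480, 640))
target = np.random.randint(0, 40, size=(480, 640))
print(mean_iou(pred, target, num_classes=40))
```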
Conclusion
The ADSD model advances the use of multi-modal information for RGBD semantic segmentation. Attention-based fusion and the dual-branch decoder together enable robust feature fusion and effective task supervision, leading to improved segmentation performance. Future research could extend these techniques to other vision applications, with a focus on network efficiency and broader applicability.