HAPNet: Toward Superior RGB-Thermal Scene Parsing via Hybrid, Asymmetric, and Progressive Heterogeneous Feature Fusion
Abstract: Data-fusion networks have shown significant promise for RGB-thermal scene parsing. However, most existing studies rely on symmetric duplex encoders for heterogeneous feature extraction and fusion, paying inadequate attention to the inherent differences between the RGB and thermal modalities. Recent progress in vision foundation models (VFMs) trained through self-supervision on vast amounts of unlabeled data has demonstrated their ability to extract informative, general-purpose features, yet this potential remains largely untapped in this domain. In this study, we take one step toward this new research area by exploring a feasible strategy to fully exploit VFM features for RGB-thermal scene parsing. Specifically, we delve deeper into the unique characteristics of the RGB and thermal modalities and design a hybrid, asymmetric encoder that incorporates both a VFM and a convolutional neural network. This design allows for more effective extraction of complementary heterogeneous features, which are subsequently fused in a dual-path, progressive manner. Moreover, we introduce an auxiliary task that further enriches the local semantics of the fused features, thereby improving the overall performance of RGB-thermal scene parsing. Our proposed HAPNet, equipped with all these components, outperforms all other state-of-the-art RGB-thermal scene parsing networks, ranking first on three widely used public RGB-thermal scene parsing datasets. We believe this new paradigm opens up new opportunities for future developments in data-fusion scene parsing approaches.
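To make the design described above concrete, the PyTorch sketch below illustrates the hybrid, asymmetric idea: a small ViT-style branch stands in for the pretrained VFM on the RGB input, a lightweight CNN processes the thermal input, and the two streams are combined by a gated, dual-path fusion block applied progressively, with a main segmentation head and an auxiliary head. All module sizes, the stage count, and the particular fusion operator are illustrative assumptions drawn only from the abstract, not the authors' implementation.

```python
# Minimal, self-contained sketch of a hybrid, asymmetric, progressive
# RGB-thermal encoder in the spirit of HAPNet's abstract. Everything below
# (dimensions, stage count, gating) is a hypothetical stand-in.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyViTBranch(nn.Module):
    """Stand-in for a pretrained VFM (e.g., DINOv2): patch embed + transformer."""
    def __init__(self, dim=256, depth=4, patch=16):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        f = self.embed(x)                       # (B, dim, H/16, W/16)
        b, c, h, w = f.shape
        tokens = self.encoder(f.flatten(2).transpose(1, 2))
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class CNNBranch(nn.Module):
    """Lightweight convolutional branch for the single-channel thermal input."""
    def __init__(self, dim=256):
        super().__init__()
        layers, c_in = [], 1
        for c_out in (32, 64, 128, dim):        # four stride-2 stages => /16
            layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                       nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
            c_in = c_out
        self.stem = nn.Sequential(*layers)

    def forward(self, x):
        return self.stem(x)


class DualPathFusion(nn.Module):
    """Dual-path fusion: each modality gates the other, then paths are merged."""
    def __init__(self, dim=256):
        super().__init__()
        self.gate_r = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.Sigmoid())
        self.gate_t = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.Sigmoid())
        self.merge = nn.Conv2d(2 * dim, dim, 1)

    def forward(self, f_rgb, f_thm):
        f_rgb = f_rgb + f_rgb * self.gate_t(f_thm)   # thermal modulates RGB
        f_thm = f_thm + f_thm * self.gate_r(f_rgb)   # RGB modulates thermal
        return self.merge(torch.cat([f_rgb, f_thm], dim=1))


class HAPNetSketch(nn.Module):
    def __init__(self, num_classes=9, dim=256, stages=3):
        super().__init__()
        self.rgb_branch = TinyViTBranch(dim)
        self.thm_branch = CNNBranch(dim)
        self.fusions = nn.ModuleList(DualPathFusion(dim) for _ in range(stages))
        self.seg_head = nn.Conv2d(dim, num_classes, 1)   # main parsing task
        self.aux_head = nn.Conv2d(dim, num_classes, 1)   # auxiliary task

    def forward(self, rgb, thermal):
        f_rgb = self.rgb_branch(rgb)
        f_thm = self.thm_branch(thermal)
        fused = None
        for fuse in self.fusions:       # progressive: each stage adds to the
            f = fuse(f_rgb, f_thm)      # running fused representation
            fused = f if fused is None else fused + f
        up = lambda y: F.interpolate(y, scale_factor=16, mode="bilinear",
                                     align_corners=False)
        return up(self.seg_head(fused)), up(self.aux_head(fused))


if __name__ == "__main__":
    net = HAPNetSketch()
    logits, aux = net(torch.randn(1, 3, 224, 224), torch.randn(1, 1, 224, 224))
    print(logits.shape)                 # torch.Size([1, 9, 224, 224])
```

The asymmetry here mirrors the abstract's motivation: the transformer-based VFM branch supplies informative, general-purpose features from RGB, while the convolutional branch preserves the local structure of the lower-texture thermal signal; the auxiliary head is one plausible way to enrich local semantics of the fused features during training.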