LiDAR-Camera Fusion for Video Panoptic Segmentation without Video Training

Published 30 Dec 2024 in cs.CV | (2412.20881v1)

Abstract: Panoptic segmentation, which combines instance and semantic segmentation, has gained a lot of attention in autonomous vehicles, due to its comprehensive representation of the scene. This task can be applied to cameras and LiDAR sensors, but there has been a limited focus on combining both sensors to enhance image panoptic segmentation (PS). Although previous research has acknowledged the benefit of 3D data on camera-based scene perception, no specific study has explored the influence of 3D data on image and video panoptic segmentation (VPS). This work seeks to introduce a feature fusion module that enhances PS and VPS by fusing LiDAR and image data for autonomous vehicles. We also illustrate that, in addition to this fusion, our proposed model, which utilizes two simple modifications, can deliver even higher-quality VPS without being trained on video data. The results demonstrate a substantial improvement in both the image and video panoptic segmentation evaluation metrics by up to 5 points.


Summary

The paper introduces a novel LiDAR-camera fusion technique enhancing frame-to-frame consistency in video panoptic segmentation.
It employs location-aware and time-aware queries to integrate spatial and temporal features without video-specific training.
Experimental results show up to a five-point improvement in segmentation metrics on standard benchmarks like Cityscapes-vps.

Introduction

This paper investigates the integration of LiDAR and camera data for improving video panoptic segmentation (VPS) without relying on video-trained models. Panoptic segmentation, which combines instance and semantic segmentation tasks, is pivotal in the context of autonomous vehicles. While previous works have demonstrated the benefits of incorporating 3D data like LiDAR into image-based models, there is a dearth of research on its application to VPS. The primary contribution of this work is a feature fusion module that enhances segmentation performance through LiDAR-camera data integration. Notably, the model achieves competitive results without training on video datasets, showcasing an improvement in segmentation metrics by up to five points.

Panoptic Segmentation

Recent advances in panoptic segmentation have been driven by transformer-based approaches, such as Mask2Former, which employ object queries that offer strong discriminative properties. Mask2Former uses masked attention, which restricts each query's cross-attention to the region of its predicted segment, improving both the accuracy and the efficiency of segmentation over pixel-wise approaches.
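
To make the idea of masked attention concrete, here is a minimal single-query sketch in plain Python. It is not Mask2Former's implementation: the function name `masked_attention`, the list-based features, and the fallback to full attention on an empty mask are assumptions made for illustration.

```python
import math


def masked_attention(query, keys, values, mask):
    """Attend from one object query to the pixels allowed by a boolean mask.

    query:  list of d floats (one object query)
    keys:   list of per-pixel key vectors, each of length d
    values: list of per-pixel value vectors
    mask:   list of bools, True where the segment was predicted previously
    """
    d = len(query)
    allowed = [i for i, keep in enumerate(mask) if keep]
    if not allowed:
        # Assumed safeguard: an empty mask falls back to attending everywhere.
        allowed = list(range(len(keys)))

    # Scaled dot-product scores, computed only for the allowed pixels.
    scores = [
        sum(q * k for q, k in zip(query, keys[i])) / math.sqrt(d)
        for i in allowed
    ]

    # Numerically stable softmax over the allowed pixels.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the allowed value vectors.
    out = [0.0] * len(values[0])
    for w, i in zip(weights, allowed):
        for j, v in enumerate(values[i]):
            out[j] += w * v
    return out
```

Pixels outside the mask contribute nothing to the output, so each query only aggregates features from the region it already claims.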

Online Video Panoptic Segmentation

Most methods for VPS use transformer queries to track objects across frames, associating the queries of consecutive frames through bipartite matching, which avoids the need for video-specific training (1234.56789). This paper builds on these transformer-based techniques to obtain a VPS model that is trained without video data, in contrast to video-supervised models, which often add a contrastive loss or memory banks to improve tracking performance.
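
The query association step can be sketched as follows. This is an illustrative brute-force version that tries every one-to-one assignment, which is only practical for a handful of queries; real systems typically use the Hungarian algorithm instead. The function names and the choice of cosine similarity as the matching score are assumptions, not details taken from the paper.

```python
import math
from itertools import permutations


def cosine(a, b):
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_queries(prev_queries, curr_queries):
    """Return {current index: previous index} maximizing total similarity.

    Assumes len(prev_queries) >= len(curr_queries) and a small query count,
    because every one-to-one assignment is enumerated.
    """
    n = len(curr_queries)
    sim = [[cosine(c, p) for p in prev_queries] for c in curr_queries]
    best = max(
        permutations(range(len(prev_queries)), n),
        key=lambda perm: sum(sim[i][perm[i]] for i in range(n)),
    )
    return {i: best[i] for i in range(n)}
```

Each current-frame segment then inherits the track ID of the previous-frame query it was matched to.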

LiDAR-Image Multimodal Learning

Fusing LiDAR with vision data has a long-standing history in improving object detection tasks. Research highlights the advantages of supplementing camera data with depth information to capture strong edge details, which enhances segmentation performance (2345.6789). However, prior studies primarily concentrated on semantic segmentation, thus underlining the novelty of the current work in its specific application to panoptic segmentation.

Methodology

The proposed methodology enhances both image and video panoptic segmentation by fusing LiDAR and image data. The model employs a transformer decoder to leverage object queries effectively, introducing two network modifications:

Location-Aware Queries (LAQ): These queries use an MLP to predict segment positions, giving each query spatial awareness that improves frame-to-frame matching.
Time-Aware Queries (TAQ): These queries carry the queries from previous frames over to the current frame during evaluation, improving continuity and accuracy across frames.
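
The summary does not spell out how the predicted positions enter the matching or how queries are propagated, so the following is only a rough sketch of how the two modifications might look. Everything here is hypothetical: the names `location_aware_affinity` and `time_aware_init`, the `pos_weight` parameter, the use of normalized (x, y) centers, and the rule of reusing the previous frame's queries as the starting point.

```python
import math


def location_aware_affinity(emb_a, pos_a, emb_b, pos_b, pos_weight=1.0):
    """Score a pair of queries by appearance and by predicted position.

    emb_*: query embeddings; pos_*: predicted (x, y) centers in [0, 1].
    Hypothetical scoring rule: cosine similarity minus weighted distance.
    """
    dot = sum(x * y for x, y in zip(emb_a, emb_b))
    norm = math.sqrt(sum(x * x for x in emb_a)) * math.sqrt(sum(y * y for y in emb_b))
    appearance = dot / norm if norm else 0.0
    distance = math.dist(pos_a, pos_b)
    return appearance - pos_weight * distance


def time_aware_init(learned_queries, prev_frame_queries):
    """Choose the decoder's starting queries for the current frame.

    Hypothetical rule: the first frame uses the learned queries; later
    frames start from the queries produced for the previous frame.
    """
    source = learned_queries if prev_frame_queries is None else prev_frame_queries
    return [list(q) for q in source]
```

With a positional term in the score, two segments that look alike but sit far apart in the image are less likely to be confused during matching.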

The model architecture and fusion strategy, as shown in Figure 1, combine image and depth features through a novel dynamic weighting mechanism. This improves segmentation by dynamically deciding the contribution of depth data relative to image features.
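
As a rough illustration of dynamic weighting versus plain addition, the sketch below computes a scalar gate from the concatenated image and depth features and uses it to scale the depth contribution. The gate form (a sigmoid over a linear projection), the names `gated_fusion`, `gate_w`, and `gate_b`, and the per-location scalar gate are assumptions; the paper's actual module may differ.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_fusion(img_feat, depth_feat, gate_w, gate_b=0.0):
    """Fuse one location's image and depth features with a learned gate.

    img_feat, depth_feat: equal-length feature vectors
    gate_w: weights over the concatenated [img_feat, depth_feat] vector
    Returns (fused_features, gate), where gate lies in (0, 1).
    """
    joint = list(img_feat) + list(depth_feat)
    gate = sigmoid(sum(w * x for w, x in zip(gate_w, joint)) + gate_b)
    fused = [i + gate * d for i, d in zip(img_feat, depth_feat)]
    return fused, gate
```

Simple addition corresponds to a gate fixed at 1; letting the gate vary allows the network to suppress depth features where they are unreliable.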

Figure 1: Overall architecture of the proposed method, which is based on Mask2Former.

Experimentation and Results

Dataset and Training

The model is trained for panoptic segmentation on the Cityscapes dataset, while video segmentation is evaluated on Cityscapes-vps (3456.7890). Because of dataset limitations, the LiDAR data is simulated, and the model is trained on modified depth information.
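
The summary does not describe how the depth information is modified, so the sketch below shows just one plausible way to imitate sparse LiDAR returns from a dense depth map: keep samples on a regular grid of scan lines and drop points beyond a maximum range. The function name `sparsify_depth` and the parameters `row_step`, `col_step`, and `max_range` are hypothetical.

```python
def sparsify_depth(depth, row_step=4, col_step=2, max_range=80.0):
    """Turn a dense depth map into a sparse, LiDAR-like one.

    depth: 2D list of depths in meters
    A sample is kept only on every row_step-th row and col_step-th column,
    and only if its depth is positive and within max_range.
    Dropped samples are set to 0.0, meaning "no return".
    """
    sparse = []
    for r, row in enumerate(depth):
        out_row = []
        for c, d in enumerate(row):
            on_grid = r % row_step == 0 and c % col_step == 0
            in_range = 0.0 < d <= max_range
            out_row.append(d if on_grid and in_range else 0.0)
        sparse.append(out_row)
    return sparse
```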

Panoptic Segmentation Performance

The fusion of LiDAR and image data enhances segmentation accuracy, especially in object segmentation tasks. The experiments confirm that dynamic weighting significantly outperforms simple addition of features, yielding a notable improvement in segmentation quality (Table 1).
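
The summary does not name the metric behind Table 1. Assuming it is the standard Panoptic Quality (PQ), the score is the sum of IoUs over matched segment pairs divided by the number of true positives plus half the false positives and half the false negatives. The helper below is a minimal sketch of that formula; the function name and argument layout are illustrative.

```python
def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    """Compute PQ = sum(IoU over TP) / (|TP| + 0.5 * |FP| + 0.5 * |FN|).

    matched_ious: IoU of each predicted/ground-truth pair counted as a
    true positive; by definition each must exceed 0.5.
    """
    if any(iou <= 0.5 for iou in matched_ious):
        raise ValueError("true-positive matches must have IoU > 0.5")
    denominator = (
        len(matched_ious)
        + 0.5 * num_false_positives
        + 0.5 * num_false_negatives
    )
    if denominator == 0:
        return 0.0  # no segments predicted and none in the ground truth
    return sum(matched_ious) / denominator
```

For example, two matches with IoUs of 0.8 and 0.6, one false positive, and one false negative give 1.4 / 3, or roughly 0.47.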

Video Panoptic Segmentation Results

Table 2 showcases the improvements achieved by the proposed model over the baseline Mask2Former in VPS metrics. Incorporating LAQ and TAQ yields performance gains that narrow the gap between video-free and video-supervised models. Notably, the model excels in "things" segmentation, underlining its utility in applications requiring detailed object tracking.

Figure 2: Panoptic segmentation output for a video sequence. The base model (left) shows significant ID switches compared to our proposed method (right).

Conclusion

The research delivers a robust framework for LiDAR-camera integration in VPS, highlighting the feasibility of achieving high performance without video-based training. Although the model still lags behind video-supervised counterparts, its substantial improvements in segmentation accuracy underscore its potential for autonomous vehicle applications. Future work should explore scaling the approach to larger batches and employing advanced fusion techniques to further improve results.
