- The paper introduces RenderOcc, a novel approach that predicts 3D occupancy efficiently using only 2D rendering supervision.
- It employs a NeRF-style 3D volumetric representation with semantic density fields and auxiliary rays to enhance multi-view consistency.
- Evaluation on nuScenes and SemanticKITTI shows competitive mIoU performance compared to models relying on full 3D labels.
An In-Depth Analysis of "RenderOcc: Vision-Centric 3D Occupancy Prediction with 2D Rendering Supervision"
Introduction
The paper "RenderOcc: Vision-Centric 3D Occupancy Prediction with 2D Rendering Supervision" (2309.09502) introduces an innovative approach to 3D occupancy prediction. This task, which assigns a semantic label to each cell of a 3D voxel grid, is crucial in applications such as robotic perception and autonomous driving. Existing methodologies have traditionally relied on costly and often ambiguous 3D labels for training, which limits their scalability given the high expense of producing detailed 3D annotations. RenderOcc addresses these limitations with a new training paradigm that leverages far cheaper 2D image labels for effective supervision.
Methodology Overview
RenderOcc's primary advancement is a training approach that employs 2D labels, circumventing the need for direct 3D annotations. It constructs a NeRF-style 3D volume representation from multi-view images and applies volume rendering to produce 2D semantic and depth maps that can be supervised directly with pixel-level 2D labels. This allows the model to learn fine-grained geometry that is traditionally constrained by the quality and cost of 3D data.
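The volume-rendering step can be illustrated with the standard NeRF-style quadrature: each sample along a ray contributes according to its density-derived alpha and the accumulated transmittance. The sketch below is a minimal single-ray version; the function and variable names are illustrative, not the paper's implementation.

```python
import numpy as np

def render_ray(sigmas, sem_logits, deltas):
    """Composite per-sample densities and semantics along one ray
    (standard NeRF-style quadrature; names are illustrative).

    sigmas:     (N,) density at each sample
    sem_logits: (N, C) semantic logits at each sample
    deltas:     (N,) distance between consecutive samples
    """
    # Opacity of each sample: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: T_i = prod_{j<i} (1 - alpha_j)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas                     # contribution of each sample
    t = np.cumsum(deltas)                        # distance of each sample along the ray
    depth = np.sum(weights * t)                  # expected (rendered) depth
    sem = weights @ sem_logits                   # rendered semantic logits
    return depth, sem, weights
```

During training, the rendered `depth` and `sem` for each ray would be compared against the 2D depth and semantic labels of the corresponding pixel.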
RenderOcc Framework
The framework (Figure 1) integrates volume features from multi-view images using a network structure that is flexible to various encoder implementations. RenderOcc introduces two output heads that predict per-voxel semantics and density, collectively forming the Semantic Density Field (SDF). This representation supports the generation of 2D renderings for supervision during training, and is converted directly into 3D occupancy predictions at inference time.
Figure 1: Overall framework of RenderOcc. We extract volume features V and predict density σ and semantic S for each voxel through a 2D-to-3D network.
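Converting the Semantic Density Field into discrete occupancy can be sketched as thresholding the density to decide occupied vs. free, then taking the arg-max semantic class for occupied voxels. The threshold `tau` and the extra "free" label below are assumptions for illustration, not values from the paper.

```python
import numpy as np

def sdf_to_occupancy(density, sem_logits, tau=0.5):
    """Turn a Semantic Density Field into a discrete occupancy grid.

    density:    (...,) per-voxel density sigma
    sem_logits: (..., C) per-voxel semantic logits
    tau:        assumed density threshold (illustrative, not from the paper)
    """
    occupied = density > tau                 # geometry: occupied vs. free
    labels = sem_logits.argmax(axis=-1)      # semantics: per-voxel class id
    FREE = sem_logits.shape[-1]              # reserve an extra id for free space
    return np.where(occupied, labels, FREE)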
Auxiliary Ray Strategy
A critical challenge in applying 2D rendering supervision is the sparse viewpoint coverage of real-world autonomous driving scenarios, where cameras observe each region from only a narrow range of angles. RenderOcc addresses this by introducing Auxiliary Rays extracted from adjacent frames in the sequence, strengthening the multi-view consistency of the supervision. This is coupled with a dynamic sampling strategy, Weighted Ray Sampling (WRS), which balances training efficiency and performance by favoring informative, temporally aligned rays from adjacent frames.
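The sampling idea can be sketched as drawing a subset of rays with probability proportional to a per-ray weight. This is an illustrative stand-in for WRS; the paper's actual weighting scheme (and how weights are computed for auxiliary rays) may differ.

```python
import numpy as np

def weighted_ray_sampling(ray_weights, n_samples, rng=None):
    """Draw `n_samples` ray indices with probability proportional to
    `ray_weights` (illustrative stand-in for the paper's WRS)."""
    rng = rng or np.random.default_rng(0)
    p = ray_weights / ray_weights.sum()      # normalize weights to probabilities
    return rng.choice(len(ray_weights), size=n_samples, replace=False, p=p)
```

In practice the weights could encode, e.g., semantic class rarity or temporal distance of the source frame, so that rare classes and well-aligned auxiliary rays are sampled more often.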
Experimental Analysis
The experiments evaluate RenderOcc on two prominent 3D perception benchmarks, nuScenes and SemanticKITTI. RenderOcc's performance matches that of traditional 3D label-supervised models, a significant achievement given its reliance on 2D supervision alone.
Performance on nuScenes and SemanticKITTI
Notably, on the large-scale nuScenes dataset, RenderOcc achieves an average mIoU comparable to state-of-the-art models supervised with full 3D labels. With the aid of Auxiliary Rays, RenderOcc predicts small and distant objects more accurately, benefiting in particular from the rich semantic detail available in 2D labels.
Conclusion
RenderOcc stands as a significant step forward in the development of more cost-effective and scalable systems for 3D occupancy prediction. By optimizing the use of 2D labels through advanced rendering techniques, RenderOcc effectively challenges the traditional reliance on 3D annotations. The results testify to the method's robustness and suggest promising future pathways for further reducing dependencies on expensive data, potentially broadening the accessibility and applicability of 3D perception technologies in various domains.