
PV-RCNN++: LiDAR 3D Object Detection

Updated 20 January 2026
  • The paper presents PV-RCNN++ with novel point-voxel fusion strategies and efficient sampling techniques, significantly improving detection speed and precision.
  • It employs a two-stage architecture that leverages both global voxel features and fine-grained point details through methods like VectorPool and attention-based aggregation.
  • Empirical results on Waymo and KITTI datasets demonstrate enhanced mAP and FPS while effectively addressing spatial sparsity and foreground-background imbalances.

PV-RCNN++ refers to a class of LiDAR-based 3D object detectors that advance the PV-RCNN paradigm via improved point-voxel feature integration, efficient sampling strategies, and enhanced local and non-local feature aggregation. These methods achieve state-of-the-art accuracy and efficiency by addressing spatial sparsity, foreground-background imbalance, and real-time requirements on large-scale autonomous driving datasets. The two major published variants are: (1) PV-RCNN++ with Sectorized Proposal-Centric Sampling and VectorPool (Shi et al., 2021), and (2) PV-RCNN++ with Semantical Point-Voxel Feature Interaction (Wu et al., 2022). Both adopt a two-stage architecture in which proposal generation is separated from fine-grained box refinement, combining sparse-voxel CNN backbones with pointwise set abstraction modules for multi-scale feature fusion.

1. Two-Stage Network Architecture and Feature Intertwining

PV-RCNN++ operates in a two-stage regime. The raw point cloud P = {p_i ∈ ℝ³}, i = 1…M, is voxelized and processed with a 3D sparse convolutional backbone to generate multi-scale voxel features at four spatial resolutions (strides 1, 2, 4, 8). A region proposal network (RPN), typically anchor-based, produces candidate bounding boxes from these bird’s-eye-view (BEV) representations.
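
The voxelization step can be sketched as bucketing raw points into integer grid coordinates (a minimal sketch assuming a single cubic voxel size; the function name and interface are illustrative, not the papers' implementation):

```python
def voxelize(points, voxel_size):
    """Toy voxelization: bucket raw LiDAR points (x, y, z) into integer
    voxel coordinates. A sparse 3D CNN then operates only on the occupied
    voxels, exploiting the spatial sparsity of the scene."""
    voxels = {}
    for p in points:
        key = tuple(int(c // voxel_size) for c in p)  # integer grid index
        voxels.setdefault(key, []).append(p)
    return voxels
```

In practice the backbone consumes per-voxel features (e.g. mean of contained points) rather than raw point lists, but the sparse indexing idea is the same.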

Critical to the PV-RCNN family is the “deep feature intertwining” between point-based and voxel-based abstraction:

  • Voxel-to-Keypoint Scene Encoding (VSA): For each sampled keypoint q_i, voxel features in a local radius are aggregated with positional offsets, passed through PointNet-style shared MLPs and max-pooling at multiple scales.
  • Keypoint-to-Grid RoI Feature Abstraction: For each region proposal, a fixed 3D RoI grid is built, and keypoints within a given radius are grouped for feature aggregation at each grid cell. Resulting features are pooled and refined via MLPs before final bounding box regression and classification.

This structure ensures that the voxel grid provides global context and spatial organization, while pointwise abstraction preserves fine-grained geometric detail.
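
The voxel-to-keypoint step above can be sketched as radius grouping followed by a shared per-neighbor map and channel-wise max-pooling (a minimal single-scale sketch; the real module runs at multiple radii over sparse-CNN feature volumes, and `mlp` here stands in for the learned shared MLP):

```python
import math

def voxel_set_abstraction(keypoint, voxels, radius, mlp):
    """Toy single-scale voxel set abstraction around one keypoint.
    `voxels` is a list of (center_xyz, feature_list); `mlp` is any map
    applied to the concatenated [relative offset, voxel feature]."""
    pooled = None
    for center, feat in voxels:
        if math.dist(keypoint, center) <= radius:
            offset = [c - k for c, k in zip(center, keypoint)]
            local = mlp(offset + feat)  # shared map per neighbor
            pooled = local if pooled is None else [
                max(a, b) for a, b in zip(pooled, local)]  # channel-wise max-pool
    return pooled
```

The positional offsets keep the aggregation sensitive to where each voxel sits relative to the keypoint, which plain pooling of features alone would discard.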

2. Efficient Keypoint Sampling: Sectorized and Semantic-Guided Approaches

Original PV-RCNN employs vanilla farthest point sampling (FPS) for keypoint selection, which is computationally expensive and may not preferentially focus on foreground regions.

  • Sectorized Proposal-Centric (SPC) Sampling (Shi et al., 2021): Instead of sampling among all points (O(M·n)), SPC first filters to points near proposal boxes, then partitions the remaining points into s angular sectors about the LiDAR origin. FPS is performed in parallel within each sector, and the results are concatenated. This reduces keypoint sampling time from ~133 ms to ~9 ms on Waymo scenes, with constant spatial coverage (~84.8%) and higher recall near objects.
  • Semantic-Guided Farthest Point Sampling (S-FPS) (Wu et al., 2022): Each point is assigned a semantic foreground score from a lightweight segmentation branch. During sampling, inter-point distances are “boosted” by the exponential of their semantic score, so that likely-object points are preferentially selected. This results in improved recall for small or sparse classes since sampling is driven by semantic discriminativeness.
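
The semantic boosting in S-FPS can be sketched as ordinary farthest point sampling with each candidate's distance-to-selected-set scaled by exp(γ · score) before the argmax (an illustrative reading of the idea; the seed choice and exact weighting form are assumptions of this sketch):

```python
import math

def semantic_fps(points, scores, n_samples, gamma=1.0):
    """Toy semantic-guided FPS: like FPS, but a point's distance to the
    already-selected set is multiplied by exp(gamma * semantic_score),
    so likely-foreground points are picked preferentially."""
    selected = [0]  # arbitrary seed point
    min_d = [math.dist(p, points[0]) for p in points]
    while len(selected) < n_samples:
        # boost each candidate's distance by its semantic score
        best = max(range(len(points)),
                   key=lambda i: min_d[i] * math.exp(gamma * scores[i]))
        selected.append(best)
        for i, p in enumerate(points):
            min_d[i] = min(min_d[i], math.dist(p, points[best]))
    return selected
```

With all scores equal this reduces to plain FPS; a high score can pull in a point that is geometrically close to an already-selected one, which is exactly the behavior that helps small, sparse foreground objects.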

3. Local Feature Aggregation: From Set Abstraction to VectorPool and Voxel Attention

Traditional set abstraction modules (as in PointNet++) apply shared MLPs to each neighbor in a group and aggregate via max-pooling, leading to heavy memory and FLOP overhead.

  • VectorPool Aggregation (Shi et al., 2021): The local neighborhood of each keypoint is discretized into a grid of sub-voxels. For each sub-voxel, features are interpolated from the nearest points/voxels, assembled into a fixed-length vector, and processed jointly by a single MLP, eliminating the per-neighbor MLPs and max-pooling. This preserves local geometric structure, is position-sensitive, and reduces compute by ≈4.7 GFLOPs per frame and memory by ~3 GB on Waymo scenes.
  • Attention-based Residual PointNet (Wu et al., 2022): Instead of only local aggregation, grouped voxel features are fused via a multi-head self-attention mechanism, yielding non-local receptive fields. After attention, features are concatenated with the original group and passed through a residual 1D-conv/BN/ReLU block followed by max-pooling. This design expands feature context adaptively and empirically improves detection accuracy, especially on small or complex objects.

4. Fast Voxel Query for Multi-Scale Feature Grouping

Efficiently associating point-based and voxel-based features is critical for runtime.

  • Manhattan-Distance Voxel Query (Wu et al., 2022): For each keypoint, instead of an O(N) scan over all voxels (as in ball query), neighborhoods are enumerated by integer lattice offsets within an L1-norm threshold I around the center voxel. This samples a fixed, bounded number of nearby voxels (M ≪ N) in O(M) time. Empirically, setting I = 4 yields up to ~49 offsets per keypoint with no drop in accuracy and a grouping speedup from 12.5 Hz to 15.15 Hz.
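
The bounded enumeration can be sketched by precomputing the integer offsets inside the L1 ball of radius I (a sketch of the idea, not the paper's CUDA kernel):

```python
def manhattan_offsets(I):
    """All integer voxel offsets (dx, dy, dz) with Manhattan (L1) distance
    at most I from the center voxel. Precomputing this list makes each
    voxel query cost O(M) for a fixed M, independent of the total voxel
    count N."""
    return [(dx, dy, dz)
            for dx in range(-I, I + 1)
            for dy in range(-I, I + 1)
            for dz in range(-I, I + 1)
            if abs(dx) + abs(dy) + abs(dz) <= I]
```

Paired with a hash map from voxel coordinate to feature, each offset becomes an O(1) lookup, so the whole query never touches voxels outside the fixed neighborhood.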

5. Training, Inference, and Detection Head

PV-RCNN++ employs standard and custom losses to optimize detection and semantic keypoint selection:

  • Losses: The total loss comprises a semantic segmentation loss (L_seg), RPN classification/regression/direction losses (L_rpn), RCNN box refinement losses (L_rcnn), and a keypoint weighting loss (L_key). Weighted BCE, focal loss, and smooth-L1 are used as appropriate.
  • RoI-grid Pooling: Each proposal is discretized into a 6 × 6 × 6 grid. For every cell, nearby keypoints are grouped, aggregated, and passed through small PointNet/MLP heads. The outputs regress refined boxes and assign classification scores.
  • Inference: Non-maximum suppression (NMS) is applied pre- and post-refinement. On 150 m × 150 m Waymo scenes, PV-RCNN++ runs at an average of 10 FPS, compared to 3.3 FPS for the PV-RCNN baseline.
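
Among the loss components above, the smooth-L1 regression term has a standard closed form (sketch with a placeholder β; the papers' exact term weighting is not reproduced here):

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 (Huber-style) loss for box regression: quadratic for
    residuals smaller than beta, linear beyond it, which bounds the
    gradient contributed by outlier boxes."""
    d = abs(pred - target)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta
```

In practice this is applied per box parameter (center, size, heading encoding) and summed, with the classification branches using BCE or focal loss as stated above.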

6. Empirical Evaluation and Ablation Results

On the Waymo Open Dataset, PV-RCNN++ (Shi et al., 2021) outperforms both PV-RCNN and CenterPoint in both accuracy and speed. On the KITTI dataset, the semantic-interaction PV-RCNN++ variant (Wu et al., 2022) achieves state-of-the-art 3D mAP, especially notable on Cyclist and Pedestrian classes.

Method | Car (Mod) | Pedestrian (Mod) | Cyclist (Mod) | FPS | FLOPs per Frame | Keypoint Sampling (ms)
PV-RCNN | 81.40 | 39.04 | 63.12 | 3.3 | +4.68 GF | 133
PV-RCNN++ [2102] | 81.60 | 40.18 | 68.21 | 10 | baseline | 9
PV-RCNN++ [2208] | 81.60 | 40.18 | 68.21 | 15.15 | baseline | -

Ablation studies confirm that:

  • Sectorized/proposal-centric sampling reduces sampling latency by a factor of ~15, with no loss in spatial coverage or downstream accuracy.
  • VectorPool delivers −4.68 GFLOPs, −3 GB RAM, and +1.71% average mAPH versus set abstraction, with heightened improvements for fine classes (Pedestrian/Cyclist).
  • Semantic-guided FPS (S-FPS, γ = 1) further improves moderate Car AP by +0.75 over FPS. The Manhattan voxel query maintains AP while increasing grouping throughput. Attention-based grouping yields an additional +0.19 Car AP.

7. Innovations, Impact, and Significance

PV-RCNN++ demonstrates that computational bottlenecks in point-voxel abstraction can be addressed through intelligent sampling (sectorized or semantic-guided), compact and structure-aware aggregation (VectorPool or attention-based PointNet), and fast L1-based voxel queries. These innovations contribute to increased throughput (up to 3×), reduced compute/memory costs, and up to a 2–5% absolute gain in average precision over prior PV-RCNN, across both large-scale (Waymo) and moderate (KITTI) benchmarks. The modular architecture and explicit fusion of semantic segmentation for foreground focus further improve recall for small and challenging object categories. These advances solidify PV-RCNN++ as a reference framework in real-time, high-precision 3D LiDAR object detection (Shi et al., 2021, Wu et al., 2022).
