PointCNN++: Efficient 3D Convolution

Updated 5 December 2025

The paper introduces a point-centric convolution that avoids coarse grid snapping, achieving sub-voxel precision in 3D point cloud processing.
It leverages an efficient MVMR GPU kernel that minimizes memory overhead and latency by grouping computations and reusing weights.
Experimental benchmarks show that PointCNN++ outperforms traditional voxel and point-based methods in memory usage, speed, and registration accuracy.

PointCNN++ is a generalized 3D convolutional architecture for point cloud data that operates directly on native, high-precision point coordinates. It removes the traditional precision-performance trade-off by combining the flexibility of point-based approaches with the efficiency of voxel-based convolutions. Central to the design are a point-centric formulation and a Matrix-Vector Multiplication and Reduction (MVMR) GPU kernel, which together deliver accuracy, memory efficiency, and computational performance across a wide range of point cloud learning tasks (Li et al., 28 Nov 2025).

1. Mathematical Formulation of Point-Centric Convolution

PointCNN++ defines convolution on native points with the following components:

$P^{in} = \{p_j \in \mathbb{R}^3\}_{j=1}^{N_{in}}$ : input point coordinates,
$F^{in} = \{f_j \in \mathbb{R}^{C_{in}}\}_{j=1}^{N_{in}}$ : input features,
$P^{out} = \{q_i \in \mathbb{R}^3\}_{i=1}^{N_{out}}$ : output (convolution center) points,
$F^{out} = \{g_i \in \mathbb{R}^{C_{out}}\}_{i=1}^{N_{out}}$ : output features,
$W = \{W_k \in \mathbb{R}^{C_{out} \times C_{in}}\}_{k=1}^K$ : $K$ learnable kernel matrices ($K = t^3$ for $t \times t \times t$ local kernels).

For each output point $q_i$, a local adaptive voxelization of size $t$ at a given cell resolution is centered at $q_i$. If input point $p_j$ falls into the $k$-th local cell, the operator sets up a triplet $(i, j, k)$. Formally, if $\mathcal{T}$ is the triplet set, the convolution is:

$g_i = \sum_{(i, j, k) \in \mathcal{T}} W_k f_j$

This mechanism avoids global quantization entirely, and the neighborhood inclusion for each $q_i$ is determined either by KNN or radius search on true input coordinates.
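
The formulation above can be sketched as a slow reference implementation in plain Python. This is only an illustration under stated assumptions: the names `local_cell_index`, `build_triplets`, `point_conv`, and the parameter `cell` (the edge length of one local cell) are hypothetical, and the neighborhood is taken to be every input point that falls inside the local $t \times t \times t$ grid rather than a KNN or radius search.

```python
import math

def local_cell_index(p, q, t, cell):
    """Flat index k in [0, t**3) of the local cell around q that contains p, else None."""
    idx = []
    for pc, qc in zip(p, q):
        # Shift by t/2 so the grid is centered on q (assumed convention).
        a = math.floor((pc - qc) / cell + t / 2.0)
        if a < 0 or a >= t:
            return None
        idx.append(a)
    return (idx[0] * t + idx[1]) * t + idx[2]

def build_triplets(P_in, P_out, t, cell):
    """Brute-force triplet set {(i, j, k)}: input j lies in cell k of output i."""
    triplets = []
    for i, q in enumerate(P_out):
        for j, p in enumerate(P_in):
            k = local_cell_index(p, q, t, cell)
            if k is not None:
                triplets.append((i, j, k))
    return triplets

def matvec(W, f):
    """Multiply a C_out x C_in matrix (list of rows) by a length-C_in vector."""
    return [sum(w * x for w, x in zip(row, f)) for row in W]

def point_conv(P_in, F_in, P_out, W, t, cell):
    """g_i = sum over triplets (i, j, k) of W_k f_j."""
    c_out = len(W[0])
    G = [[0.0] * c_out for _ in P_out]
    for i, j, k in build_triplets(P_in, P_out, t, cell):
        y = matvec(W[k], F_in[j])
        for c in range(c_out):
            G[i][c] += y[c]
    return G
```

Note that the output center `q` is never snapped to a grid; only the offsets `p - q` are binned, which is what keeps the centers at native precision in this sketch.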

Main advantages over voxel-based convolutions:

No snapping of centers to coarse grids
Accurate (non-proxy) neighborhood selection
Kernel resolution is local and tunable, not globally fixed
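
To make the first point concrete, the following small sketch measures how far a convolution center moves when it is snapped to a voxel grid. The helper names and the convention that a voxel is represented by its geometric center are assumptions made for illustration; they are not taken from the paper.

```python
import math

def snap_to_voxel_center(p, voxel):
    """Replace each coordinate by the center of the voxel that contains it."""
    return tuple((math.floor(c / voxel) + 0.5) * voxel for c in p)

def snapping_error(p, voxel):
    """Euclidean distance between the native point and its snapped position."""
    return math.dist(p, snap_to_voxel_center(p, voxel))

def max_snapping_error(voxel):
    """Worst case: half the voxel diagonal."""
    return voxel * math.sqrt(3) / 2.0
```

A point-centric operator that keeps $q_i$ at its native coordinates has zero error of this kind, whereas a grid-based operator can displace a center by up to half a voxel diagonal.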

2. MVMR—Matrix-Vector Multiplication and Reduction Formalism

PointCNN++ reformulates the convolution as an unstructured sum of small MVMs followed by reduction over triplet indices. For a global triplet list $\mathcal{T} = \{(i, j, k)\}$:

$g_i = \sum_{(i, j, k) \in \mathcal{T}} W_k f_j, \quad i = 1, \dots, N_{out}$

Each element of this sum is a small MVM, $W_k f_j$, aggregated into $g_i$ by atomic addition. The computational complexity is $O(|\mathcal{T}| \cdot C_{in} \cdot C_{out})$, and the main memory cost is streaming reads of kernel weights and input features, with writes into the final output tensor.
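
As a rough illustration of this reduction, the sketch below walks a flat triplet list in arbitrary order, performs one small matrix-vector product per triplet, and adds the result into the output row. The function name `mvmr` and the multiply-accumulate counter are hypothetical additions; the in-place `+=` merely stands in for the atomic addition a GPU kernel would use.

```python
def mvmr(triplets, W, F_in, n_out):
    """Matrix-vector multiplication and reduction over a flat triplet list.

    Returns the output features and the number of multiply-accumulates,
    which equals len(triplets) * C_in * C_out.
    """
    c_out, c_in = len(W[0]), len(W[0][0])
    F_out = [[0.0] * c_out for _ in range(n_out)]
    macs = 0
    for i, j, k in triplets:
        for r in range(c_out):
            acc = 0.0
            for c in range(c_in):
                acc += W[k][r][c] * F_in[j][c]
                macs += 1
            F_out[i][r] += acc  # stand-in for an atomic add into g_i
    return F_out, macs
```

Because the reduction is a plain sum, the triplets can be processed in any order, which is the freedom that a sorted, grouped GPU schedule would exploit.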

3. Dedicated Native-Point GPU Kernel Design

To achieve high throughput and minimal memory usage, PointCNN++ introduces a GPU kernel tailored to the MVMR pattern:

Sorts the triplet list $\mathcal{T}$ by kernel index $k$ to enable weight reuse in on-chip cache.
Groups a fixed number of consecutive triplets (the group size) for warp-level execution.
Tiles kernel matrices into fixed-size subblocks for efficient on-chip MVM.
Uses atomic adds only for the output reduction.
Requires zero intermediate buffers beyond input, output, and kernel tensors.

A high-level algorithmic flow:

For each block of triplets: load kernel weight $W_k$ and feature $f_j$, perform MVMs, and accumulate in a register.
Output is written with a single atomic add per output index.
The default hyperparameters (group size and tile sizes) yield robust performance across architectures and tasks.
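
The flow above can be imitated on the CPU to see why sorting and grouping help. The sketch below is not the actual GPU kernel: `mvmr_sorted_grouped`, the default `group_size=4`, and the two counters are hypothetical, and "one atomic add per output index" is interpreted here as one add per distinct output index within each group.

```python
def mvmr_sorted_grouped(triplets, W, F_in, n_out, group_size=4):
    """CPU imitation of a sorted, grouped MVMR schedule.

    Returns (F_out, weight_loads, atomic_adds).
    """
    c_out = len(W[0])
    F_out = [[0.0] * c_out for _ in range(n_out)]
    # Sort by kernel index so consecutive triplets tend to share W_k.
    ordered = sorted(triplets, key=lambda tr: (tr[2], tr[0]))
    weight_loads = 0
    atomic_adds = 0
    for start in range(0, len(ordered), group_size):
        group = ordered[start:start + group_size]
        cached_k, cached_W = None, None
        registers = {}  # per-group accumulators keyed by output index
        for i, j, k in group:
            if k != cached_k:  # weight fetched only when k changes
                cached_k, cached_W = k, W[k]
                weight_loads += 1
            y = [sum(w * x for w, x in zip(row, F_in[j])) for row in cached_W]
            reg = registers.setdefault(i, [0.0] * c_out)
            for r in range(c_out):
                reg[r] += y[r]
        for i, reg in registers.items():  # one flush per output index
            for r in range(c_out):
                F_out[i][r] += reg[r]
            atomic_adds += 1
    return F_out, weight_loads, atomic_adds
```

In this toy model a naive schedule would issue one weight load and one atomic add per triplet; after sorting, both counts drop whenever neighboring triplets share a kernel index or an output index.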

4. Paradigm Comparison with Prior 3D Convolutional Operators

The following table summarizes qualitative distinctions among Voxel-based, Point + Transform (PointCNN/KPConv), and PointCNN++ convolutions:

Method	Output Centers $q_i$	Precision	Memory Overhead
Voxel-based	quantized voxel grid	sub-voxel lost	[…] + hashmaps
Point+Transform	points $\rightarrow$ dense tensor	high, slow	[…] (padded)
PointCNN++	original input points	full fidelity	zero extra (beyond tensors)

Key differentiation:

PointCNN++ achieves the same theoretical compute complexity as sparse voxel and past point-based convolutions but requires no intermediate dense materialization or memory allocation, retaining true geometric detail at native resolution (Li et al., 28 Nov 2025).

5. Experimental Benchmarking

A. Micro-benchmarks (ResNet-18, […] kernel, […], RTX 4090, 1M points)

GPU memory consumption (forward/backward):
- MinkowskiEngine: ~1.2 GB / 2.3 GB
- TorchSparse++: ~1.0 GB / 2.0 GB
- KPConv: ~5.5 GB / 8.0 GB
- PointCNN++: 0.37 GB / 0.59 GB
Latency per iteration (forward/backward):
- fVDB: 49.6 ms / 105.5 ms
- TorchSparse++: 55 ms / 120 ms
- KPConv: 120 ms / 240 ms
- PointCNN++: 60.4 ms / 75.5 ms
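
The relative factors implied by these figures can be checked with a few lines of arithmetic. The dictionaries below simply transcribe the numbers listed above; the helper `ratios` is hypothetical, and the label fVDB for the first latency row is an assumption about the intended library name.

```python
# (forward, backward) figures transcribed from the lists above.
memory_gb = {
    "MinkowskiEngine": (1.2, 2.3),
    "TorchSparse++": (1.0, 2.0),
    "KPConv": (5.5, 8.0),
    "PointCNN++": (0.37, 0.59),
}
latency_ms = {
    "fVDB": (49.6, 105.5),
    "TorchSparse++": (55.0, 120.0),
    "KPConv": (120.0, 240.0),
    "PointCNN++": (60.4, 75.5),
}

def ratios(table, ref="PointCNN++"):
    """Each method's (forward, backward) value divided by the reference's."""
    ref_f, ref_b = table[ref]
    return {
        name: (f / ref_f, b / ref_b)
        for name, (f, b) in table.items()
        if name != ref
    }

mem_ratio = ratios(memory_gb)
lat_ratio = ratios(latency_ms)
for name, (f, b) in {**mem_ratio}.items():
    print(f"memory  {name}: {f:.2f}x forward, {b:.2f}x backward")
for name, (f, b) in lat_ratio.items():
    print(f"latency {name}: {f:.2f}x forward, {b:.2f}x backward")
```

A ratio above 1 means the other method uses more memory or time than PointCNN++. On these numbers the memory advantage holds in every row, while the latency advantage is clearest in the backward pass.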

B. Point Cloud Registration

KITTI Odometry (FCGF backbone replacement):
- Relative Translation Error: […] m (best)
- Relative Rotation Error: […] (2nd best)
- Recall @ […]: […] (best)
- Parameters: […]M
3DMatch (varied samples):
- Registration Recall @5000 pts: 90.3%
- Feature Matching Recall: 98.9% (best)
- Inlier Ratio: 58.2%

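These results indicate that PointCNN++, used as a plug-and-play backbone, delivers state-of-the-art registration and matching performance. In the micro-benchmarks above it uses roughly 14x less GPU memory than KPConv (0.37 GB vs. ~5.5 GB forward, 0.59 GB vs. ~8.0 GB backward) and runs about 2x to 3x faster (60.4 ms vs. 120 ms forward, 75.5 ms vs. 240 ms backward) (Li et al., 28 Nov 2025).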

6. Guidelines for Adoption and Recognized Limitations

Integration: PointCNN++ is designed as a drop-in replacement for sparse-convolution backbones (e.g., MinkowskiEngine); the rest of an existing network architecture requires no changes. Neighborhood selection should use fixed-radius or KNN search on true coordinates, with the kernel resolution chosen to suit local complexity.
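
The two neighborhood-selection options can be sketched with brute-force searches on the native coordinates. The function names are hypothetical and the linear scans are for illustration only; a practical pipeline would presumably use a spatial index or a GPU search.

```python
import math

def radius_neighbors(P_in, q, radius):
    """Indices of input points within `radius` of the center q."""
    return [j for j, p in enumerate(P_in) if math.dist(p, q) <= radius]

def knn_neighbors(P_in, q, k):
    """Indices of the k input points closest to the center q."""
    order = sorted(range(len(P_in)), key=lambda j: math.dist(P_in[j], q))
    return order[:k]
```

A fixed radius gives neighborhoods of constant physical extent but variable size, while KNN gives a constant neighbor count whose extent adapts to local density.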

Hyperparameters: The default MVMR kernel settings for group size and tile sizes perform robustly across scenarios and modern hardware.

Advantages:

Maintains sub-voxel geometric fidelity, crucial for registration, surface normal estimation, and fine segmentation.
Enables training with larger batch sizes and/or higher point counts due to minimal memory overhead.
Achieves balanced and competitive latency for both inference and training.

Limitations and Future Directions:

The sort overhead associated with highly dynamic neighborhoods can become significant; an automated tuner for sort axis may ameliorate this in specialized contexts.
For extremely large-scale point clouds (e.g., […]M points), external-memory or hierarchical sampling frameworks must be adopted.
Extension to anisotropic or deformable kernels (as in KPConv) would require additional complexity in triplet set construction.
Hybrid models incorporating attention or transformer modules atop the PointCNN++ backbone are a prospective avenue for broadening receptive field and context.

7. Context and Impact

PointCNN++ fundamentally generalizes sparse convolution from voxels to points, establishing voxel-based methods as a special, quantized case of point-centric convolution. Its architectural and kernel-level innovations address the geometric fidelity limitations that constrained prior efficient 3D learning paradigms, while establishing new memory and speed baselines. The formulation demonstrates that fine-grained geometric detail and high performance in point cloud deep learning are compatible and sets a path for high-fidelity, efficient 3D representation learning across segmentation, detection, and registration pipelines (Li et al., 28 Nov 2025).

References

Li et al., PointCNN++: Performant Convolution on Native Points (28 Nov 2025)
