Rotationally Invariant Features (RIF)
- Rotationally Invariant Features (RIF) are characteristics that remain consistent under rotations, ensuring reliable 3D point cloud analysis.
- They combine convolution-based local extraction and transformer-based global context to capture fine-grained geometry and long-range dependencies.
- CTF-Net’s dual-direction feature transmission demonstrates how RIF can be integrated to achieve state-of-the-art accuracy in classification and segmentation tasks.
The Convolutional Transform Feature Network (CTF-Net) is a high-performance deep learning architecture for point cloud analysis that achieves simultaneous extraction and effective fusion of local and global features. Its design centers on the CT-block, a dual-branch module combining convolution-based local processing and transformer-based global relations, unified through learnable feature transmission bridges. CTF-Net is suited for tasks such as 3D shape classification and part segmentation, demonstrating state-of-the-art accuracy and efficiency by leveraging joint feature learning and judicious architectural modularity (Guo et al., 2021).
1. Motivation and Context
Point cloud data, defined as unordered sets of 3D points, encode both fine-grained local geometry and broader structural context. Prevailing local feature methods—including PointNet++ and graph convolutional networks—emphasize neighborhood aggregation but struggle to model long-range dependencies. Conversely, transformer-based attention models can encode global relationships, yet lack effective priors for local geometric detail. Empirical deficiencies in both regimes motivate networks that fuse these paradigms. The CT-block addresses this by coupling a convolutional branch for locality and a transformer branch for globality, with dual-direction feature transmission for mutual guidance and semantic bridging (Guo et al., 2021).
2. Architecture of the CT-block
The CT-block forms the atomic unit of CTF-Net, operating on two parallel feature streams at each stage: the local feature $F_l^{in}$ and the global feature $F_g^{in}$. It yields updated and fused representations $F_l^{out}$, $F_g^{out}$ through:
2.1 Convolution-branch (Local Feature Extraction)
- Sampling and Grouping (SG): Farthest-point sampling (FPS) selects $N_l^{out}$ center points; the $S$ nearest neighbors are grouped for each, producing a tensor of shape $(N_l^{out}, S, C_l^{in})$.
- First MLP block ("conv₁"): Point-wise Linear → BN → ReLU (LBR) transforms the grouped features to dimension $C_2$.
- Feature Transmission (global→local): The global feature $F_g^{in}$ is down-sampled and projected by $ft_2$ to shape $(N_l^{out}, C_2)$, then broadcast over the $S$ neighbors and added.
- Second MLP block ("conv₂"): A second LBR layer maps the fused features to the output dimension $C_{out}$.
- Max-Pooling: Aggregates over the neighbor dimension, yielding $F_l^{out} \in \mathbb{R}^{N_l^{out} \times C_{out}}$.
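The sampling-and-grouping step above can be sketched in NumPy; `farthest_point_sampling` and `group_knn` are illustrative helper names for this summary, not the paper's implementation:

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Greedy FPS: iteratively pick the point farthest from the chosen set."""
    n = points.shape[0]
    chosen = np.zeros(n_samples, dtype=int)
    dist = np.full(n, np.inf)
    chosen[0] = 0  # start from an arbitrary point
    for i in range(1, n_samples):
        diff = points - points[chosen[i - 1]]
        dist = np.minimum(dist, np.einsum("ij,ij->i", diff, diff))
        chosen[i] = int(np.argmax(dist))
    return chosen

def group_knn(points, center_idx, k):
    """For each sampled center, gather its k nearest neighbors -> (M, k, 3)."""
    centers = points[center_idx]                                     # (M, 3)
    d2 = ((centers[:, None, :] - points[None, :, :]) ** 2).sum(-1)   # (M, N)
    nn_idx = np.argsort(d2, axis=1)[:, :k]
    return points[nn_idx]

pts = np.random.rand(128, 3)
idx = farthest_point_sampling(pts, 32)
groups = group_knn(pts, idx, 8)
print(groups.shape)  # (32, 8, 3)
```

A production implementation would run FPS on the GPU and reuse the pairwise distances for grouping; the quadratic-memory KNN here is for clarity only.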
2.2 Transformer-branch (Global Feature Extraction via Offset-Attention)
- Projection: Queries, keys, and values are computed as $Q = F_g W_q$, $K = F_g W_k$, $V = F_g W_v$, with learned projection matrices $W_q$, $W_k$, $W_v$.
- Attention Matrix: $\tilde{A} = Q K^{\top}$.
- Offset-Attention Normalization: Double softmax normalization, $\bar{a}_{ij} = \operatorname{softmax}_i(\tilde{a}_{ij})$ followed by $\ell_1$ normalization over the key axis, $a_{ij} = \bar{a}_{ij} / \sum_k \bar{a}_{ik}$.
- Context Aggregation: $F_a = A V$; the offset residual follows: $F_g^{out} = \operatorname{LBR}(F_g^{in} - F_a) + F_g^{in}$.
- Feature Transmission (local→global): Local features are upsampled, projected, and incorporated before QKV calculation.
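A minimal NumPy sketch of the transformer branch, assuming the PCT-style double normalization (softmax along the query axis, then $\ell_1$ along the key axis) and stubbing the LBR layer with identity:

```python
import numpy as np

def normalize_offset_attention(Q, K):
    """Offset-attention normalization: softmax over the query axis,
    then L1 normalization over the key axis (double normalization)."""
    logits = Q @ K.T                                   # (N, N)
    a = np.exp(logits - logits.max(axis=0, keepdims=True))
    a = a / a.sum(axis=0, keepdims=True)               # softmax along axis 0
    a = a / (a.sum(axis=1, keepdims=True) + 1e-9)      # L1 norm along axis 1
    return a

def offset_attention(F, Wq, Wk, Wv):
    """Single-head offset-attention with the (F - AV) residual.
    LBR is stubbed by identity; a real block applies Linear+BN+ReLU."""
    Q, K, V = F @ Wq, F @ Wk, F @ Wv
    A = normalize_offset_attention(Q, K)
    F_a = A @ V
    return (F - F_a) + F

rng = np.random.default_rng(0)
F = rng.normal(size=(16, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)), rng.normal(size=(8, 4)),
              rng.normal(size=(8, 8)))
out = offset_attention(F, Wq, Wk, Wv)
print(out.shape)  # (16, 8)
```

After the double normalization, each row of the attention matrix sums to one, so the aggregated context $AV$ stays on the same scale as $V$.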
2.3 Feature Transmission Elements
Two one-way mappings bridge local/global streams:
- $ft_1$ (local→global): Upsample via distance-weighted interpolation, then linear + BN to match the global stream's point count and channel dimension.
- $ft_2$ (global→local): Downsample global features to the sampled point subset, then linear + BN to match the local stream's dimensions. Each bridge ensures alignment in both point count and feature dimensionality, facilitating bi-directional information flow and semantic fusion.
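The distance-weighted interpolation used by the local→global bridge can be sketched as inverse-distance weighting over the $k$ nearest source points (the helper name and the $k=3$ default are assumptions for illustration):

```python
import numpy as np

def upsample_interpolate(src_xyz, src_feat, dst_xyz, k=3):
    """Propagate features from a sparse point set (src) to a denser one
    (dst) by inverse-distance weighting over the k nearest source points."""
    d2 = ((dst_xyz[:, None, :] - src_xyz[None, :, :]) ** 2).sum(-1)  # (Nd, Ns)
    nn = np.argsort(d2, axis=1)[:, :k]                               # (Nd, k)
    nd2 = np.take_along_axis(d2, nn, axis=1)
    w = 1.0 / (nd2 + 1e-8)
    w = w / w.sum(axis=1, keepdims=True)                             # normalize
    return (src_feat[nn] * w[..., None]).sum(axis=1)                 # (Nd, C)

src = np.random.rand(32, 3)
feat = np.random.rand(32, 16)
dst = np.random.rand(128, 3)
up = upsample_interpolate(src, feat, dst)
print(up.shape)  # (128, 16)
```

When a destination point coincides with a source point, the inverse-distance weight concentrates on it, so source features are recovered almost exactly; a linear + BN layer would follow to match the target channel width.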
3. CTF-Net Backbone Construction
CTF-Net is realized by stacking CT-blocks.
3.1 Classification Pipeline
- Input: An $N \times 3$ point cloud (xyz coordinates); $N = 1024$ in the classification experiments.
- Initial Embeddings: Parallel branches: FPS + MLP for the local stream, MLP for the global stream.
- Stagewise CT-blocks: Each applies the convolution-branch (downsample point count by $2$, double channels) and the transformer-branch (dimension preserved).
- Heads: The local head max-pools then applies fully-connected layers; the global head concatenates the global features, max-pools, and applies fully-connected layers.
- Loss: Dual cross-entropy; the two outputs are averaged at inference.
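The dual-head inference step can be sketched as follows; averaging the two heads' softmax probabilities is an assumption consistent with "outputs averaged in inference":

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dual_head_predict(local_logits, global_logits):
    """Average class probabilities from the local and global heads,
    then take the argmax as the final prediction."""
    p = 0.5 * (softmax(local_logits) + softmax(global_logits))
    return p.argmax(axis=-1)

local = np.array([[2.0, 0.1, 0.1], [0.1, 3.0, 0.1]])
glob  = np.array([[1.5, 0.2, 0.2], [0.2, 2.5, 0.2]])
print(dual_head_predict(local, glob))  # [0 1]
```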
3.2 Segmentation Pipeline
- Encoder: Identical stacked CT-block structure as in classification.
- Decoder: Upsample local features at each level with optional skip connections, ending in per-point heads; global features are concatenated and passed through an analogous global head.
- Loss: Dual per-point cross-entropy; summed prediction at inference.
4. Forward Pass Pseudocode
The core logic of a CT-block forward pass is:
```python
def CT_block_forward(F_l_in, F_g_in):
    # Convolution-branch (local)
    sample_idx = FPS(F_l_in, N_l_out)
    F_grp = group_points(F_l_in, sample_idx, S)   # (N_l_out, S, C_l_in)
    F2 = LBR1(F_grp)                              # (N_l_out, S, C2)
    F_g_proj = ft2(F_g_in)                        # (N_l_out, C2)
    F2p = F2 + broadcast(F_g_proj, dim=S)         # inject global context
    F3 = LBR2(F2p)                                # (N_l_out, S, C_out)
    F_l_out = max_pool(F3, dim="neighbors")       # (N_l_out, C_out)

    # Transformer-branch (global)
    F_loc_proj = ft1(F2)                          # (N_g, C_g_in)
    Q, K, V = linear_qkv(F_g_in + F_loc_proj)     # inject local context
    A = normalize_offset_attention(Q, K)
    F_a = A @ V
    F_g_out = LBR(F_g_in - F_a) + F_g_in          # offset-attention residual
    return F_l_out, F_g_out
```
5. Hyper-parameters and Training Protocols
- Backbone: Stacked CT-block stages.
- Neighbors per group: $S = 32$.
- Transformer embedding/attention: Dimension $d = 256$, single-head.
- Channels: Feature dimension doubles at each stage.
- Optimization: SGD with $0.9$ momentum; cosine-annealed learning rate.
- Loss: Dual cross-entropy, one per head (global/local).
- Augmentation: For classification, random rotation and point jitter; for segmentation, anisotropic scaling.
- Segmentation inference: Single- and multi-scale testing.
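The cosine-annealed schedule can be written in a few lines; the initial learning rate and step counts below are placeholders, not values from the paper:

```python
import math

def cosine_lr(step, total_steps, lr_init):
    """Anneal the learning rate from lr_init down to 0 over total_steps."""
    return 0.5 * lr_init * (1.0 + math.cos(math.pi * step / total_steps))

# e.g. over 200 epochs starting from a placeholder lr_init = 0.01
print(cosine_lr(0, 200, 0.01))    # 0.01
print(cosine_lr(100, 200, 0.01))  # ~0.005
```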
6. Empirical Results and Ablation
6.1 Classification (ModelNet40)
- Setup: $1024$ points, xyz coordinates only.
- Metrics: overall accuracy (OA), mean class accuracy (mAcc).
- Performance: OA of $93.52\%$ with the full CT-block; outperforms PointNet++ and matches PCT.
6.2 Segmentation (ShapeNetPart)
- Setup: $2048$ points, per-point labels over $50$ part categories.
- Result: Part-average IoU (pIoU) of $86.29\%$ with multi-scale inference; surpasses PointNet++ and matches PCT.
6.3 Ablation Study
Four variants (ModelNet40 OA / ShapeNetPart pIoU):
| Variant | ModelNet40 OA | ShapeNetPart pIoU |
|---|---|---|
| Conv-only | 91.82% | 85.23% |
| Transformer-only | 91.75% | 85.51% |
| No feature transmission | 92.59% | 85.70% |
| Full CT-block | 93.52% | 86.29% |
The feature transmission bridges confer measurable improvement in both accuracy and IoU.
6.4 Hyper-parameter Trade-offs
- Neighbor count $S$: $32$ is optimal for the FLOPs/accuracy balance.
- Embedding dimension $d$: $256$ is optimal; higher values risk overfitting and added cost.
7. Significance and Related Work
CTF-Net achieves joint local-global feature coupling in point clouds more effectively than single-paradigm backbones or sequential hybridization (e.g., PCT, 3DCTN (Lu et al., 2022)). The empirical results on ModelNet40 and ShapeNetPart demonstrate the practical advantage of bridging local detail and global context via lightweight, learnable feature transmission at each stage. Notably, competitive approaches such as 3DCTN also emphasize this principle, employing interleaved graph-convolutional and transformer modules, but lack explicit mutual feature guidance via dual-direction bridges.
CTF-Net's modularity, efficient computation, and accuracy suggest its deployment across a range of 3D understanding tasks, with the CT-block architecture providing a blueprint for future research in harmonic local-global feature learning (Guo et al., 2021).