
Rotationally Invariant Features (RIF)

Updated 24 November 2025
  • Rotationally Invariant Features (RIF) are characteristics that remain consistent under rotations, ensuring reliable 3D point cloud analysis.
  • They combine convolution-based local extraction and transformer-based global context to capture fine-grained geometry and long-range dependencies.
  • CTF-Net’s dual-direction feature transmission demonstrates how RIF can be integrated to achieve state-of-the-art accuracy in classification and segmentation tasks.

The Convolutional Transform Feature Network (CTF-Net) is a high-performance deep learning architecture for point cloud analysis that achieves simultaneous extraction and effective fusion of local and global features. Its design centers on the CT-block, a dual-branch module combining convolution-based local processing and transformer-based global relations, unified through learnable feature transmission bridges. CTF-Net is suited for tasks such as 3D shape classification and part segmentation, demonstrating state-of-the-art accuracy and efficiency by leveraging joint feature learning and judicious architectural modularity (Guo et al., 2021).

1. Motivation and Context

Point cloud data, defined as unordered sets of 3D points, encode both fine-grained local geometry and broader structural context. Prevailing local feature methods—including PointNet++ and graph convolutional networks—emphasize neighborhood aggregation but struggle to model long-range dependencies. Conversely, transformer-based attention models can encode global relationships, yet lack effective priors for local geometric detail. Empirical deficiencies in both regimes motivate networks that fuse these paradigms. The CT-block addresses this by coupling a convolutional branch for locality and a transformer branch for globality, with dual-direction feature transmission for mutual guidance and semantic bridging (Guo et al., 2021).

2. Architecture of the CT-block

The CT-block forms the atomic unit of CTF-Net, operating on two parallel feature streams at each stage: the local feature $F_\ell^{(i-1)} \in \mathbb{R}^{N_\ell \times C_\ell}$ and the global feature $F_g^{(i-1)} \in \mathbb{R}^{N_g \times C_g}$. It yields updated and fused representations $F_\ell^{(i)}$, $F_g^{(i)}$ through:

2.1 Convolution-branch (Local Feature Extraction)

  • Sampling and Grouping (SG): Farthest-point sampling (FPS) selects $N_\ell^\text{out}$ points; $S$ nearest neighbors are grouped for each, producing $F_1 \in \mathbb{R}^{N_\ell^\text{out} \times S \times C_\ell}$.
  • First MLP block ("conv₁"): Point-wise Linear $\circ$ BN $\circ$ ReLU transforms to $F_2 = \mathrm{LBR}_1(F_1) \in \mathbb{R}^{N_\ell^\text{out} \times S \times C_2}$.
  • Feature Transmission (global→local): Fused with the down-sampled global feature from the previous stage: $F_2' = F_2 + \mathrm{ft}_2(F_g^{(i-1)})$.
  • Second MLP block ("conv₂"): $F_3 = \mathrm{LBR}_2(F_2') \in \mathbb{R}^{N_\ell^\text{out} \times S \times C_\text{out}}$.
  • Max-Pooling: Aggregates over the neighbor dimension, $F_\ell^{(i)} = \max_{j=1,\dots,S} F_3[\cdot, j, \cdot] \in \mathbb{R}^{N_\ell^\text{out} \times C_\text{out}}$.
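As an illustrative sketch (not the paper's implementation), the sampling-grouping-pooling path of the convolution branch can be written in NumPy. `farthest_point_sampling`, `group_knn`, and the toy point-wise transform below are simplified stand-ins: no batch dimension, no batch norm, and coordinates double as features.

```python
import numpy as np

def farthest_point_sampling(points, n_out):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set."""
    chosen = [0]                                  # arbitrary seed point
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(n_out - 1):
        idx = int(np.argmax(dist))
        chosen.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(chosen)

def group_knn(points, centers_idx, s):
    """For each sampled center, gather its s nearest neighbors."""
    centers = points[centers_idx]                            # (n_out, 3)
    d = np.linalg.norm(points[None, :, :] - centers[:, None, :], axis=2)
    nbr_idx = np.argsort(d, axis=1)[:, :s]                   # (n_out, s)
    return points[nbr_idx]                                   # (n_out, s, 3)

rng = np.random.default_rng(0)
pts = rng.standard_normal((1024, 3))
idx = farthest_point_sampling(pts, 512)
grouped = group_knn(pts, idx, 32)      # F1: (512, 32, 3)
W = rng.standard_normal((3, 64))
F2 = np.maximum(grouped @ W, 0.0)      # toy point-wise linear + ReLU (BN omitted)
F_local = F2.max(axis=1)               # max over neighbors: (512, 64)
```

The max over the neighbor axis is what makes the aggregation permutation-invariant within each group.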

2.2 Transformer-branch (Global Feature Extraction via Offset-Attention)

  • Projection: Queries, keys, and values are computed as $[Q, K, V] = F_g^\text{in} [W_q, W_k, W_v]$, with $Q, K \in \mathbb{R}^{N_g \times d_a}$ and $V \in \mathbb{R}^{N_g \times d_e}$.
  • Attention Matrix: $\bar{A} = Q K^\top$.
  • Offset-Attention Normalization: Double softmax normalization:

$$\hat{\alpha}_{i,j} = \frac{\exp(\bar{\alpha}_{i,j})}{\sum_k \exp(\bar{\alpha}_{k,j})}, \qquad \alpha_{i,j} = \frac{\hat{\alpha}_{i,j}}{\sum_k \hat{\alpha}_{i,k}}$$

  • Context Aggregation: $F_a = A V$; a residual update follows:

$$F_g^\text{out} = \mathrm{LBR}(F_a - F_g^\text{in}) + F_g^\text{in}$$

  • Feature Transmission (local→global): Local features are upsampled, projected, and incorporated before QKV calculation.
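The offset-attention computation above can be sketched in NumPy. This is a minimal single-head sketch under stated simplifications: linear + ReLU stands in for the LBR block (no batch norm), and the weight matrices are hypothetical random initializations rather than trained parameters.

```python
import numpy as np

def offset_attention(F_g, Wq, Wk, Wv, Wo):
    """Offset-attention with the double (column-then-row) normalization."""
    Q, K, V = F_g @ Wq, F_g @ Wk, F_g @ Wv           # (N, d_a), (N, d_a), (N, d_e)
    A_bar = Q @ K.T                                  # raw attention scores (N, N)
    A_hat = np.exp(A_bar - A_bar.max(axis=0, keepdims=True))
    A_hat /= A_hat.sum(axis=0, keepdims=True)        # softmax over the first index
    A = A_hat / A_hat.sum(axis=1, keepdims=True)     # L1-normalize each row
    F_a = A @ V                                      # context aggregation (N, d_e)
    F_out = np.maximum((F_a - F_g) @ Wo, 0.0) + F_g  # offset residual; ReLU for LBR
    return F_out, A

rng = np.random.default_rng(1)
N, d_a, d_e = 128, 64, 256
F_g = rng.standard_normal((N, d_e))
Wq = rng.standard_normal((d_e, d_a)) * 0.1
Wk = rng.standard_normal((d_e, d_a)) * 0.1
Wv = rng.standard_normal((d_e, d_e)) * 0.1
Wo = rng.standard_normal((d_e, d_e)) * 0.1
F_out, A = offset_attention(F_g, Wq, Wk, Wv, Wo)
```

Note that after the double normalization each row of $A$ sums to one, so the aggregation is a convex combination of value vectors.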

2.3 Feature Transmission Elements

Two one-way mappings bridge local/global streams:

  • $\mathrm{ft}_1$ (local→global): upsample via distance-weighted interpolation, then linear + BN to match $C_g$.
  • $\mathrm{ft}_2$ (global→local): downsample global features to the sampled subset, then linear + BN to match $C_\text{out}$.

Each mapping aligns both point count and feature dimensionality, facilitating bi-directional information flow and semantic fusion.
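The distance-weighted interpolation used by $\mathrm{ft}_1$ can be sketched as inverse-distance weighting over the $k$ nearest sparse points; the function below is an illustrative NumPy stand-in (the $k=3$ choice and the linear + BN projection are omitted or assumed).

```python
import numpy as np

def interpolate_up(sparse_xyz, sparse_feat, dense_xyz, k=3, eps=1e-8):
    """Inverse-distance-weighted interpolation from a sparse to a dense point set."""
    d = np.linalg.norm(dense_xyz[:, None, :] - sparse_xyz[None, :, :], axis=2)
    nbr = np.argsort(d, axis=1)[:, :k]                     # k nearest sparse points
    w = 1.0 / (np.take_along_axis(d, nbr, axis=1) + eps)   # inverse-distance weights
    w /= w.sum(axis=1, keepdims=True)
    return (sparse_feat[nbr] * w[:, :, None]).sum(axis=1)

rng = np.random.default_rng(3)
sparse_xyz = rng.standard_normal((128, 3))
sparse_feat = rng.standard_normal((128, 64))
# Make the first 128 dense points coincide with the sparse set to check fidelity.
dense_xyz = np.concatenate([sparse_xyz, rng.standard_normal((384, 3))])
up = interpolate_up(sparse_xyz, sparse_feat, dense_xyz)    # (512, 64)
```

A dense point that coincides with a sparse point recovers (almost exactly) that point's feature, since its inverse-distance weight dominates.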

3. CTF-Net Backbone Construction

CTF-Net is realized by stacking $L=3$ CT-blocks.
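To make the stage-wise bookkeeping concrete, here is a minimal sketch assuming 1024 input points and an initial local width of 32 channels (matching the $32 \rightarrow 64 \rightarrow 128$ channel example in Section 5); each convolution-branch halves the point count and doubles the channels, while the transformer-branch keeps its embedding dimension fixed.

```python
# Assumed: 1024 input points, 32 channels after the initial local embedding.
N, C = 1024, 32
stages = []
for _ in range(3):          # L = 3 stacked CT-blocks
    N, C = N // 2, C * 2    # convolution-branch halves points, doubles channels
    stages.append((N, C))
print(stages)
```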

3.1 Classification Pipeline

  • Input: $F_\text{in} \in \mathbb{R}^{N \times C}$, typically $C=3$ (xyz coordinates).
  • Initial Embeddings: Two parallel branches: FPS downsampling followed by an MLP for the local stream ($F_\ell^{(0)}$), and an MLP for the global stream ($F_g^{(0)}$).
  • Stagewise CT-blocks: Each applies the convolution-branch (downsampling by a factor of $2$, doubling channels) and the transformer-branch (dimension preserved).
  • Heads: Local: max-pool $F_\ell^{(L)}$, then fully-connected layers; global: concatenate the $F_g^{(i)}$, max-pool, then fully-connected layers.
  • Loss: Dual cross-entropy; the two outputs are averaged at inference.
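The dual-head inference rule is simply an average of the two heads' class distributions. A minimal sketch with hypothetical logits (a batch of 4 shapes over the 40 ModelNet40 classes):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical logits from the local and global heads (random stand-ins).
rng = np.random.default_rng(2)
local_logits = rng.standard_normal((4, 40))
global_logits = rng.standard_normal((4, 40))

probs = 0.5 * (softmax(local_logits) + softmax(global_logits))
pred = probs.argmax(axis=1)     # final class prediction per shape
```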

3.2 Segmentation Pipeline

  • Encoder: Identical structure, up to LL CT-blocks.
  • Decoder: Upsample local features at each level, optional skip connections, final per-point heads; global features concatenated and passed through analogous global head.
  • Loss: Dual per-point cross-entropy; the two predictions are summed at inference.

4. Forward Pass Pseudocode

The core logic of a CT-block forward pass is:

def CT_block_forward(F_l_in, F_g_in):
    # Convolution-branch (local)
    sample_idx = FPS(F_l_in, N_l_out)              # farthest-point sampling
    F_grp = group_points(F_l_in, sample_idx, S)    # (N_l_out, S, C_l_in)
    F2 = LBR1(F_grp)                               # (N_l_out, S, C2)
    F_g_proj = ft2(F_g_in)                         # global→local bridge: (N_l_out, C2)
    F2p = F2 + broadcast(F_g_proj, dim=S)          # fuse along the neighbor axis
    F3 = LBR2(F2p)                                 # (N_l_out, S, C_out)
    F_l_out = max_pool(F3, dim="neighbors")        # (N_l_out, C_out)
    # Transformer-branch (global)
    F_loc_proj = ft1(F2)                           # local→global bridge: (N_g, C_g_in)
    Q, K, V = linear_qkv(F_g_in + F_loc_proj)
    A = normalize_offset_attention(Q, K)           # double-softmax normalization
    F_a = A @ V
    F_g_out = LBR(F_a - F_g_in) + F_g_in           # offset residual
    return F_l_out, F_g_out
(Guo et al., 2021)

5. Hyper-parameters and Training Protocols

  • CT-blocks: $L=3$ stacked stages.
  • Neighbors per group: $S=32$.
  • Transformer embedding/attention dimensions: $d_e=256$, $d_a=64$ (single-head).
  • Channels: $C_\ell$ doubles each stage, e.g., $32 \rightarrow 64 \rightarrow 128$.
  • Optimization: SGD with momentum $0.9$, initial learning rate $10^{-3}$ with cosine annealing.
  • Loss: Dual cross-entropy, one per head (global/local).
  • Augmentation: for classification, random $z$-rotation and jitter with $\sigma=0.02$; for segmentation, anisotropic scaling in $[0.8, 1.25]$.
  • Segmentation inference: single- and multi-scale (scales $[0.8{:}0.1{:}1.25]$).
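The cosine-annealed learning rate admits a closed form; the sketch below uses the standard schedule with the stated initial rate of $10^{-3}$, while the epoch budget and minimum rate are assumptions not given in the source.

```python
import math

def cosine_annealed_lr(epoch, total_epochs, lr_init=1e-3, lr_min=0.0):
    """Standard cosine annealing from lr_init down to lr_min over total_epochs.

    total_epochs and lr_min are illustrative assumptions, not values from the paper.
    """
    return lr_min + 0.5 * (lr_init - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))
```

For example, with a 200-epoch budget the rate starts at $10^{-3}$, passes $5 \times 10^{-4}$ at the halfway point, and decays to zero at the end.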

6. Empirical Results and Ablation

6.1 Classification (ModelNet40)

  • Setup: $1024$ points, xyz coordinates.
  • Metrics: overall accuracy (OA), mean class accuracy (mAcc).
  • Performance: OA $93.5\%$, mAcc $90.8\%$; outperforms PointNet++ ($91.9\%$ OA) and matches PCT ($93.2\%$ OA).

6.2 Segmentation (ShapeNetPart)

  • Setup: $2048$ points, per-point labels from $50$ parts.
  • Result: Part-average IoU (pIoU) of $86.5\%$ (multi-scale), surpassing PointNet++ ($85.1\%$) and matching PCT ($86.4\%$).
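Per-part IoU averaging can be sketched as follows; note that exact pIoU conventions vary across papers (e.g., per-shape averaging over the parts of that shape's category, then averaging over shapes), so this is a simplified per-shape version.

```python
import numpy as np

def part_average_iou(pred, gt, num_parts):
    """Mean IoU over part labels, skipping parts absent from both pred and gt."""
    ious = []
    for p in range(num_parts):
        inter = np.sum((pred == p) & (gt == p))
        union = np.sum((pred == p) | (gt == p))
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

# Tiny worked example: 4 points, 2 parts.
pred = np.array([0, 0, 1, 1])
gt = np.array([0, 1, 1, 1])
iou = part_average_iou(pred, gt, num_parts=2)   # (1/2 + 2/3) / 2
```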

6.3 Ablation Study

Four variants (ModelNet40 OA / ShapeNetPart pIoU):

| Variant | ModelNet40 OA | ShapeNetPart pIoU |
| --- | --- | --- |
| Conv-only | 91.82% | 85.23% |
| Transformer-only | 91.75% | 85.51% |
| No feature transmission | 92.59% | 85.70% |
| Full CT-block | 93.52% | 86.29% |

The feature transmission bridges confer measurable improvement in both accuracy and IoU.

6.4 Hyper-parameter Trade-offs

  • Neighbor count $S$: $32$ gives the best FLOPs/accuracy balance.
  • Embedding dimension $d_e$: $256$ is optimal; larger values risk overfitting and added cost.

CTF-Net achieves joint local-global feature coupling in point clouds more effectively than single-paradigm backbones or sequential hybridization (e.g., PCT, 3DCTN (Lu et al., 2022)). The empirical results on ModelNet40 and ShapeNetPart demonstrate the practical advantage of bridging local detail and global context via lightweight, learnable feature transmission at each stage. Notably, competitive approaches such as 3DCTN also emphasize this principle, employing interleaved graph-convolutional and transformer modules, but lack explicit mutual feature guidance via dual-direction bridges.

CTF-Net's modularity, efficient computation, and accuracy suggest its deployment across a range of 3D understanding tasks, with the CT-block architecture providing a blueprint for future research in joint local-global feature learning (Guo et al., 2021).

