State Space Models for Point Clouds
- State space models (SSMs) for point clouds adapt continuous-time linear dynamical systems to unordered 3D data, achieving efficient O(N) processing.
- Serialization techniques like space-filling curves and clustering convert 3D point sets into 1D sequences while preserving spatial proximity.
- Hybrid architectures combine global SSMs with local modules (e.g., graph pooling, convolution) to enhance segmentation, denoising, and spatio-temporal modeling.
State Space Models in Point Clouds
State Space Models (SSMs) have recently become a major backbone for processing point clouds, achieving linear complexity in the number of points while maintaining or surpassing the representational power of transformer-based architectures with their quadratic attention mechanisms. The central challenge is reconciling the unordered, irregular nature of point clouds with the inherently sequential processing required by SSMs. Recent innovations including serialization strategies, geometry-aware parameterizations, bidirectional recurrences, and hybrid local-global modules enable SSMs to serve as efficient and highly expressive architectures for point cloud classification, segmentation, upsampling, denoising, and spatio-temporal modeling.
1. State Space Model Frameworks for Point Clouds
The foundational SSM formulation used in modern point cloud backbones adapts the continuous-time linear dynamical system

$$\dot{h}(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t).$$

Discretization (commonly via zero-order hold) with step size $\Delta$ yields recurrence relations for sequence processing:

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t, \qquad \bar{A} = \exp(\Delta A), \quad \bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B.$$

Input-adaptive (selective) scanning as in Mamba parameterizes the matrices $B$, $C$, and the discretization step $\Delta$ as functions of the input features, allowing token-dependent dynamics and receptive fields (Liu et al., 2024). For fixed parameters, the recurrence is equivalent to a 1D convolution along the serialized point cloud with the learned kernel

$$\bar{K} = \left(C\bar{B},\; C\bar{A}\bar{B},\; \dots,\; C\bar{A}^{L-1}\bar{B}\right), \qquad y = x * \bar{K}.$$
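The equivalence between the sequential recurrence and the 1D-convolution view can be demonstrated with a minimal NumPy sketch (a toy single-channel SSM with fixed matrices, not any cited model's actual code):

```python
import numpy as np

# Toy discrete SSM: h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t, with hidden
# size d. For fixed (input-independent) matrices, the same output equals a
# causal 1D convolution of x with the kernel K_j = C A_bar^j B_bar.
rng = np.random.default_rng(0)
d, L = 4, 8                        # hidden state size, sequence length
A_bar = 0.9 * np.eye(d)            # toy discretized state matrix
B_bar = rng.normal(size=(d, 1))
C = rng.normal(size=(1, d))
x = rng.normal(size=L)             # serialized point features (1 channel)

# Recurrent form: sequential scan over the serialized sequence.
h = np.zeros((d, 1))
y_rec = np.zeros(L)
for t in range(L):
    h = A_bar @ h + B_bar * x[t]
    y_rec[t] = (C @ h).item()

# Convolutional form: materialize the kernel, then convolve causally.
K = np.array([(C @ np.linalg.matrix_power(A_bar, j) @ B_bar).item()
              for j in range(L)])
y_conv = np.array([sum(K[j] * x[t - j] for j in range(t + 1))
                   for t in range(L)])

assert np.allclose(y_rec, y_conv)  # both views produce identical outputs
```

Selective (Mamba-style) variants break this equivalence by making $B$, $C$, and $\Delta$ input-dependent, which is why they rely on hardware-efficient scans rather than a precomputed kernel.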
Unique adaptations to the point domain include bidirectional recurrences for enhanced context (Chen et al., 2024, Han et al., 2024), dual-scale or hierarchical SSM blocks (Zhang et al., 2024, Zhang et al., 17 Apr 2025), and hybrid local-global fusion modules to preserve local geometric inductive bias (Liu et al., 16 May 2025, Zhang et al., 17 Apr 2025).
2. Serialization and Geometric Ordering Strategies
SSMs require a 1D (causal) sequence, but point clouds lack any canonical ordering. Various strategies have been developed to serialize points while preserving spatial locality and geometric structure:
- Space-filling curves: Morton (z-order), Hilbert, and other curves preserve 3D proximity in the serialized sequence (Liu et al., 2024, Zhang et al., 2024, Köprücü et al., 2024). Key construction steps include octree partitioning and key assignment per point.
- Consistent Traverse Serialization (CTS): Axis-permuted “snake” voxelizations and six-way permutations for data augmentation yield complementary orderings, aided by order prompts (Zhang et al., 2024).
- Clustering and prompt-guided grouping: UST-SSM leverages feature-driven clustering followed by intra-cluster Hilbert ordering, enabling both semantic and geometric context to be preserved in spatio-temporal SSMs (Li et al., 20 Aug 2025).
- Proximity-based sorting: NIMBA enforces that each pair of consecutive tokens in the sequence are spatially proximal via iterative neighbor swaps, yielding invariance to rigid transformations and eliminating the need for positional encodings (Köprücü et al., 2024).
- Multi-path serialization: Aggregation of several orderings (e.g., Hilbert, z-order, axis-swaps) improves robustness and receptive field (Li et al., 2024, Zhang et al., 17 Apr 2025).
These techniques are essential for bridging the gap between set-structured point clouds and the required sequential dynamics of SSMs.
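The space-filling-curve approach can be illustrated with a short Morton (z-order) serializer (an illustrative sketch with our own function names, not the exact code of any cited work): coordinates are quantized to a voxel grid, the bits of the three indices are interleaved into a key, and points are sorted by key so that consecutive tokens tend to be spatial neighbors.

```python
import numpy as np

def part1by2(v: np.ndarray) -> np.ndarray:
    """Spread the low 10 bits of v so they occupy every third bit."""
    v = v.astype(np.uint64) & 0x3FF
    v = (v | (v << 16)) & 0x030000FF
    v = (v | (v << 8)) & 0x0300F00F
    v = (v | (v << 4)) & 0x030C30C3
    v = (v | (v << 2)) & 0x09249249
    return v

def morton_order(points: np.ndarray, bits: int = 10) -> np.ndarray:
    """Return the permutation that sorts points along a z-order curve."""
    lo, hi = points.min(0), points.max(0)
    grid = ((points - lo) / np.maximum(hi - lo, 1e-9)
            * (2**bits - 1)).astype(np.uint64)
    keys = (part1by2(grid[:, 0])
            | (part1by2(grid[:, 1]) << 1)
            | (part1by2(grid[:, 2]) << 2))
    return np.argsort(keys, kind="stable")

pts = np.random.default_rng(1).random((1024, 3))
order = morton_order(pts)          # 1D token ordering fed to the SSM
serialized = pts[order]
```

Sorting random points this way sharply reduces the average distance between consecutive tokens relative to an arbitrary ordering, which is exactly the locality property the SSM scan exploits.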
3. Local-Global Context Modeling and Hybrid Modules
While SSMs offer efficient global context aggregation, detailed local geometric structure is often diluted in serialization. Recent architectures explicitly restore (or couple) local and global context:
- Hybrid blocks: PillarMamba’s Hybrid State-space Block (HSB) interleaves large-receptive-field SSMs (on serialized BEV or voxel grids) with depth-wise convolution and squeeze-and-excitation for local enhancement and memory preservation (Zhang et al., 8 May 2025).
- Geometry-feature coupling: HyMamba introduces geometry-feature coupled pooling via differentiable, neighborhood-aware weighting and channel-wise fusion, overcoming the loss of explicit local relationships in pure sequence modeling (Liu et al., 16 May 2025).
- Local Norm Pooling and EdgeConv: Mamba3D uses k-NN graph-based local normalization and pooling, while 3DMambaIPF employs Dynamic EdgeConv for local geometry refinement during denoising (Han et al., 2024, Zhou et al., 2024).
- Dual or hierarchical SSMs: Voxel Mamba’s dual-scale block combines fine-grained and coarse SSMs via separate serializations on high- and downsampled grids, integrated via residuals (Zhang et al., 2024). Hierarchical schemes based on FPS sampling and KNN at multiple levels facilitate progressive learning from local to global context (Zhang et al., 17 Apr 2025).
Hybridization consistently improves descriptive power, reduces oversmoothing, and maintains fidelity to geometric inductive biases.
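The local-global hybridization pattern shared by these modules can be sketched as follows (a deliberately simplified illustration with our own names, standing in for the cited GFCP/LNP/HSB designs): a causal scan supplies global context, a k-NN pooling branch restores local geometry, and the two are fused residually.

```python
import numpy as np

def knn_local_pool(points, feats, k=8):
    """Average each point's features over its k nearest neighbors."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d2, axis=1)[:, :k]      # k nearest neighbors (incl. self)
    return feats[idx].mean(axis=1)

def global_scan(feats, decay=0.9):
    """Toy causal scan standing in for the SSM's global mixing."""
    out = np.zeros_like(feats)
    h = np.zeros(feats.shape[1])
    for t in range(len(feats)):
        h = decay * h + (1 - decay) * feats[t]
        out[t] = h
    return out

rng = np.random.default_rng(2)
pts, feats = rng.random((256, 3)), rng.normal(size=(256, 32))
fused = global_scan(feats) + knn_local_pool(pts, feats)   # residual fusion
```

Real architectures replace the toy scan with a selective SSM and the mean pooling with learned, geometry-aware weighting, but the residual coupling of a global sequential branch with a local neighborhood branch is the common structure.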
4. Spatio-Temporal and Video Point Cloud SSMs
For dynamic or 4D point clouds, SSMs are extended along both spatial and temporal axes:
- Disentangled spatial-temporal SSMs: MAMBA4D separates intra-frame spatial SSMs (on clips or tubes from anchor frames) and inter-frame temporal SSMs (on serialized anchor features), each with bidirectional or cross-temporal scans to maximize receptive field (Liu et al., 2024).
- Prompt-guided semantic ordering: UST-SSM’s Spatial-Temporal Selection Scanning (STSS) integrates point-level clustering prompts to form semantically-meaningful input sequences for long-range SSM modeling (Li et al., 20 Aug 2025).
- Aggregation and sampling: Temporal Interaction Sampling (TIS) strengthens fine-grained temporal modeling by including both anchor and non-anchor frames; Spatio-Temporal Structure Aggregation (STSA) adds a parallel 4D KNN module to recover missing geometric and motion detail.
These designs underpin state-of-the-art results on point cloud video understanding, with strong gains in efficiency—up to 87.5% GPU memory reduction and 5.36× inference speedup compared to quadratic-attention models (Liu et al., 2024, Li et al., 20 Aug 2025).
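The disentangled spatial-temporal factorization can be sketched in a few lines (our own simplification, not MAMBA4D's actual code): for a point cloud video tensor of shape (T, P, D), a spatial scan mixes points within each frame, then a temporal scan mixes each point slot across frames.

```python
import numpy as np

def causal_scan(x, decay=0.9):
    """Toy linear recurrence along axis 0, standing in for an SSM scan."""
    out = np.zeros_like(x)
    h = np.zeros(x.shape[1:])
    for t in range(x.shape[0]):
        h = decay * h + (1 - decay) * x[t]
        out[t] = h
    return out

rng = np.random.default_rng(3)
video = rng.normal(size=(16, 128, 32))       # (frames, points, channels)

spatial = np.stack([causal_scan(f) for f in video])   # within-frame mixing
temporal = causal_scan(spatial)                        # across-frame mixing
```

Factorizing the 4D scan this way keeps each recurrence short (P steps spatially, T steps temporally) instead of one monolithic T·P-length sequence, which is what makes the long-video receptive field tractable.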
5. Efficiency, Complexity, and Scalability
A decisive advantage of SSMs in point cloud modeling is their linear runtime and memory scaling:
- Complexity: Kernelized SSM blocks (usually with a fixed small kernel length $k$) scale as $O(k \cdot N \cdot D)$ for $N$ tokens, $D$ feature channels, and kernel length $k$ (Liu et al., 2024, Zhang et al., 2024). Self-attention remains $O(N^2 \cdot D)$.
- Hardware efficiency: Group-free serialization and parallel scan algorithms (e.g., Blelloch scan) enable batch-parallel SSM execution and facilitate custom CUDA kernels (Schöne et al., 2024).
- Empirical scaling: Models such as MBPU process 65K points on a single GPU, with negligible degradation in CD/F-score/P2F as the point count $N$ increases, outperforming both transformers and convolutional networks that crash or degrade well below this scale (Song et al., 2024).
- Memory: GPU memory usage is linear in the number of points $N$ in PointMamba and similar architectures, enabling deployment in large-scale or real-time systems (Liu et al., 2024, Liang et al., 2024, Zhang et al., 17 Apr 2025).
Efficiency gains are especially notable in point cloud denoising and upsampling pipelines, where SSMs with differentiable rendering loss outperform prior methods on both accuracy and scalability (Zhou et al., 2024, Song et al., 2024).
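The parallel-scan claim rests on a simple algebraic fact: the linear recurrence $h_t = a_t h_{t-1} + b_t$ composes associatively, so all prefixes can be computed in logarithmic depth. The sketch below (a simplified scalar version in pure Python, not a CUDA kernel) uses a Hillis-Steele scan with the segment-composition operator and checks it against the sequential recurrence:

```python
import numpy as np

def combine(e1, e2):
    """Associative composition of two recurrence segments:
    (a1, b1) o (a2, b2) = (a1*a2, a2*b1 + b2)."""
    a1, b1 = e1
    a2, b2 = e2
    return (a1 * a2, a2 * b1 + b2)

def parallel_scan(elems):
    """Hillis-Steele inclusive scan: log2(N) sweeps, each fully parallel."""
    n = len(elems)
    step = 1
    while step < n:
        elems = [elems[i] if i < step else combine(elems[i - step], elems[i])
                 for i in range(n)]
        step *= 2
    return elems

rng = np.random.default_rng(4)
N = 16
a = rng.uniform(0.5, 1.0, N)       # per-token decay (token-dependent in Mamba)
b = rng.normal(size=N)             # per-token input contribution

h_par = [h for _, h in parallel_scan(list(zip(a, b)))]

# Reference: the plain sequential recurrence with h_{-1} = 0.
h_seq, h = [], 0.0
for t in range(N):
    h = a[t] * h + b[t]
    h_seq.append(h)

assert np.allclose(h_par, h_seq)
```

Production kernels use the work-efficient Blelloch variant over vector-valued states and fuse it with the input-dependent parameterization, but the associativity shown here is what makes batch-parallel SSM execution possible at all.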
6. Experimental Results and Limitations
State space model backbones achieve state-of-the-art or competitive results across standard benchmarks:
| Method | ModelNet40 (%) | ScanObjectNN (%) | S3DIS (mIoU) | Notable Efficiency/Scale |
|---|---|---|---|---|
| PointMamba | 93.4 [C] | 92.7 (O) | 75.7 | 1.31G FLOPs; lower memory than PCT |
| HyMamba | 95.99 | 94.34 | – | Only 0.03M extra params (GFCP+CoFE) |
| Voxel Mamba | – | – | – | 79.6 L1 APH on Waymo; 27 FPS, 3.7GB GPU; group-free |
| Mamba3D | 95.1 (pretr.) | 92.6 | 85.7 | Linear, strong for few-shot |
| NIMBA | 92.1 | 89.06 | 84.36 | No positional encoding; invariant |
| Pamba | – | – | 77.6 | 296 ms/train, 183 ms/infer (ScanNet) |
| MBPU (upsample) | – | – | – | O(N), smooth scaling up to 65K points |
While SSMs match or surpass prior state-of-the-art models, several limitations remain:
- SSMs are sensitive to serialization order; suboptimal curves or scan patterns can reduce locality and performance (Liu et al., 2024, Zhang et al., 2024). Morton, Hilbert, and proximity-based orderings are all heuristic.
- Local geometries may require explicit hybridization modules (GFCP, LNP, EdgeConv, HSB), as SSMs alone are prone to over-smoothing or underfitting local signals (Liu et al., 16 May 2025, Han et al., 2024, Zhang et al., 8 May 2025).
- Some models (e.g., NIMBA) achieve permutation invariance only up to rigid transformations, not full set symmetry, though positional embeddings are not required (Köprücü et al., 2024).
- SSMs assume linear sequential recurrence; highly intricate topologies (e.g., tubes, vessels in medical shapes) or unordered groups may challenge single-pass processing (Zhang et al., 17 Apr 2025).
7. Future Directions
Several promising research avenues are outlined:
- Learned or dynamic scan ordering: Designing data-dependent or task-optimized serialization could surpass fixed space-filling curves (Liu et al., 2024, Zhang et al., 2024, Liu et al., 16 May 2025).
- Self-supervised and generative modeling: Extending SSMs to generative or pretext objectives for points (e.g., masked autoencoding, geometry synthesis) is largely unexplored (Han et al., 2024, Zhang et al., 17 Apr 2025).
- 4D and multimodal fusion: Integrating SSMs for multi-sensor and temporal scenarios — extending the recurrence over both space and time (or other modalities) (Liu et al., 2024, Li et al., 20 Aug 2025, Zhang et al., 8 May 2025).
- Sparse and scalable variants: Direct sparse SSMs for raw point sets or hybrid CNN–SSM architectures for further efficiency (Zhang et al., 8 May 2025, Zhang et al., 2024).
- Enhanced local modules: Further refinement of hybrid/geometry-aware pooling, especially for fine-scale, high-curvature or topologically complex regions (Liu et al., 16 May 2025, Han et al., 2024).
This body of work demonstrates that SSMs, when carefully serialized and hybridized with local modules, represent a generic, scalable, and competitive backbone for a wide range of point cloud processing tasks, including in high-throughput and large-scale settings (Liu et al., 2024, Zhang et al., 2024, Köprücü et al., 2024).