
MambaFormer Block Architecture

Updated 29 December 2025
  • MambaFormer Block is a modular unit that blends state space models with convolutional, attention, or hybrid layers to capture global and local information.
  • It employs multi-path serialization techniques, using space-filling curves, to transform unordered 3D point clouds into ordered sequences for efficient processing.
  • The design achieves linear computational complexity and improved performance on tasks such as 3D segmentation, compared with traditional quadratic attention mechanisms.

A MambaFormer Block is a modular architectural unit that integrates State Space Model (SSM) operations, specifically Mamba-style selective scan modules, with task-specific logic such as convolutional, attention, or hybrid interaction layers. These blocks have emerged as a key building block in contemporary research to enable efficient long-range modeling in domains where classic Transformer attention is computationally impractical or poorly aligned with data structure, such as 3D point clouds, multiscale vision, or spatially unordered modalities. The precise block structure varies by application, but common characteristics include an SSM core for global modeling, optional local context modules (e.g., sparse convolution or region-local SSMs), serialization or regionization logic for token sequences, and streamlined residual/normalization pipelines. This entry focuses primarily on the ConvMamba (“MambaFormer”) block as formalized for 3D point cloud processing in "Pamba: Enhancing Global Interaction in Point Clouds via State Space Model" (Li et al., 2024).

1. State Space Model Fundamentals in the MambaFormer Block

At the core of the MambaFormer Block is a continuous-time SSM of the form

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t),$$

where $x(t) \in \mathbb{R}^d$ is the input at sequence position $t$, $h(t) \in \mathbb{R}^r$ is the hidden state, and $y(t) \in \mathbb{R}^c$ is the output. $A$, $B$, and $C$ are learnable matrices.

For discrete input sequences, Mamba uses zero-order hold discretization:

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\exp(\Delta A) - I)\,A^{-1}B, \qquad \bar{C} = C.$$

This yields a causal 1D convolution

$$y_t = \sum_{k=0}^{t-1} K[k]\, x_{t-k}, \qquad K = [\,C\bar{B},\; C\bar{A}\bar{B},\; \ldots,\; C\bar{A}^{\,t-1}\bar{B}\,].$$

In practice, Mamba blocks augment this with selective gating, so the kernel weights are modulated by the input; the complexity remains $\mathcal{O}(Nrd)$ rather than quadratic in sequence length.
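The zero-order-hold recurrence above can be sketched for a toy single-channel SSM with diagonal $A$, as commonly used in Mamba-style models. This is a minimal illustration, not the fused selective-scan kernel, and it holds $\Delta$ constant, whereas selective gating makes $\Delta$, $B$, and $C$ input-dependent:

```python
import numpy as np

def discretize_diag(A, B, delta):
    """Zero-order hold for a diagonal A (elementwise):
    A_bar = exp(delta*A), B_bar = (exp(delta*A) - 1) / A * B."""
    A_bar = np.exp(delta * A)
    B_bar = (A_bar - 1.0) / A * B
    return A_bar, B_bar

def ssm_scan(x, A, B, C, delta):
    """Causal recurrence h_t = A_bar*h_{t-1} + B_bar*x_t, y_t = C . h_t."""
    A_bar, B_bar = discretize_diag(A, B, delta)
    h = np.zeros_like(A)
    y = np.empty(len(x))
    for t, x_t in enumerate(x):
        h = A_bar * h + B_bar * x_t   # state update: O(r) per step
        y[t] = C @ h                  # readout
    return y

# Toy usage: stable (negative real) poles, random input sequence.
rng = np.random.default_rng(0)
r = 8
A = -np.abs(rng.standard_normal(r)) - 0.1
B, C = rng.standard_normal(r), rng.standard_normal(r)
y = ssm_scan(rng.standard_normal(32), A, B, C, delta=0.1)
```

The scan costs $\mathcal{O}(Nr)$ per channel ($\mathcal{O}(Nrd)$ over $d$ channels), which is the linear scaling the block relies on.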

2. 1D Serialization of Unordered 3D Point Clouds

A prerequisite for SSM operation is a well-defined sequence. Point clouds, being unordered sets $P = \{p_i = (x_i, y_i, z_i)\}$, must be serialized. The ConvMamba block as deployed in Pamba (Li et al., 2024) employs multi-path serialization:

  • Space-filling curves: Hilbert (XYZ, YXZ) and Z-order (Morton code, XYZ, YXZ).
  • Procedure: For each pip_i, compute voxel grid coordinates, map to a scalar key via the chosen curve, and sort points accordingly.
  • Mixing: At each block, select a random serialization according to a specified ratio to minimize ordering bias and artifacts.
  • Deserialization: After SSM, outputs are reordered to original spatial arrangement for further stages.

Pseudocode for serialization:

def serialize(P, curveType, voxelSize):
    # Shift to non-negative voxel coordinates before key computation.
    minCorner = P.min(axis=0)
    N = len(P)
    key = [0] * N
    for i in range(N):
        (u, v, w) = floor((P[i] - minCorner) / voxelSize)
        # Hilbert or Z-order (Morton) encoding, per the selected curve.
        key[i] = spaceFillingKey((u, v, w), curveType)
    idx = argsort(key)
    X_seq = [P[i] for i in idx]
    return X_seq, idx
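For the Z-order path, a concrete spaceFillingKey can be realized by bit-interleaving the voxel coordinates into a Morton code. Below is the standard 63-bit construction for 21-bit coordinates (the Hilbert-curve variant requires a more involved state-machine encoding and is omitted here):

```python
def part1by2(n: int) -> int:
    """Spread the low 21 bits of n so each bit is followed by two zeros."""
    n &= (1 << 21) - 1
    n = (n | (n << 32)) & 0x1F00000000FFFF
    n = (n | (n << 16)) & 0x1F0000FF0000FF
    n = (n | (n << 8))  & 0x100F00F00F00F00F
    n = (n | (n << 4))  & 0x10C30C30C30C30C3
    n = (n | (n << 2))  & 0x1249249249249249
    return n

def morton_key(u: int, v: int, w: int) -> int:
    """Interleave the bits of (u, v, w) into a single Z-order key."""
    return part1by2(u) | (part1by2(v) << 1) | (part1by2(w) << 2)
```

Sorting points by `morton_key` keeps spatially adjacent voxels close together in the resulting 1D sequence, e.g. `morton_key(1, 0, 0) == 1`, `morton_key(0, 1, 0) == 2`, `morton_key(0, 0, 1) == 4`.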

3. ConvMamba Block Architecture: Local-Global Two-Stage Pipeline

The ConvMamba block structure consists of two sequential aggregation stages:

  • Local Aggregation
    • Input: $F_{\mathrm{in}} \in \mathbb{R}^{N \times C}$ as a sparse 3D tensor via voxelization.
    • Operations: one or more $3 \times 3$ submanifold sparse convolution layers (stride 1), with BatchNorm and ReLU.
    • Output: $F_{\mathrm{loc}}$ (optionally skip-connected with $F_{\mathrm{in}}$).
  • Global Aggregation
    • LayerNorm applied to $F_{\mathrm{loc}}$, yielding $U \in \mathbb{R}^{N \times C}$.
    • Multi-path serialization: $[U_{\mathrm{seq}},\, \mathrm{idx}] = \mathrm{serialize}(U,\ \mathrm{curveSelector}())$.
    • Bidirectional Mamba SSM: forward and backward causal scans, outputs averaged.
    • Scatter outputs back to original indices.
    • Residual connection: $F_{\mathrm{ssm}} = U + Y$.
    • Two-layer MLP (pre-norm): $F_{\mathrm{mlp}} = F_{\mathrm{ssm}} + \mathrm{MLP}(\mathrm{LayerNorm}(F_{\mathrm{ssm}}))$.
    • Final residual: $F_{\mathrm{out}} = F_{\mathrm{in}} + F_{\mathrm{mlp}}$.

Forward pass pseudocode:

def ConvMambaBlock(F_in, curveSelector):
    # Local aggregation
    F_loc = F_in
    for i in range(L_local): # L_local = 2 typically
        F_loc = SparseConv3x3(F_loc)
        F_loc = BatchNorm(F_loc)
        F_loc = ReLU(F_loc)
    F_loc = F_loc + F_in

    # Global SSM
    U = LayerNorm(F_loc)
    U_seq, idx = serialize(U, curveSelector())
    Y_fwd = MambaSSM(U_seq)
    Y_bwd = reverse(MambaSSM(reverse(U_seq)))
    Y_seq = (Y_fwd + Y_bwd) / 2
    Y = scatter_back(Y_seq, idx)
    F_ssm = U + Y

    # MLP
    F_mlp = F_ssm + MLP(LayerNorm(F_ssm))
    return F_in + F_mlp
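The bidirectional scan-and-average and the deserialization steps can be isolated as follows. This is a sketch; the `ssm` argument stands in for any causal sequence operator, and the helper names are illustrative, not from the paper:

```python
import numpy as np

def bidirectional_ssm(ssm, U_seq):
    """Run a causal scan in both directions and average the outputs."""
    y_fwd = ssm(U_seq)                 # forward causal scan
    y_bwd = ssm(U_seq[::-1])[::-1]     # backward scan, realigned
    return 0.5 * (y_fwd + y_bwd)

def scatter_back(Y_seq, idx):
    """Undo the serialization sort: output row idx[t] receives Y_seq[t]."""
    Y = np.empty_like(Y_seq)
    Y[idx] = Y_seq
    return Y
```

With `U_seq = U[idx]` produced by serialization, `scatter_back(Y_seq, idx)` restores the original point order, so the residual $F_{\mathrm{ssm}} = U + Y$ is index-aligned.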

4. Computational Complexity and Empirical Analysis

In terms of computational scaling:

| Block Type | Complexity | Typical Runtime (ScanNet200, $N \approx 1.5 \times 10^5$, $C = 256$) | Memory Use |
|---|---|---|---|
| Transformer | $\mathcal{O}(N^2 C)$ | 357 ms (train), 183 ms (infer) | 9.5 GB (train), 9.3 GB (infer) |
| ConvMamba (Pamba) | $\mathcal{O}(NC)$ | 296 ms (train), 120 ms (infer) | 5.2 GB (train), 4.8 GB (infer) |

The linear-time global aggregation and efficient local convolution permit global context modeling over entire large-scale scenes, with substantially reduced memory and runtime relative to attention-based methods (Li et al., 2024).
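The scaling gap can be made concrete with a back-of-the-envelope operation count at the scene size quoted above. The constants are illustrative only, and the SSM state size r = 16 is an assumed value, not a figure from the paper:

```python
# Back-of-the-envelope operation counts at ScanNet200 scale.
# r (SSM state size) = 16 is an assumed illustrative value.
N, C, r = 150_000, 256, 16

attention_ops = N * N * C   # quadratic pairwise interactions
ssm_ops = N * r * C         # linear selective-scan cost

print(f"attention/SSM op ratio: {attention_ops // ssm_ops}x")  # → 9375x
```

The ratio is simply $N/r$, which is why the advantage grows with scene size.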

5. Integration, Variants, and Generalizations

The ConvMamba block is the canonical MambaFormer block tailored to point cloud domains. The general design principle, splitting modeling into local (often convolutional) and global (SSM) steps and choosing a serialization or tokenization suited to the data, has been adapted across variants:

  • In Pamba (Li et al., 2024), serialization is designed for unordered 3D point sets.
  • In image or burst vision, line and flow-based serialization adapts the SSM to grids or motion fields.
  • In multimodal or voxelized domains, hybrid block gates or reshuffles balance cross-modal or spatio-temporal flows.

A plausible implication is that successful deployment of SSM-based global modeling depends as much on the serialization and local-to-global bridging logic as on the choice of SSM dynamics or gating itself.

6. Empirical Impact in 3D Scene Understanding

The ConvMamba block enabled the Pamba framework to set state-of-the-art results on several 3D point cloud segmentation benchmarks, such as ScanNet v2, ScanNet200, S3DIS, and nuScenes (Li et al., 2024). The ability to scale global modeling to N105N\sim 10^5 points per scene with linear time and memory cost is a critical technical advancement relative to previous quadratic-attention architectures, which were computationally prohibitive for whole-scene processing. The multi-path serialization strategy is also integral, as it mitigates ordering-induced bias inherent to SSMs on unordered data.

7. Contextual Significance and Relation to Other MambaFormer Instantiations

While the ConvMamba block is rooted in point cloud processing, the MambaFormer paradigm underpins a diverse family of architectures sharing the SSM-centric, hybrid local-global block structure. The table below highlights the distinguishing features of several MambaFormer block variants (not exhaustive):

| Domain | Local Branch / Tokenization | Global Branch | Serialization / Order Logic |
|---|---|---|---|
| Point Cloud (Pamba) | Sparse 3D convolution | Mamba SSM | 3D space-filling curve mix |
| Image/Video | Patch or convolutional grouping | Mamba SSM | Row/column, flow paths |
| Multimodal (fusion) | Modality-aligned pre-projection | SSM (multi-modal) | Modality/token sequence |

This suggests ongoing research is converging on block-level SSM + local context hybrids as a foundational mechanism across vision, scene understanding, and multimodal integration, provided that order-preserving serialization or grouping is carefully matched to the target domain.

References (1)

  • Li et al. (2024). Pamba: Enhancing Global Interaction in Point Clouds via State Space Model.