Hierarchical Grouped-MLP Fusion
- Hierarchical grouped-MLP fusion is a neural network design paradigm that partitions high-dimensional inputs into groups and employs hierarchically structured MLP mixing to capture multi-level interactions.
- It leverages learnable grouping, blockwise local mixing, and pyramid-like aggregation to drastically reduce parameter counts and computational costs compared to dense MLPs.
- This approach has shown empirical success in vision, sequence modeling, tabular data, and multimodal tasks, achieving state-of-the-art results with efficient resource use.
Hierarchical grouped-MLP fusion refers to a family of neural network design paradigms in which multilayer perceptrons (MLPs) are hierarchically organized and fused across feature, spatial, or modal groupings, often with learnable grouping, pooling, or grouping-permutation strategies. This class includes architectures such as Group-Connected MLPs, CS-Mixer, Butterfly/Dimension Mixers, and gMLP-based hierarchical vision models, each leveraging structured group-wise mixing and hierarchical pooling or merging to increase efficiency, expressive power, or multi-level interaction—all without global, fully dense connections. Core to this paradigm are learnable or structured grouping operations, local groupwise MLP interactions, and the aggregation of higher-order dependencies via hierarchical tree-like or pyramid mechanisms.
1. Core Principles of Hierarchical Grouped-MLP Fusion
The central principle underlying hierarchical grouped-MLP fusion is partitioning a high-dimensional input (features, tokens, modalities, or spatial locations) into groups, applying trainable MLP-based mixing or gating within each group, and hierarchically merging or permuting groups at deeper layers. This process can be realized with:
- Learned or structured group assignments: Via softmaxed grouping matrices with entropy regularization, fixed windowing, or pre-specified blockings.
- Local group-wise MLP fusion: MLPs operate only on group-contained features or tokens, reducing parameter and computational complexity.
- Hierarchical aggregation: Deeper layers merge, pool, or mix outputs from child groups, forming tree or pyramid hierarchies capturing high-order dependencies.
- Cross-scale and multi-level fusion: Multi-stage or skip-connected groupwise fusions enable both fine and global interactions.
These strategies contrast directly with the all-to-all connectivity of traditional dense MLPs, offering efficiency, modularity, and, when groups and their merging are engineered or learned judiciously, state-of-the-art performance in domains ranging from tabular data to large-scale vision tasks (Kachuee et al., 2019, Cui et al., 2023, Sapkota et al., 2023, Go et al., 2022).
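The basic recipe above, partition into groups and mix each group with its own small MLP, can be sketched in a few lines. The following is a minimal NumPy illustration with random matrices standing in for learned weights; it is not a reproduction of any cited architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def group_mlp_mix(x, num_groups):
    """Partition a feature vector into groups and mix each group with its
    own two-layer MLP (random weights stand in for learned parameters).
    With hidden size 2*(d/G) per group, the grouped layers hold 4*d^2/G
    weights versus 4*d^2 for an equivalent dense d -> 2d -> d MLP,
    a G-fold parameter reduction."""
    d = x.shape[-1]
    assert d % num_groups == 0, "feature dim must divide evenly into groups"
    gsize = d // num_groups
    hidden = 2 * gsize
    groups = x.reshape(num_groups, gsize)
    out = np.empty_like(groups)
    for g in range(num_groups):
        w1 = rng.standard_normal((gsize, hidden)) * 0.1
        w2 = rng.standard_normal((hidden, gsize)) * 0.1
        out[g] = np.maximum(groups[g] @ w1, 0.0) @ w2  # per-group ReLU MLP
    return out.reshape(d)

x = rng.standard_normal(16)
y = group_mlp_mix(x, num_groups=4)
print(y.shape)  # (16,)
```

Hierarchical variants then merge or permute these group outputs at deeper layers, as described in the following sections.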
2. Learned Grouping and Local Mixing Mechanisms
Several methodologies exploit learnable grouping for fusing input features:
- Soft Grouping Matrices: In Group-Connected MLPs (GMLP), each layer uses a soft grouping matrix parameterized by logits and a temperature τ. As τ → 0, each feature is essentially assigned to a single group, promoting hard partitions. An entropy penalty is added to the loss, balancing sharp assignments against smooth optimization (Kachuee et al., 2019).
- Blockwise and Windowed Local MLPs: In gSwin, the spatial feature map is partitioned into non-overlapping or shifted windows, with each window processed by a windowed spatial gating unit (SGU) that mixes tokens/grouped locations via an MLP and gating mechanism. The SGU’s gating matrix is block-diagonal over all windows, keeping parameter growth in check (Go et al., 2022).
- Butterfly/Dimension Groupings: Hierarchically permuted groupings, inspired by the FFT's butterfly structure, perform blockwise MLP mixing with a new group structure at each stage, such that after O(log N) stages all coordinates can communicate globally. These groupings are determined by radix-r representations of indices and systematic permutations (Sapkota et al., 2023).
- Low-Rank and Grouped Spatio-Channel Mixing: CS-Mixer splits spatial tokens into groups in both local and global configurations and applies low-rank, multi-head MLP mixing in each, capturing interactions across height, width, and channels jointly via grouped transforms (Cui et al., 2023).
These mechanisms impose and exploit structural sparsity and partitioning, enabling scalable, expressive mixing and group-based regularization.
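The GMLP-style soft grouping mechanism can be made concrete. The sketch below assumes a temperature-scaled softmax over per-feature group logits and a mean-entropy penalty; it follows Kachuee et al. (2019) only loosely, and the exact parameterization details are an assumption:

```python
import numpy as np

rng = np.random.default_rng(1)

def soft_grouping(logits, temperature):
    """Temperature-scaled softmax over groups for each feature.
    Rows index features, columns index groups. As temperature -> 0,
    each row approaches a one-hot vector, i.e. a hard assignment."""
    z = logits / temperature
    z -= z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def assignment_entropy(assign, eps=1e-12):
    """Mean per-feature entropy of the assignment rows; adding this as a
    penalty trades off sharp assignments against smooth optimization."""
    return float(-(assign * np.log(assign + eps)).sum(axis=1).mean())

logits = rng.standard_normal((8, 4))  # 8 features, 4 groups
soft = soft_grouping(logits, temperature=1.0)
hard = soft_grouping(logits, temperature=0.05)
print(assignment_entropy(soft) > assignment_entropy(hard))  # True
```

Lowering the temperature sharpens the rows toward one-hot assignments, which is exactly the regime the entropy penalty pushes toward during training.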
3. Hierarchical Fusion and Aggregation Structures
Hierarchical grouped-MLP fusion leverages tree- or pyramid-like aggregation, enabling:
- Tree-like Pooling Hierarchies: After groupwise MLP mixing, GMLP merges groups in a binary-tree fashion using mean, max, or linear pooling operators. With each pooling level, the number of groups halves, increasing group capacity at coarser levels. The top-level concatenation feeds into a final output MLP (Kachuee et al., 2019).
- Pyramidal Staging in Vision: Models like gSwin and CS-Mixer use multi-stage spatial downsampling (via patch merging or pooling) to form spatial and channel pyramids. This yields a multiscale hierarchy akin to CNN backbones but relies on grouped MLP or gating fusion within each stage (Go et al., 2022, Cui et al., 2023).
- Repeated Permutation-Hierarchies (Butterfly): Dimension Mixer/Butterfly MLPs recursively permute and hierarchically regroup dimensions according to radix-decompositions, ensuring efficient all-to-all communication with O(log N) mixing depth, and groupwise local MLPs per permutation stage (Sapkota et al., 2023).
- Dense Stacking Across Modalities: In dense multimodal fusion, shared MLP fusion layers are stacked between unimodal subnetworks at every depth, with skip and dense connections among all shared fusion points, allowing multi-level cross-modal supervision and gradient propagation (Hu et al., 2018).
These hierarchical strategies ensure that higher-level units integrate increasingly global contextual information, either spatially, across features, or across modalities.
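The butterfly regrouping schedule and its logarithmic-depth all-to-all guarantee can be checked with a small simulation. The sketch below (plain Python, illustrative rather than taken from the cited paper) tracks which coordinates' information can reach which others as the stages proceed:

```python
def butterfly_blocks(n, radix, stage):
    """Partition range(n) into blocks of size `radix`: two indices share
    a block at `stage` iff their base-`radix` digits agree everywhere
    except at digit position `stage` (the FFT butterfly grouping)."""
    width, m = 0, 1
    while m < n:
        m *= radix
        width += 1
    blocks = {}
    for i in range(n):
        digits, x = [], i
        while x:
            digits.append(x % radix)
            x //= radix
        digits += [0] * (width - len(digits))  # pad to full digit width
        key = tuple(d for p, d in enumerate(digits) if p != stage)
        blocks.setdefault(key, []).append(i)
    return list(blocks.values())

# Simulate information flow: within each block a local MLP mixes all
# members, so their "reachable" sets merge; after log_r(n) stages every
# coordinate has (indirectly) seen every other.
n, radix = 16, 2
reach = [{i} for i in range(n)]
stages, m = 0, 1
while m < n:
    for block in butterfly_blocks(n, radix, stages):
        merged = set().union(*(reach[i] for i in block))
        for i in block:
            reach[i] = merged
    stages += 1
    m *= radix
print(stages, all(len(r) == n for r in reach))  # 4 True
```

With n = 16 and radix 2, the simulation confirms full reachability after exactly log2(16) = 4 stages, matching the O(log N) mixing depth claimed above.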
4. Computational and Memory Efficiency
Hierarchical grouped-MLP fusion enables dramatic reductions in parameter count and FLOPs compared to dense MLPs:
- Complexity Scaling: For fixed per-block hidden size, the Butterfly/Dimension Mixer needs O(N log N) total parameters versus O(N²) for a dense mixer, with FLOPs scaling the same way. Empirically, 4× lower MACs and parameter count can be attained at minimal performance loss (e.g., CIFAR-10 with 144K params and 5M MACs for Butterfly vs. 352K/20M for dense, with <1% accuracy drop) (Sapkota et al., 2023).
- Groupwise and Block-diagonal Efficiency: In gSwin, the gating weight matrix is block-diagonal over windows, so its size scales as (M²)² per window of side M rather than (HW)² globally, a substantial saving when the feature map H × W is large (Go et al., 2022).
- Low-Rank Grouped Mixing: CS-Mixer's low-rank, grouped mixing keeps the per-layer parameter cost proportional to the projection rank times the grouped dimensions rather than to their full joint product, avoiding the explosion of naïve 3-axis full mixing across height, width, and channels (Cui et al., 2023).
- Sparse Activation Storage: With nearly one-hot group assignment matrices, GMLP's memory for storing assignments scales linearly with the number of groups rather than quadratically with the feature dimension (Kachuee et al., 2019).
A plausible implication is that such architectures are especially advantageous for large input spaces, long sequences, or high-resolution visual inputs, where dense all-to-all mixing is prohibitive or over-parameterized.
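Illustrative back-of-the-envelope counts make the scaling concrete. The dimensions below are invented for illustration and are not configurations from the cited papers:

```python
# Parameter counts for mixing along a length-N token/feature axis.
# All sizes are made up for illustration, not taken from the cited papers.

N = 4096          # tokens (or features) to mix
r = 8             # butterfly block size (radix)

dense = N * N     # one dense N x N mixing matrix

stages, m = 0, 1  # log_r(N) butterfly stages, computed with integers
while m < N:
    m *= r
    stages += 1
butterfly = stages * (N // r) * (r * r)  # (N/r) blocks of r x r per stage

M = 7                            # window side for block-diagonal gating
window_gate = (M * M) ** 2       # one (M^2 x M^2) gating block

print(f"dense: {dense:,}")              # 16,777,216
print(f"butterfly: {butterfly:,}")      # 4 * 512 * 64 = 131,072
print(f"per-window gate: {window_gate:,}")  # 2,401
```

Even at this toy scale, the butterfly scheme is two orders of magnitude smaller than dense mixing, and a single windowed gating block is smaller still.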
5. Empirical Performance and Domain Applications
Hierarchical grouped-MLP fusion has been applied and evaluated across several domains:
- Tabular and Non-structured Data: GMLP attains state-of-the-art classification performance on tabular and mixed-feature datasets, automatically discovering expressive feature subsets and interactions (Kachuee et al., 2019).
- Computer Vision:
- gSwin (“gated MLP in shifted window hierarchy”) achieves higher accuracy than equivalently sized Swin Transformers on ImageNet-1K (+0.4%), COCO detection (+0.45 box-AP) and ADE20K (+1.8 mIoU), with ~20% fewer parameters (Go et al., 2022).
- CS-Mixer demonstrates ImageNet-1k top-1 accuracy of 83.2% at 13.7 GFLOPs and 94M parameters (CS-Mixer-L), surpassing same-compute 2-axis mixers by >1.5% and matching or exceeding axial/H–C mixers with much lower compute (Cui et al., 2023).
- Sequence Modeling and Efficient Attention: Butterfly-MLP and Butterfly Attention in Vision Transformers and LRA tasks provide sub-quadratic scaling (e.g., 20% lower memory, up to +2% accuracy versus dense attention) and the ability to handle sequence lengths of 16K and beyond (Sapkota et al., 2023).
- Multimodal Fusion: Dense Multimodal Fusion (DMF) leverages hierarchically layered shared MLP fusion across modalities, resulting in richer joint representations and accelerated convergence, particularly when one modality is noisy or absent (Hu et al., 2018).
A plausible implication is that hierarchical grouped-MLP fusion enhances robustness and generalization, particularly in settings with heterogeneous input structure or multi-resolution dependencies.
6. Structural Attributes and Generalization Extensions
Hierarchical grouped-MLP fusion subsumes and generalizes numerous classical and contemporary ideas:
- Permutation and Grouping Flexibility: The index-based groupings in Butterfly/Dimension Mixers are theoretically universal for invertible transformations, recovering the full connectivity of dense MLPs as group size and depth scale (Sapkota et al., 2023).
- Multi-head and Multi-modality: Grouped mixing can be extended with multi-head gating (as in gSwin) or with task-specific grouping functions for multi-task or cross-modal learning (Go et al., 2022, Kachuee et al., 2019).
- Shifted and Shift-free Grouping: Alternation of standard and cyclically shifted window partitions in vision ensures eventual full neighborhood interaction across spatial locations (Go et al., 2022).
- Low-Rank and Dynamic Mixers: Low-rank grouped mixing (CS-Mixer) balances capacity and efficiency, with the rank and group size as critical hyperparameters for trade-offs in expressive power versus compute (Cui et al., 2023).
- Generalization to Nonlinear and Hierarchical Domains: The same principles extend across domains—tabular, sequential, visual, and multimodal—with appropriate design of groupings, local mixing blocks, and hierarchical aggregation.
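As an illustration of the low-rank grouped-mixing idea, the sketch below projects each channel group down to a small rank, mixes there, and projects back. It is loosely CS-Mixer-flavored only; the factorization shown and the random stand-in weights are assumptions, not the paper's exact transform:

```python
import numpy as np

rng = np.random.default_rng(2)

def low_rank_group_mix(x, num_groups, rank):
    """Low-rank grouped mixing sketch: each group of g = d/G channels is
    projected to `rank` dims, mixed, and projected back. Per group this
    costs g*r + r*r + r*g weights instead of g*g for a full mixing
    matrix, so small ranks trade capacity for efficiency."""
    d = x.shape[-1]
    assert d % num_groups == 0
    g = d // num_groups
    out = np.empty_like(x)
    for i in range(num_groups):
        seg = x[i * g:(i + 1) * g]
        down = rng.standard_normal((g, rank)) * 0.1   # project down
        mix = rng.standard_normal((rank, rank)) * 0.1  # mix in low-rank space
        up = rng.standard_normal((rank, g)) * 0.1     # project back up
        out[i * g:(i + 1) * g] = seg @ down @ mix @ up
    return out

x = rng.standard_normal(64)
y = low_rank_group_mix(x, num_groups=4, rank=4)
print(y.shape)  # (64,)
```

Here the rank and group count are exactly the hyperparameters the text identifies as governing the expressiveness-versus-compute trade-off.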
7. Representative Architectures: Comparison Table
Below is a summary contrasting representative hierarchical grouped-MLP fusion models and their key attributes:
| Model | Grouping Strategy | Hierarchical Fusion | Core Domain |
|---|---|---|---|
| GMLP (Kachuee et al., 2019) | Learnable, sparse softmax groups | Binary-tree pooling after group MLPs | Tabular, general |
| gSwin (Go et al., 2022) | Non-overlapping and shifted windows | Multi-stage spatial pyramid | Vision |
| CS-Mixer (Cui et al., 2023) | Local/global grouped, low-rank | Four-stage cross-scale hierarchy | Vision |
| Butterfly MLP (Sapkota et al., 2023) | Radix-based, permuted groupings | Log-depth permutation-mixing tree | Sequence, vision |
| DMF (Hu et al., 2018) | Dense shared MLP layers between modalities | Hierarchical shared-layer stacking | Multimodal fusion |
Together, these architectures demonstrate the flexibility and effectiveness of hierarchical grouped-MLP fusion as a general deep learning design paradigm, with efficiency, expressive capacity, and domain adaptability at its core.