
Hierarchical Grouped-MLP Fusion

Updated 9 February 2026
  • Hierarchical grouped-MLP fusion is a neural network design paradigm that partitions high-dimensional inputs into groups and employs hierarchically structured MLP mixing to capture multi-level interactions.
  • It leverages learnable grouping, blockwise local mixing, and pyramid-like aggregation to drastically reduce parameter counts and computational costs compared to dense MLPs.
  • This approach has shown empirical success in vision, sequence modeling, tabular data, and multimodal tasks, achieving state-of-the-art results with efficient resource use.

Hierarchical grouped-MLP fusion refers to a family of neural network design paradigms in which multilayer perceptrons (MLPs) are hierarchically organized and fused across feature, spatial, or modal groupings, often with learnable grouping, pooling, or grouping-permutation strategies. This class includes architectures such as Group-Connected MLPs, CS-Mixer, Butterfly/Dimension Mixers, and gMLP-based hierarchical vision models, each leveraging structured group-wise mixing and hierarchical pooling or merging to increase efficiency, expressive power, or multi-level interaction—all without global, fully dense connections. Core to this paradigm are learnable or structured grouping operations, local groupwise MLP interactions, and the aggregation of higher-order dependencies via hierarchical tree-like or pyramid mechanisms.

1. Core Principles of Hierarchical Grouped-MLP Fusion

The central principle underlying hierarchical grouped-MLP fusion is partitioning a high-dimensional input (features, tokens, modalities, or spatial locations) into groups, applying trainable MLP-based mixing or gating within each group, and hierarchically merging or permuting groups at deeper layers. This process can be realized with:

  • Learned or structured group assignments: Via softmaxed grouping matrices with entropy regularization, fixed windowing, or pre-specified blockings.
  • Local group-wise MLP fusion: MLPs operate only on group-contained features or tokens, reducing parameter and computational complexity.
  • Hierarchical aggregation: Deeper layers merge, pool, or mix outputs from child groups, forming tree or pyramid hierarchies capturing high-order dependencies.
  • Cross-scale and multi-level fusion: Multi-stage or skip-connected groupwise fusions enable both fine and global interactions.

These strategies contrast directly with the all-to-all connectivity of traditional dense MLPs, offering efficiency, modularity, and—when groups and their merging are engineered or learned judiciously—state-of-the-art performance in domains ranging from tabular data to large-scale vision tasks (Kachuee et al., 2019, Cui et al., 2023, Sapkota et al., 2023, Go et al., 2022).
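To make the contrast with dense all-to-all mixing concrete, the sketch below (NumPy; all sizes, names, and the ReLU choice are illustrative, not taken from any cited paper) applies an independent small MLP to each feature group and compares parameter counts against a single dense layer over the same flattened input:

```python
import numpy as np

rng = np.random.default_rng(0)

def grouped_mlp_layer(x, weights):
    """Apply an independent linear + ReLU to each feature group.

    x: (batch, n_groups, group_dim); weights: one (group_dim, group_dim)
    matrix per group. Each group is mixed only locally, so parameters
    scale as n_groups * group_dim^2 instead of (n_groups * group_dim)^2.
    """
    out = np.empty_like(x)
    for g, W in enumerate(weights):
        out[:, g, :] = np.maximum(x[:, g, :] @ W, 0.0)  # per-group mixing
    return out

# Hypothetical sizes: 8 groups of 16 features each.
batch, n_groups, group_dim = 4, 8, 16
x = rng.standard_normal((batch, n_groups, group_dim))
weights = [rng.standard_normal((group_dim, group_dim)) for _ in range(n_groups)]
y = grouped_mlp_layer(x, weights)

grouped_params = n_groups * group_dim ** 2   # 8 * 256 = 2048
dense_params = (n_groups * group_dim) ** 2   # 128^2 = 16384
```

Even at this toy scale, the grouped layer uses 8× fewer weights; hierarchical merging (Section 3) then restores cross-group interaction at deeper layers.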

2. Learned Grouping and Local Mixing Mechanisms

Several methodologies exploit learnable grouping for fusing input features:

  • Soft Grouping Matrices: In Group-Connected MLPs (GMLP), each layer uses a soft grouping matrix $G^t \in \mathbb{R}^{d \times g_t}$ parameterized by logits $Z^t$ and temperature $T$. As $T \to 0$, each feature is essentially assigned to a single group, promoting hard partitions. An entropy penalty $-\sum_{i,j} G^t_{ij} \log G^t_{ij}$ is added to the loss, balancing sharp assignments against smooth optimization (Kachuee et al., 2019).
  • Blockwise and Windowed Local MLPs: In gSwin, the spatial feature map is partitioned into non-overlapping or shifted windows, with each window processed by a windowed spatial gating unit (SGU) that mixes tokens/grouped locations via an MLP and gating mechanism. The SGU’s gating matrix is block-diagonal over all windows, keeping parameter growth in check (Go et al., 2022).
  • Butterfly/Dimension Groupings: Hierarchically permuted groupings, inspired by the FFT’s butterfly structure, perform blockwise MLP mixing with a new group structure at each stage, such that after $O(\log_r N)$ steps all coordinates can communicate globally. These groupings are determined by radix-$r$ representations of indices and systematic permutations (Sapkota et al., 2023).
  • Low-Rank and Grouped Spatio-Channel Mixing: CS-Mixer splits spatial tokens into groups in both local and global configurations and applies low-rank, multi-head MLP mixing in each, capturing interactions across height, width, and channels jointly via grouped transforms (Cui et al., 2023).

These mechanisms impose structural sparsity and partitioning to obtain scalable, expressive mixing and group-based regularization.
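The temperature-controlled soft grouping described in the first bullet can be sketched as follows. The function names and the small logits matrix are hypothetical, but the softmax-with-temperature assignment and the entropy penalty follow the formulas quoted above:

```python
import numpy as np

def soft_grouping(logits, temperature):
    """Row-wise softmax of logits / T: a (d features x g groups) soft
    grouping matrix whose rows sum to 1."""
    z = logits / temperature
    z -= z.max(axis=1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy_penalty(G, eps=1e-12):
    """-sum_ij G_ij log G_ij: small when assignments are nearly one-hot."""
    return -np.sum(G * np.log(G + eps))

# Hypothetical logits for d=4 features and g=3 groups.
logits = np.array([[ 2.0,  0.1, -1.0],
                   [ 0.5,  1.5,  0.0],
                   [-0.3,  0.2,  1.1],
                   [ 1.0, -1.0,  0.4]])

G_soft = soft_grouping(logits, temperature=1.0)
G_hard = soft_grouping(logits, temperature=0.01)  # T -> 0: near one-hot rows
```

Lowering the temperature sharpens each row toward a hard single-group assignment, which the entropy penalty rewards while the smooth softmax keeps the assignment differentiable during training.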

3. Hierarchical Fusion and Aggregation Structures

Hierarchical grouped-MLP fusion leverages tree- or pyramid-like aggregation, enabling:

  • Tree-like Pooling Hierarchies: After groupwise MLP mixing, GMLP merges groups in a binary-tree fashion using mean, max, or linear pooling operators ($P^t$). With each pooling, the number of groups is halved, increasing group capacity at coarser levels. The top-level concatenation feeds into a final output MLP (Kachuee et al., 2019).
  • Pyramidal Staging in Vision: Models like gSwin and CS-Mixer use multi-stage spatial downsampling (via patch merging or pooling) to form spatial and channel pyramids. This yields a multiscale hierarchy akin to CNN backbones but relies on grouped MLP or gating fusion within each stage (Go et al., 2022, Cui et al., 2023).
  • Repeated Permutation-Hierarchies (Butterfly): Dimension Mixer/Butterfly MLPs recursively permute and hierarchically regroup dimensions according to radix-decompositions, ensuring efficient all-to-all communication with $O(\log_r N)$ mixing depth, and groupwise local MLPs per permutation stage (Sapkota et al., 2023).
  • Dense Stacking Across Modalities: In dense multimodal fusion, shared MLP fusion layers are stacked between unimodal subnetworks at every depth, with skip and dense connections among all shared fusion points, allowing multi-level cross-modal supervision and gradient propagation (Hu et al., 2018).

These hierarchical strategies ensure that higher-level units integrate increasingly global contextual information, either spatially, across features, or across modalities.
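A minimal sketch of the tree-like pooling hierarchy from the first bullet, assuming a power-of-two number of groups and mean pooling (the simplest of the operators mentioned); this is purely illustrative, not the papers' implementation:

```python
import numpy as np

def tree_pool(groups, op=np.mean):
    """Merge adjacent group pairs level by level until one group remains.

    groups: (n_groups, group_dim) with n_groups a power of two. Returns
    the list of levels, whose group counts halve at each step, mirroring
    the binary-tree aggregation described above.
    """
    levels = [groups]
    while groups.shape[0] > 1:
        # pool each adjacent pair of child groups into one parent group
        groups = op(groups.reshape(-1, 2, groups.shape[1]), axis=1)
        levels.append(groups)
    return levels

rng = np.random.default_rng(0)
levels = tree_pool(rng.standard_normal((8, 4)))
sizes = [lvl.shape[0] for lvl in levels]   # 8 -> 4 -> 2 -> 1
```

In the full architecture, each level would interleave a groupwise MLP before pooling, so coarser levels mix increasingly global information rather than just averaging it.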

4. Computational and Memory Efficiency

Hierarchical grouped-MLP fusion enables dramatic reductions in parameter count and FLOPs compared to dense MLPs:

  • Complexity Scaling: The Butterfly/Dimension Mixer requires $O(N m \log_r N)$ total parameters versus $O(N^2)$ for a dense layer, for fixed per-block hidden size $m$ (e.g., $m = r$). FLOPs scale similarly: $O(N r \log_r N)$ versus dense $O(N^2)$. Empirically, 4× lower MACs and parameter count can be attained at minimal performance loss (e.g., CIFAR-10 with 144K params and 5M MACs for Butterfly vs. 352K/20M for dense, with <1% accuracy drop) (Sapkota et al., 2023).
  • Groupwise and Block-diagonal Efficiency: In gSwin, gating weight matrices scale as $K \cdot (M^2 \cdot M^2 + M^2)$ per block (e.g., $12 \cdot 2450 = 29400$ for $M = 7$, $K = 12$), much lower than the global $N_i^2$ when $N_i$ is large (Go et al., 2022).
  • Low-Rank Grouped Mixing: CS-Mixer's low-rank, grouped mixing attains $O(m N (c^2 + c d + L^2 d^2))$ cost per layer, with $L = g^2 \ll N$ and $d \ll c$, avoiding the $O(N^2 c^2)$ explosion of naïve 3-axis full mixing (Cui et al., 2023).
  • Sparse Activation Storage: With nearly one-hot group assignment matrices, GMLP’s memory for assignment storage is $O(nd)$ in lower layers and $O(g_{t-1} g_t)$ in higher ones, scaling linearly with group count rather than quadratically with dimension (Kachuee et al., 2019).

A plausible implication is that such architectures are especially advantageous for large input spaces, long sequences, or high-resolution visual inputs, where dense all-to-all mixing is prohibitive or over-parameterized.
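The asymptotic comparison in the first bullet can be checked with a small back-of-the-envelope calculation. The functions below count weight matrices only (no biases), under the idealized assumptions stated in the comments; they illustrate the scaling, not the papers' exact accounting:

```python
import math

def dense_params(N):
    """Weights of one dense N x N mixing layer."""
    return N * N

def butterfly_params(N, r):
    """Weights of an idealized butterfly mixer with radix r and per-block
    hidden size m = r: log_r N stages, each holding N/r blocks of r x r
    mixing, i.e. N * r * log_r N total (assumes N is a power of r)."""
    stages = round(math.log(N, r))
    return stages * (N // r) * (r * r)

N, r = 4096, 4
b = butterfly_params(N, r)   # 6 stages * 1024 blocks * 16 = 98304
d = dense_params(N)          # 4096^2 = 16777216
```

At this (hypothetical) size the butterfly layout uses roughly 170× fewer weights than a dense mixer over the same $N$ coordinates, which is the gap the $O(N m \log_r N)$ versus $O(N^2)$ scaling predicts.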

5. Empirical Performance and Domain Applications

Hierarchical grouped-MLP fusion has been applied and evaluated across several domains:

  • Tabular and Non-structured Data: GMLP attains state-of-the-art classification performance on tabular and mixed-feature datasets, automatically discovering expressive feature subsets and interactions (Kachuee et al., 2019).
  • Computer Vision:
    • gSwin (“gated MLP in shifted window hierarchy”) achieves higher accuracy than equivalently sized Swin Transformers on ImageNet-1K (+0.4%), COCO detection (+0.45 box-AP) and ADE20K (+1.8 mIoU), with ~20% fewer parameters (Go et al., 2022).
    • CS-Mixer demonstrates ImageNet-1k top-1 accuracy of 83.2% at 13.7 GFLOPs and 94M parameters (CS-Mixer-L), surpassing same-compute 2-axis mixers by >1.5% and matching or exceeding axial/H–C mixers with much lower compute (Cui et al., 2023).
  • Sequence Modeling and Efficient Attention: Butterfly-MLP and Butterfly Attention in Vision Transformers and LRA tasks provide sub-quadratic scaling (e.g., 20% lower memory, up to +2% accuracy versus dense attention) and unique ability to handle sequence lengths up to 16K+ (Sapkota et al., 2023).
  • Multimodal Fusion: Dense Multimodal Fusion (DMF) leverages hierarchically layered shared MLP fusion across modalities, resulting in richer joint representations and accelerated convergence, particularly when one modality is noisy or absent (Hu et al., 2018).

A plausible implication is that hierarchical grouped-MLP fusion enhances robustness and generalization, particularly in settings with heterogeneous input structure or multi-resolution dependencies.

6. Structural Attributes and Generalization Extensions

Hierarchical grouped-MLP fusion subsumes and generalizes numerous classical and contemporary ideas:

  • Permutation and Grouping Flexibility: The index-based groupings in Butterfly/Dimension Mixers are theoretically universal for invertible transformations, recovering the full connectivity of dense MLPs as group size and depth scale (Sapkota et al., 2023).
  • Multi-head and Multi-modality: Grouped mixing can be extended with multi-head gating (as in gSwin) or with task-specific grouping functions for multi-task or cross-modal learning (Go et al., 2022, Kachuee et al., 2019).
  • Shifted and Shift-free Grouping: Alternation of standard and cyclically shifted window partitions in vision ensures eventual full neighborhood interaction across spatial locations (Go et al., 2022).
  • Low-Rank and Dynamic Mixers: Low-rank grouped mixing (CS-Mixer) balances capacity and efficiency, with the rank dd and group size gg as critical hyperparameters for trade-offs in expressive power versus compute (Cui et al., 2023).
  • Generalization to Nonlinear and Hierarchical Domains: The same principles extend across domains—tabular, sequential, visual, and multimodal—with appropriate design of groupings, local mixing blocks, and hierarchical aggregation.
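To make the universality claim in the first bullet concrete, the sketch below builds radix-r butterfly groupings by digit position; it is an illustrative construction following the radix description above, not the papers' code. At stage s, indices that agree on every radix digit except digit s share a block, so each stage mixes along one digit and after one stage per digit any two indices can exchange information:

```python
def butterfly_groups(n_digits, r):
    """Return, for each of n_digits stages, the list of mixing blocks
    over N = r**n_digits indices. Blocks at stage s contain the r indices
    that differ only in radix-r digit s."""
    N = r ** n_digits
    stages = []
    for s in range(n_digits):
        blocks = {}
        for i in range(N):
            digits = [(i // r ** k) % r for k in range(n_digits)]
            digits[s] = -1                  # ignore the digit being mixed
            blocks.setdefault(tuple(digits), []).append(i)
        stages.append(list(blocks.values()))
    return stages

# r=2 with 3 digits (N=8): three stages of four size-2 blocks, enough
# for information to propagate between any pair of the 8 coordinates.
stages = butterfly_groups(3, 2)
```

Because each stage's blocks have size $r$ and there are $\log_r N$ stages, this recovers the log-depth all-to-all communication pattern while every local MLP stays tiny.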

7. Representative Architectures: Comparison Table

Below is a summary contrasting representative hierarchical grouped-MLP fusion models and their key attributes:

| Model | Grouping Strategy | Hierarchical Fusion | Core Domain |
| --- | --- | --- | --- |
| GMLP (Kachuee et al., 2019) | Learnable, sparse softmax groups | Tree-like pooling after groupwise MLPs | Tabular, general |
| gSwin (Go et al., 2022) | Non-overlapping and shifted windows | Multi-stage spatial pyramid | Vision |
| CS-Mixer (Cui et al., 2023) | Local/global grouped, low-rank | Four-stage cross-scale hierarchy | Vision |
| Butterfly MLP (Sapkota et al., 2023) | Radix-based, permuted groupings | Log-depth permutation-mixing tree | Sequence, vision |
| DMF (Hu et al., 2018) | Dense shared MLP layers between modalities | Hierarchical shared-layer stacking | Multimodal fusion |

Together, these architectures demonstrate the flexibility and effectiveness of hierarchical grouped-MLP fusion as a general deep learning design paradigm, with efficiency, expressive capacity, and domain adaptability at its core.
