Flex-MA Branch: Adaptive Multimodal Fusion

Updated 6 January 2026

Flex-MA Branch is a set of dynamic modules integrating architectural and algorithmic strategies for efficient global context aggregation across multiple domains.
It leverages reconfigurable mechanisms such as masked bidirectional state-space models and alternating local-global processing to optimize multimodal fusion.
Empirical results demonstrate significant performance gains, including up to 59.8% improvement in ISAC systems and enhanced spatial coverage in wireless arrays.

The Flex-MA Branch refers to a class of architectural and algorithmic modules enabling flexible, high-efficiency global context aggregation or multimodal fusion across a variety of domains, including vision-LLMs, wireless beamforming, and modality-agnostic transformers. The concept leverages dynamically reconfigurable mechanisms (e.g., masked bidirectional state-space models, movable antenna placement, or unified fusion blocks) to address key constraints in scalability, inference flexibility, and adaptability for complex contexts.

1. Bidirectional State-Space Flex-MA in Multimodal Transformers

The most prominent recent instantiation of a Flex-MA Branch is within the M-MATE block of LinMU, a linear-complexity vision-LLM. Here, the Flex-MA branch is a masked bidirectional state-space model (SSM), built on the Mamba2 family, and designed for global mixing of multimodal token sequences with linear $\mathcal{O}(N)$ complexity (Wang et al., 4 Jan 2026).

Architectural Overview:

Each M-MATE block replaces Transformer self-attention with two submodules:
- Flex-MA branch (masked bidirectional SSM for global context)
- Local-Swin branch (windowed attention for local spatiotemporal correlation)
Input embeddings $\{z_t\}_{t=1}^N$ (text and vision) are rearranged via Rotary Major-Scan (RMS), passed through the bidirectional SSM, then inverse-permuted.

Mathematical Formulation:

Forward SSM pass:

$h_t^{\rightarrow} = A_t h_{t-1}^{\rightarrow} + B_t u_t; \quad y_t^{\rightarrow} = C_t h_t^{\rightarrow}$

Backward SSM with vision-mask:

$h_t^{\leftarrow} = \tilde{A}_t h_{t+1}^{\leftarrow} + \tilde{B}_t (\mathcal{M}_V \odot u_t); \quad y_t^{\leftarrow} = \tilde{C}_t h_t^{\leftarrow}$

Output gating and projection:

$y_t^{\mathrm{Flex}} = W_{O,\mathrm{Flex}} \left( g_t \odot h_t^{\rightarrow} + (1 - g_t) \odot h_t^{\leftarrow} \right)$

where $g_t \in [0,1]^d$ is a per-token gate.
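The recurrences and the gated fusion can be illustrated with a small NumPy sketch. It is not the LinMU implementation: it assumes diagonal (elementwise) transition and input terms, a state size equal to the model width $d$, and externally supplied gates. The function name `flex_ma_sketch` and all argument names are hypothetical.

```python
import numpy as np

def flex_ma_sketch(u, vision_mask, A, B, A_bwd, B_bwd, gate, W_out):
    """Masked bidirectional diagonal SSM with gated fusion (illustrative only).

    u           : (N, d) input tokens
    vision_mask : (N,) 1.0 for vision tokens, 0.0 for text tokens (plays the role of M_V)
    A, B        : (N, d) per-token forward transition / input terms (assumed diagonal)
    A_bwd, B_bwd: (N, d) per-token backward transition / input terms (assumed diagonal)
    gate        : (N, d) per-token gate values in [0, 1]
    W_out       : (d, d) output projection
    """
    N, d = u.shape
    h_fwd = np.zeros((N, d))
    h_bwd = np.zeros((N, d))

    # Forward pass: h_t = A_t * h_{t-1} + B_t * u_t
    state = np.zeros(d)
    for t in range(N):
        state = A[t] * state + B[t] * u[t]
        h_fwd[t] = state

    # Backward pass over vision-masked inputs: h_t = A~_t * h_{t+1} + B~_t * (M_V * u_t)
    masked_u = vision_mask[:, None] * u
    state = np.zeros(d)
    for t in range(N - 1, -1, -1):
        state = A_bwd[t] * state + B_bwd[t] * masked_u[t]
        h_bwd[t] = state

    # Gated fusion of the two hidden-state streams, then output projection
    fused = gate * h_fwd + (1.0 - gate) * h_bwd
    return fused @ W_out.T

rng = np.random.default_rng(0)
N, d = 6, 4
u = rng.normal(size=(N, d))
mask = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0])  # hypothetical text/vision layout
A = rng.uniform(0.5, 0.9, size=(N, d))
B = rng.normal(size=(N, d))
gate = rng.uniform(size=(N, d))
W = rng.normal(size=(d, d))
y = flex_ma_sketch(u, mask, A, B, A, B, gate, W)
```

Both loops visit each token once, so the cost of the recurrences grows linearly with the sequence length $N$.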

Complexity and Empirical Validation:

The branch runs two linear-time passes (forward and backward), for a total cost of $\mathcal{O}(N d^2)$.
Ablation: removing the Flex-MA branch causes a catastrophic performance drop on long-context video tasks, which indicates that the branch is essential for global information propagation.
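As a rough worked comparison, the stated $\mathcal{O}(N d^2)$ cost can be set against the standard $\mathcal{O}(N^2 d)$ cost of full self-attention mixing. Constant factors and projection costs are ignored, and the values of $N$ and $d$ below are hypothetical, chosen only to make the arithmetic concrete.

```latex
\[
\frac{\text{attention cost}}{\text{SSM cost}}
  \sim \frac{N^2 d}{N d^2} = \frac{N}{d},
\qquad
N = 65{,}536,\; d = 4{,}096
\;\Rightarrow\;
\frac{N}{d} = 16 .
\]
```

Under these assumptions the advantage grows in proportion to the sequence length once $N$ exceeds $d$.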

Distillation and Weight Initialization:

Q/K/V/O projections from the original attention layer are mapped to the SSM's main projections.
Distillation first targets only the Flex-MA branch (hidden-state and token-level regression), then trains it jointly with the Local-Swin branch, followed by LoRA fine-tuning.

2. Joint Weight-Position Optimization for Movable Antenna Arrays

In flexible beam coverage for movable-antenna arrays, the Flex-MA methodology jointly optimizes analog beamforming weights $\omega$ and physical antenna positions $x$ within a continuous domain (Wang et al., 2024).

System Model:

$N$ isotropic antennas on a rail of length $D$, with $x = [x_1, \ldots, x_N]^T$, $0 \leq x_1 < \ldots < x_N \leq D$.
Analog beamforming vector: $\omega = \frac{1}{\sqrt{N}}[e^{j\phi_1},\ldots,e^{j\phi_N}]^T$
Gain toward angle $\theta$: $G(\omega,x,\theta) = |\omega^H a(x, \theta)|^2$
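The gain expression can be evaluated numerically with a short sketch. The text does not define the steering vector $a(x, \theta)$, so the far-field form $a_n = e^{j 2\pi x_n \cos\theta / \lambda}$ used here is an assumption, as are the function names and the unit wavelength.

```python
import numpy as np

def steering_vector(x, theta, wavelength=1.0):
    # Assumed far-field model: phase 2*pi/lambda * x_n * cos(theta) at antenna n
    return np.exp(1j * 2 * np.pi / wavelength * np.asarray(x) * np.cos(theta))

def beam_gain(phases, x, theta, wavelength=1.0):
    # Constant-modulus analog weights: omega_n = exp(j*phi_n) / sqrt(N)
    n = len(phases)
    w = np.exp(1j * np.asarray(phases)) / np.sqrt(n)
    # np.vdot conjugates its first argument, so this is |omega^H a|^2
    return np.abs(np.vdot(w, steering_vector(x, theta, wavelength))) ** 2

x = np.array([0.0, 0.5, 1.0, 1.5])                # positions in wavelengths (hypothetical)
theta = np.pi / 3
matched = np.angle(steering_vector(x, theta))     # phases aligned with the steering vector
gain = beam_gain(matched, x, theta)
```

With phases matched to the steering vector, the gain reaches its maximum value $N$ under the $1/\sqrt{N}$ normalization.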

Optimization Problem:

Maximize the minimum beam gain $t$ over a union of $I$ angular coverage regions:

$\max_{\omega, x, t} \ t; \quad \text{s.t. } |\omega^H a(x, \theta)|^2 \geq t, \ \forall\theta \in \bigcup_{i=1}^I \mathcal{E}_i$

Additional constraints: $0 \leq x_n \leq D$, $x_{n} - x_{n-1} \geq d_{\min}$, $|\omega(n)| = 1/\sqrt{N}$.

Algorithm: AO + SCA

Alternating Optimization: alternate between a beamforming-weight update (an SDP using a convex surrogate for $V=\omega\omega^H$) and a position update (a QCQP obtained by SCA on low-order Taylor expansions of the trigonometric array response).
Each optimization subproblem uses a convex formulation via surrogate approximation, converging under standard MM/EM majorization criteria.
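The alternating structure can be sketched without a convex solver. The code below is not the AO + SCA algorithm: the SDP weight step is swapped for a per-antenna phase grid search, and the SCA position step for feasible random perturbations that are accepted only when they raise the minimum gain. The steering-vector model, the grid size, and the perturbation scale are all assumptions.

```python
import numpy as np

def min_gain(phases, x, thetas, wavelength=1.0):
    # Rows of `a` are a(x, theta) under an assumed cos(theta) far-field phase model
    a = np.exp(1j * 2 * np.pi / wavelength * np.outer(np.cos(thetas), x))
    w = np.exp(1j * phases) / np.sqrt(len(x))
    return np.min(np.abs(a @ np.conj(w)) ** 2)    # min over angles of |omega^H a|^2

def feasible(x, D, d_min):
    # Rail bounds and minimum spacing between neighbouring antennas
    return bool(x[0] >= 0 and x[-1] <= D and np.all(np.diff(x) >= d_min))

def alternating_search(x, phases, thetas, D, d_min, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    x, phases = x.copy(), phases.copy()
    best = min_gain(phases, x, thetas)
    history = [best]
    grid = np.linspace(0, 2 * np.pi, 16, endpoint=False)
    for _ in range(iters):
        # Weight step (stand-in for the SDP): coordinate-wise phase grid search
        for n in range(len(x)):
            for phi in grid:
                trial = phases.copy()
                trial[n] = phi
                val = min_gain(trial, x, thetas)
                if val > best:
                    best, phases = val, trial
        # Position step (stand-in for SCA): keep feasible moves that improve t
        for n in range(len(x)):
            trial = x.copy()
            trial[n] += rng.normal(scale=0.05 * D)
            if feasible(trial, D, d_min):
                val = min_gain(phases, trial, thetas)
                if val > best:
                    best, x = val, trial
        history.append(best)
    return x, phases, history

thetas = np.deg2rad(np.linspace(60, 120, 31))     # one hypothetical coverage region
x_opt, ph_opt, hist = alternating_search(
    np.array([0.0, 0.5, 1.0, 1.5]), np.zeros(4), thetas, D=4.0, d_min=0.5, iters=5)
```

Because each step only accepts improvements, the recorded minimum gain is non-decreasing across iterations, which mirrors the monotone behaviour expected of alternating schemes even though the substeps here are far cruder than SDP and SCA.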

Performance:

Substantial flattening of the beam profile across multiple spatial coverage regions.
Non-uniform and aperture-stretching optimal positions yield 3–6 dB gain over fixed-position arrays for large angular spans.

3. Fractional-Programming Flex-MA for ISAC (Integrated Sensing and Communication)

In ISAC systems, Flex-MA enables flexible array response reconfiguration by joint optimization of both digital beamformers and antenna positions (Lyu et al., 2024).

System and Formulation:

Dual-functional base station, $N$ movable transmit antennas, $K$ single-antenna users, bistatic radar sensing.
Weighted-sum objective: maximize $\omega_c \sum_{k=1}^K \mathrm{R}_k + \omega_s \mathrm{R}_s$ (multiuser rate + radar mutual information).
Subject to transmit power and placement constraints $x \in [X_{\mathrm{min}}, X_{\mathrm{max}}]^N$, $|x_i - x_j| \geq D_0$.

Solution:

Fractional Programming with Block Coordinate AO:
- KKT-based closed-form update for digital beamformers.
- Search-based Projected Gradient Ascent (SPGA) for positions.
Empirical results report up to 59.8% improvement in the weighted-sum objective over fixed-array baselines with the proposed Flex-MA SPGA solution.
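The position update can be illustrated with a generic projected gradient ascent loop. The text does not spell out the details of SPGA, so the following is only a sketch under assumptions: a halving step-size search, projection onto the box by clipping, rejection of candidates that violate the spacing constraint, and a finite-difference gradient. The toy objective stands in for the weighted rate and sensing objective, which is not reproduced here.

```python
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    # Central finite differences; a stand-in for an analytic gradient
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def spacing_ok(x, d0):
    # |x_i - x_j| >= D_0 for all pairs is equivalent to checking sorted neighbours
    return bool(np.all(np.diff(np.sort(x)) >= d0))

def projected_gradient_ascent(f, x0, x_min, x_max, d0, iters=50, step0=1.0):
    x = np.asarray(x0, dtype=float).copy()
    values = [f(x)]
    for _ in range(iters):
        g = numerical_grad(f, x)
        step = step0
        # Step-size search: halve until a feasible, improving candidate is found
        while step > 1e-8:
            cand = np.clip(x + step * g, x_min, x_max)   # projection onto the box
            if spacing_ok(cand, d0) and f(cand) > values[-1]:
                x = cand
                break
            step *= 0.5
        values.append(f(x))
    return x, values

# Hypothetical concave toy objective whose unconstrained optimum lies partly outside the box
target = np.array([0.2, 1.0, 3.5])
toy_objective = lambda x: -np.sum((x - target) ** 2)
x_final, values = projected_gradient_ascent(
    toy_objective, [0.5, 1.5, 2.5], x_min=0.0, x_max=3.0, d0=0.5)
```

In a block coordinate scheme, a loop of this kind would alternate with the closed-form beamformer update, each holding the other block fixed.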

4. Flex-MA Concepts in Modality-Agnostic Vision Transformers

The Flex-MA branch, as instantiated in the MA-ViT model, refers to a modality-agnostic, early-fusion single-branch framework for multi-modal token sequences (Liu et al., 2023).

Architecture:

Early fusion: Input images from all available modalities are patch-tokenized and embedded as a single token sequence.
A Transformer encoder alternates standard self-attention/MLP blocks (STB) with Modality-Agnostic Transformer Blocks (MATB), where MATB implements:
- Modal-Disentangle Attention (MDA): removes modality-specific information from class tokens.
- Cross-Modal Attention (CMA): fuses complementary features between modalities.
The entire pipeline operates on a single branch, obviating the need for modality-matched test configurations.

Inference Flexibility:

Any subset of input modalities can be processed at test time without retraining or rearchitecting.
Classification outputs (liveness, modality) are generated from a shared MLP head on the class token, with cross-modal information optionally fused via CMA.
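The early-fusion input path and the any-subset property can be sketched as follows. This is not the MA-ViT code: the modality names, the patch size, the use of one shared class token, and the per-modality linear patch embeddings are all assumptions made for illustration.

```python
import numpy as np

def patchify(img, patch):
    # Split an (H, W, C) image into non-overlapping flattened patches
    H, W, C = img.shape
    gh, gw = H // patch, W // patch
    img = img[:gh * patch, :gw * patch]
    p = img.reshape(gh, patch, gw, patch, C).transpose(0, 2, 1, 3, 4)
    return p.reshape(gh * gw, patch * patch * C)

def build_token_sequence(images, embed, cls_token, patch=4):
    """Concatenate a class token with patch tokens of whichever modalities are present.

    images    : dict modality name -> (H, W, C) array
    embed     : dict modality name -> (patch*patch*C, d) embedding matrix
    cls_token : (d,) class token (a single shared token is assumed here)
    """
    tokens = [cls_token[None, :]]
    for name in sorted(images):                   # fixed order, any subset allowed
        tokens.append(patchify(images[name], patch) @ embed[name])
    return np.concatenate(tokens, axis=0)

rng = np.random.default_rng(0)
d = 8
embed = {"rgb": rng.normal(size=(4 * 4 * 3, d)),      # hypothetical modalities
         "depth": rng.normal(size=(4 * 4 * 1, d))}
cls = rng.normal(size=d)
rgb = rng.normal(size=(16, 16, 3))
depth = rng.normal(size=(16, 16, 1))

seq_both = build_token_sequence({"rgb": rgb, "depth": depth}, embed, cls)
seq_rgb = build_token_sequence({"rgb": rgb}, embed, cls)
```

Only the sequence length changes when a modality is missing, so the same encoder and head can consume either sequence, which is the property the single-branch design relies on.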

5. Practical Implementation, Complexity, and Applications

Computational Complexity

| Domain | Algorithmic Cost | Key Feature |
| --- | --- | --- |
| LinMU Flex-MA SSM (Wang et al., 4 Jan 2026) | $\mathcal{O}(N d^2)$ | Linear global mixing, no $N \times N$ attention matrix |
| Movable-Antenna Flex-MA (Wang et al., 2024) | Convex surrogate per AO iteration | Joint spatial-beam coverage |
| ISAC Flex-MA (Lyu et al., 2024) | Block AO, KKT/gradient updates | Optimized for comm+radar tradeoff |
| MA-ViT Flex-MA (Liu et al., 2023) | Standard ViT + MDA/CMA FLOPs | Flexible uni-/multi-modal input |

Applications and Extensions

Long-context VLMs: Efficient global context capture without quadratic attention bottleneck (Wang et al., 4 Jan 2026).
Wide-area and multi-region wireless coverage: Dynamic reallocation of aperture capability for multi-region uniformity (Wang et al., 2024).
ISAC systems: Flexible trade-off between multiuser rate and sensing mutual information (MI) through reconfigurable antenna positions (Lyu et al., 2024).
Face anti-spoofing/flexible multimodal visual recognition: Any-modality inference and cross-modal generalization (Liu et al., 2023).

6. Comparative Performance and Empirical Validation

LinMU (Flex-MA + Local-Swin): On LongVideoBench, the full model achieves 58.8% test accuracy, compared with 51.3% for Flex-MA only and 30.2% for Local-Swin only, demonstrating the indispensable global context provided by the Flex-MA branch (Wang et al., 4 Jan 2026).
Movable Antenna Arrays: Flex-MA yields flat ($-0.5$ dB) coverage in the single-region setup and keeps the gain above $-1$ dB in all regions for multi-region setups, significantly outperforming fixed-position arrays, which show deep coverage notches (Wang et al., 2024).
ISAC (SPGA-FBF): 4 movable antennas with the Flex-MA approach outperform 8 fixed antennas by 59.8% in the weighted sum objective for combined rate and mutual information (Lyu et al., 2024).
MA-ViT: Nearly matches or surpasses two-branch fusion models in benchmark face anti-spoofing tasks, with lower parameter count and full flexibility in test-time input configurations (Liu et al., 2023).

Across these domains, the Flex-MA Branch introduces a design paradigm of dynamic, modality-independent, and jointly optimized mechanisms that achieves adaptive, global context reasoning at reduced computational cost while preserving flexibility at inference time. This methodology underpins advances in scalable multimodal models, deployable wireless systems, and modality-agnostic vision transformers.