
LinMU: Linear-complexity Multimodal Understanding

Updated 6 January 2026
  • LinMU is a novel class of vision-language architectures that achieves linear computational scaling by replacing quadratic self-attention with state-space and dual-branch modules.
  • The framework employs progressive distillation and adapter-based fusion to maintain accuracy while supporting high-resolution images and long-context videos.
  • Empirical benchmarks demonstrate significant speedups and memory reductions, enabling efficient deployment in real-time and resource-constrained applications.

Linear-complexity Multimodal Understanding (LinMU) defines a family of vision-language model (VLM) architectures and conversion frameworks that enable multimodal reasoning with strictly linear computational and memory complexity in sequence length. Conventional transformer-based VLMs are dominated by O(n²) self-attention, which hinders their scalability to high-resolution images and long-context video sequences. LinMU models eliminate all quadratic-complexity modules, either by architectural redesign or by progressive distillation into linear modules, while matching the accuracy of their quadratic-attention predecessors on a range of multimodal benchmarks. LinMU methods now cover unified VLM inference, multimodal generation, fine-grained understanding, and efficient deployment, as established in recent works spanning dual-branch state-space models, liquid-state decoders, and RNN-based frameworks (Wang et al., 4 Jan 2026, Liao et al., 18 Feb 2025, Kang et al., 20 May 2025, Trinh et al., 14 Nov 2025).

1. Motivation: Bottlenecks of Quadratic Attention

Multimodal reasoning tasks—such as long-context video QA, scene-text understanding, and diagram interpretation—often require processing input sequences with n ≫ 10⁴ tokens. In transformer-based VLMs, each self-attention layer forms an n×n similarity matrix, imposing O(n²·d) per-layer FLOPs and O(n²) memory growth. This prevents efficient deployment on edge devices, real-time applications, and fine-grained reasoning with high-resolution data. LinMU addresses this bottleneck directly by ensuring all core blocks (sequence mixing, cross-modal fusion, decoding) scale only linearly with sequence length (Wang et al., 4 Jan 2026).
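To make the scaling gap concrete, the per-layer costs can be compared with a back-of-envelope FLOP count. This is an illustrative sketch only: the constant factors and the hidden size d = 4096 are assumptions for the example, not figures from the cited papers.

```python
# Rough per-layer cost comparison: quadratic self-attention vs. linear
# state-space mixing. Constants and the hidden size are illustrative
# assumptions, not values taken from any LinMU paper.
def attention_flops(n: int, d: int) -> int:
    """Self-attention builds an n x n similarity matrix: O(n^2 * d)."""
    return 2 * n * n * d  # QK^T scores plus the attention-weighted sum of V

def ssm_flops(n: int, d: int) -> int:
    """Linear mixing does one O(d^2) state update per token: O(n * d^2)."""
    return 2 * n * d * d

d = 4096
for n in (1_000, 10_000, 100_000):
    ratio = attention_flops(n, d) / ssm_flops(n, d)  # equals n / d here
    print(f"n={n:>7}: attention/SSM FLOP ratio = {ratio:.1f}")
```

Under these assumptions the ratio grows as n/d, so once sequences exceed the hidden size (e.g. n ≫ 10⁴ tokens for video QA), quadratic attention pays an order of magnitude more compute per layer than a linear mixer.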

2. Architectural Principles of LinMU Models

LinMU exemplars span several architectural approaches, unified by their reliance on state-space modules and efficient mixing/fusion strategies:

  • State-Space Modules: Both the Mamba-2 and Liquid architectures instantiate discretized state-space recurrences of the form x_t = A x_{t-1} + B u_t, y_t = C x_t + D u_t, implemented as causal convolutions or gated matrix multiplications. These modules yield O(T·d²) complexity and a fixed-size memory footprint independent of sequence length (Zou et al., 11 Mar 2025, Liao et al., 18 Feb 2025, Trinh et al., 14 Nov 2025).
  • Dual-Branch Mixing (M-MATE Block): LinMU (Wang et al., 4 Jan 2026) replaces each self-attention layer with the M-MATE block—a dual-branch module combining global context (Flex-MA branch: bidirectional SSM) and local correlations (Local-Swin branch: Swin-style windowed attention). Fusion occurs via learned weighted summation and layer normalization, enabling both fine-grained and global reasoning.
  • Unified Autoregressive Decoding: Frameworks such as OmniMamba extend linear SSMs to unified next-token prediction—enabling both text and image generation through a shared backbone, with modality-specific output heads and decoupled vocabularies. This splits the output space into block-diagonal softmaxes, improving guidance and efficiency (Zou et al., 11 Mar 2025).
  • Adapter-Based Modality Fusion: ModRWKV (Kang et al., 20 May 2025) deploys lightweight modality adapters to map heterogeneous encoder features into a unified hidden dimension, followed by dynamic fusion via learned gating. Each adapter is a single-layer MLP trained with frozen backbones to accelerate convergence and maintain language priors.
  • Efficient Cross-Modal Fusion: Viper-F1 (Trinh et al., 14 Nov 2025) uses a single token-grid correlation pass across text and visual grids, feeding context into FiLM-style conditioning branches before SSM decoding. Cross-modal state modulation ensures fine-grained grounding with O(T) overhead.
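The state-space recurrence underlying these modules can be sketched in a few lines of NumPy. This is a minimal illustration with fixed random parameters; real LinMU-style blocks use learned, input-dependent (selective) parameters and hardware-efficient scan implementations, which are omitted here.

```python
import numpy as np

# Minimal discretized state-space recurrence from the text:
#   x_t = A x_{t-1} + B u_t ;  y_t = C x_t + D u_t
# Parameters here are random stand-ins, not learned weights.
rng = np.random.default_rng(0)
d_state, d_in = 16, 8
A = 0.9 * np.eye(d_state)                      # decay-style state transition
B = 0.1 * rng.standard_normal((d_state, d_in))
C = 0.1 * rng.standard_normal((d_in, d_state))
D = np.eye(d_in)                               # skip connection

def ssm_scan(u):
    """Run the recurrence over u of shape (T, d_in) in O(T) steps."""
    x = np.zeros(d_state)                      # fixed-size state, independent of T
    ys = []
    for u_t in u:                              # one O(d^2) update per token
        x = A @ x + B @ u_t
        ys.append(C @ x + D @ u_t)
    return np.stack(ys)

y = ssm_scan(rng.standard_normal((100, d_in)))
print(y.shape)  # (100, 8): output length matches input; state stays O(d^2)
```

The key property the sketch makes visible is that memory is a single state vector of size d_state regardless of sequence length, in contrast to the n×n attention matrix of a transformer layer.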

3. Training and Distillation Methodologies

Many LinMU models are created by converting a pre-trained quadratic-attention VLM (teacher) into a linear-complexity student via multi-stage distillation and parameter transfer:

  • Staged Distillation Protocol: For LinMU (Wang et al., 4 Jan 2026) and mmMamba (Liao et al., 18 Feb 2025), the conversion proceeds by (i) initializing state-space or mixer block weights directly from the Q, K, V, O projections of the teacher; (ii) training only the global (e.g., SSM) branch to mimic teacher layer outputs via hidden-state MSE and token-level KL objectives; (iii) unfreezing local mixing branches and adapter layers for joint tuning, regressing on both intermediate representations and output logits; and (iv) end-to-end fine-tuning with LoRA adapters across all sublayers for output-sequence consistency.
  • Adapter Warm-up and Joint Fine-Tuning: ModRWKV (Kang et al., 20 May 2025) performs an initial phase with all backbones frozen, updating only adapters and fusion gates, then a second phase where both adapters and the linear backbone are fine-tuned with reduced learning rates.
  • Hybrid Layering and Seeding: mmMamba-hybrid interleaves Transformer and Mamba-2 SSM layers at fixed intervals, preserving key quadratic-attention anchors for accuracy recovery, while mmMamba-linear pursues full-SSM conversion for maximum efficiency (Liao et al., 18 Feb 2025).
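The combined hidden-state MSE plus token-level KL objective used during branch mimicry can be sketched as follows. The helper names, the weighting alpha, and the temperature tau are assumptions for illustration, not values from the cited papers.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distill_loss(student_h, teacher_h, student_logits, teacher_logits,
                 alpha=0.5, tau=2.0):
    # Hidden-state regression: the student layer mimics the teacher's output.
    mse = np.mean((student_h - teacher_h) ** 2)
    # Token-level KL(teacher || student) on temperature-scaled logits.
    log_p = log_softmax(teacher_logits / tau)
    log_q = log_softmax(student_logits / tau)
    kl = np.mean(np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)) * tau ** 2
    return alpha * mse + (1 - alpha) * kl

# Identical student and teacher activations give zero loss; any mismatch
# in hidden states or logits is penalized.
rng = np.random.default_rng(0)
h, logits = rng.standard_normal((4, 64)), rng.standard_normal((4, 1000))
print(distill_loss(h, h, logits, logits))  # 0.0
```

In the actual protocols this loss would be applied per layer during the frozen-teacher stages, with only the student's linear branch receiving gradients.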

4. Computational Complexity and Scalability

All LinMU models formally guarantee per-layer and end-to-end complexity that is linear in sequence length n (O(n·d²)), fundamentally breaking the quadratic scaling of conventional self-attention mechanisms:

| Model / Block      | Core Mixing | Complexity | State Size |
|--------------------|-------------|------------|------------|
| Transformer        | Self-Attn   | O(n²·d)    | O(n·d)     |
| Mamba-2            | SSM         | O(n·d²)    | O(d²)      |
| RWKV7              | RNN / SSM   | O(n·d²)    | O(d)       |
| M-MATE (LinMU)     | SSM + Local | O(n·d²)    | O(d²)      |
| Viper-F1 (Liquid)  | SSM         | O(n·d²)    | O(d²)      |

Empirical throughput curves for LinMU, mmMamba, ModRWKV, and Viper-F1 demonstrate nearly linear token throughput and TTFT (Time-To-First-Token), with speedups from 2.7× (video QA, LinMU) up to 20.6× (mmMamba-linear, 103K tokens) against quadratic baselines (Wang et al., 4 Jan 2026, Liao et al., 18 Feb 2025, Trinh et al., 14 Nov 2025, Kang et al., 20 May 2025).

5. Empirical Performance and Benchmarking

LinMU and affiliated linear-architecture models have been numerically validated across a spectrum of multimodal tasks:

  • LinMU (8B, NVILA Teacher):
    • MMMU test: 44.6% (LinMU) vs 44.4% (teacher)
    • TextVQA: 79.3% vs 80.1%
    • LongVideoBench test: 58.8% vs 58.7%
    • Video-MME: 70.1% vs 70.0%
    • TTFT speedup: up to 2.7×; throughput: up to 9.0× improvement (Wang et al., 4 Jan 2026)
  • mmMamba-linear / hybrid:
    • Speedup: 20.6× (linear, 103K tokens), 13.5× (hybrid)
    • Memory reduction: 75.8% (linear), 60.2% (hybrid)
    • Multimodal QA benchmarks: matches or exceeds other linear VL models, with a small drop vs. pure Transformers on the hardest sets; the hybrid variant recovers most of the gap (Liao et al., 18 Feb 2025)
  • ModRWKV-3B:
    • VQA-v2: 78.3%, GQA: 60.8%, TextVQA: 51.1%, POPE: 87.1%
    • Latency: 0.05 ms/token; throughput: ~20K tokens/s (NVIDIA A800, 512 tokens) (Kang et al., 20 May 2025)
  • Viper-F1 (0.8B):
    • VQAv2: 76.6%, AI2D: 46.2%, MMMU_val: 26.4%, MME_p: 1376.2, highest perception scores among <3B models
    • Latency: 40.08 ms; throughput: 46.67 tokens/s (Trinh et al., 14 Nov 2025)

Ablation studies confirm the necessity of both global and local branches in M-MATE blocks (LinMU): removing either results in severe drops on fine-grained or contextual benchmarks.

6. Extensions, Applications, and Limitations

LinMU models have expanded the practicality of multimodal reasoning to real-time and resource-constrained scenarios:

  • Edge Deployment: Linear time and memory complexity allow execution on device-level GPUs and support minute-length video inputs, 4K images, and diagram reasoning in robotics and smart cameras (Wang et al., 4 Jan 2026, Trinh et al., 14 Nov 2025).
  • Modality Coverage: Frameworks span text-image understanding, image generation (OmniMamba), real-time VQA, audio/time-series fusion (ModRWKV), and fine-grained visual grounding (Viper-F1).
  • Limitations: Several models still rely on a pre-trained quadratic-attention teacher, and direct linearization of cross-attention for generative tasks remains ongoing research. Most small linear models accept only a single image as input; multi-frame and video-sequence support is less mature.

Planned directions include zero-teacher or native SSM block training (Wang et al., 4 Jan 2026, Liao et al., 18 Feb 2025), tri-modal extension to speech+vision+language (Kang et al., 20 May 2025), hardware/software co-design for microsecond-level inference, and broader application domains beyond VQA.

7. Summary and Impact

LinMU represents a rigorous, empirically-validated advance in multimodal machine understanding. Through architectural innovation (state-space modeling, dual-branch mixing, adapter-based fusion) and progressive teacher-student distillation, LinMU frameworks fundamentally break the quadratic complexity barrier, enabling scalable multimodal reasoning across previously intractable contexts. The approach now underlies multiple competitive benchmarks in visual QA, diagram understanding, video QA, and modular generation, providing new baselines and research avenues for efficient multimodal learning (Wang et al., 4 Jan 2026, Liao et al., 18 Feb 2025, Kang et al., 20 May 2025, Trinh et al., 14 Nov 2025, Zou et al., 11 Mar 2025).
