IEMA: Efficient Multi-Scale Attention Module

Updated 29 January 2026
  • The paper shows that IEMA significantly improves feature recalibration using multi-branch local and global attention, boosting small object detection accuracy.
  • IEMA’s architecture splits feature maps into groups and applies parallel depthwise convolutions for diverse receptive field extraction while minimizing computational cost.
  • Empirical results demonstrate that IEMA enhances various vision tasks—including detection, segmentation, classification—and LLM inference with negligible parameter growth.

An Improved Efficient Multi-scale Attention Module (IEMA) is an architectural enhancement for neural networks, particularly in vision and object detection, designed to efficiently capture dependencies across multiple spatial and channel scales with minimal computational overhead. IEMA builds directly on the foundations of Efficient Multi-Scale Attention (EMA) modules, introducing multi-branch parallelism, cross-spatial/global attention mechanisms, and refined attention mapping between different model scales. Its goal is to amplify essential semantic features, especially for small object detection and recognition in challenging multimodal or long-range contexts, while controlling parameter growth and FLOP count. Several studies have independently converged on IEMA designs for use in detection backbones, segmentation, classification, and efficient inference in both vision and LLMs (Ouyang et al., 2023, Lu et al., 25 Apr 2025, Xie et al., 16 Oct 2025, Zhao et al., 16 Jul 2025, Agrawal et al., 16 Mar 2025, Shah et al., 23 Jun 2025, Shang et al., 2023).

1. Architectural Principles and Variants

IEMA generalizes the attention mechanism to act both within and across scales and feature groups. The canonical structure includes:

  • Feature Grouping: The input tensor $X \in \mathbb{R}^{C \times H \times W}$ (or a variant) is split along the channel dimension into $G$ groups; each group $X_i \in \mathbb{R}^{C/G \times H \times W}$ is processed independently, enabling parallel, lightweight operations.
  • Parallel Multi-scale Local Attention: Within each group, parallel depthwise separable convolutions of varying kernel sizes (e.g., $3 \times 3$, $1 \times 5$, $5 \times 1$), plus an identity path, extract features over a diversity of receptive fields. The outputs are concatenated and fused by a pointwise convolution and a squashing function (often sigmoid) to produce a per-group attention map.
  • Cross-Spatial (Global) Attention: For each group, global context is modeled by channelwise averaging to produce a $1 \times H \times W$ map. Softmax operations over rows and columns yield separate spatial masks, which are recombined through matrix multiplication to produce global attention maps sensitive to object shape and position.
  • Re-weighting and Aggregation: The local and global attention maps modulate each group’s feature slice, and the outputs are summed. The $G$ re-weighted groups are concatenated to reconstruct the full channel dimension.

Variant IEMA structures include spectral-domain convolutions for hyperspectral segmentation (Shah et al., 23 Jun 2025), Transformer-based global modeling interfacing multi-scale feature sets (Xie et al., 16 Oct 2025), and attention-matrix mapping across model scales for LLM inference acceleration (Zhao et al., 16 Jul 2025). These variants inherit the core paradigm of leveraging multi-scale and cross-location context.

2. Mathematical Formulation and Algorithmic Flow

For a group-wise IEMA as presented in MASF-YOLO (Lu et al., 25 Apr 2025):

Given $X \in \mathbb{R}^{C \times H \times W}$, and letting $G$ be the number of groups, decompose $X$ into $X_i \in \mathbb{R}^{C/G \times H \times W}$ for $i = 1, \ldots, G$.

Local attention (for each $X_i$):

\begin{align*}
P_1 &= \mathrm{DWConv}_{3 \times 3}(X_i) \\
P_2 &= \mathrm{DWConv}_{1 \times 5}(X_i) \\
P_3 &= \mathrm{DWConv}_{5 \times 1}(X_i) \\
P_4 &= X_i \\
S_i &= \mathrm{Concat}[P_1, P_2, P_3, P_4] \\
U_i &= \mathrm{Conv}_{1 \times 1}(S_i) \\
A^L_i &= \sigma(U_i)
\end{align*}

where $\sigma$ is the sigmoid function.
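A minimal NumPy sketch of this local branch follows; random weights stand in for learned parameters, and `dwconv` is a naive depthwise convolution written for clarity rather than speed:

```python
import numpy as np

def dwconv(x, kh, kw, rng):
    """Naive depthwise conv with 'same' padding: one kh x kw kernel per channel."""
    C, H, W = x.shape
    w = rng.standard_normal((C, kh, kw)) * 0.1
    xp = np.pad(x, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x)
    for i in range(kh):
        for j in range(kw):
            out += w[:, i:i + 1, j:j + 1] * xp[:, i:i + H, j:j + W]
    return out

def local_attention(x_i, rng):
    """Multi-scale local branch: parallel DWConvs + identity, 1x1 fusion, sigmoid."""
    C = x_i.shape[0]
    p1 = dwconv(x_i, 3, 3, rng)
    p2 = dwconv(x_i, 1, 5, rng)
    p3 = dwconv(x_i, 5, 1, rng)
    s = np.concatenate([p1, p2, p3, x_i], axis=0)   # S_i: 4*(C/G) x H x W
    w_pw = rng.standard_normal((C, 4 * C)) * 0.1    # pointwise (1x1) conv weights
    u = np.einsum('oc,chw->ohw', w_pw, s)           # U_i: back to C/G channels
    return 1.0 / (1.0 + np.exp(-u))                 # A^L_i, each entry in (0, 1)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))                # one group with C/G = 8
a_local = local_attention(x, rng)
print(a_local.shape)                                # (8, 16, 16)
```

Because every path is depthwise or pointwise, the parameter and FLOP cost stays proportional to the group width rather than the full channel count.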

Global cross-spatial attention:

\begin{align*}
s_i &= \frac{1}{C/G} \sum_{k=1}^{C/G} X_i[k,:,:] \in \mathbb{R}^{H \times W} \\
a &= \mathrm{Softmax}_H\big(\mathrm{mean}_W(s_i)\big) \in \mathbb{R}^{H \times 1} \\
b &= \mathrm{Softmax}_W\big(\mathrm{mean}_H(s_i)\big) \in \mathbb{R}^{1 \times W} \\
M &= a \cdot b \\
A^G_i &= \sigma(M) \quad \text{(broadcast over channels)}
\end{align*}

Re-weighted output:

Y_i = X_i \odot A^L_i + X_i \odot A^G_i

where $\odot$ denotes elementwise multiplication.
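The global branch and the re-weighting step can be sketched in NumPy as below; the row/column softmax is read here as pooling the channel-averaged map along each axis before the softmax, so that the outer product reproduces an $H \times W$ map (an interpretive assumption, since papers differ in this detail):

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(v - v.max())
    return e / e.sum()

def global_attention(x_i):
    """Cross-spatial branch: channel average, per-axis softmax, outer product."""
    s = x_i.mean(axis=0)                 # s_i: H x W channel-averaged map
    a = softmax(s.mean(axis=1))          # length-H mask (softmax over rows)
    b = softmax(s.mean(axis=0))          # length-W mask (softmax over columns)
    m = np.outer(a, b)                   # M: H x W global attention map
    return 1.0 / (1.0 + np.exp(-m))      # A^G_i, broadcast over channels

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))     # one group with C/G = 8
a_g = global_attention(x)
# Re-weighted output; A^L is stubbed as ones here for brevity:
a_l = np.ones_like(x)
y = x * a_l + x * a_g                    # Y_i = X_i ⊙ A^L_i + X_i ⊙ A^G_i
print(y.shape)                           # (8, 16, 16)
```

The outer product of the two 1-D masks is what makes the map sensitive to object position along both axes at negligible cost.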

Final aggregation:

Y = \mathrm{Concat}(Y_1, \ldots, Y_G) \in \mathbb{R}^{C \times H \times W}

Algorithmic flows for transformer-based IEMA (Xie et al., 16 Oct 2025) additionally include cross-layer attention score computation and partitioned self-attention for complexity reduction.

3. Integration in Vision and Detection Frameworks

IEMA modules are typically inserted at two key locations:

  • Backbone blocks: Following feature extraction units to enhance early/mid-level representations, especially after multi-scale context aggregation modules or residual modules.
  • Neck and fusion layers: Before upsampling/downsampling or feature concatenation, enabling the module to filter and recalibrate features before multi-scale fusion.

In MASF-YOLO (Lu et al., 25 Apr 2025), IEMA is used after every MFAM (Multi-scale Feature Aggregation Module) in the backbone and before every fusion in the neck, thereby establishing dense, scale-aware re-weighting at all hierarchy levels. In CFSAM for SSD300 (Xie et al., 16 Oct 2025), an analogous self-attention module operates across all pyramid levels immediately before the prediction heads, with an explicit transformer partition/fusion mechanism for cross-scale context modeling.

4. Computational Efficiency and Complexity Analysis

IEMA’s design prioritizes minimal computational overhead:

  • Parameter Growth: Adding IEMA typically incurs $<0.05$M parameters per insertion (MASF-YOLO IEMA: $+0.01$M vs $0.05$M for vanilla EMA) (Lu et al., 25 Apr 2025, Ouyang et al., 2023).
  • FLOPs: Empirically measured increases are negligible ($<0.1$ GFLOPs per module), due to grouped and depthwise convolutions and the avoidance of large dense matrix multiplications.
  • Scalability: When IEMA variants are used in high-resolution contexts with windowing or group splitting (e.g., Atlas (Agrawal et al., 16 Mar 2025)), the per-layer cost is $O(N \log N)$, versus $O(N^2)$ for global self-attention, with $N$ the number of tokens/spatial positions.
  • Memory footprint: For efficient inference in LLMs, mapping attention heads between scales with IEMA-style techniques reduces KV cache usage by 22.1% and accelerates prefill by 15% (Zhao et al., 16 Jul 2025).
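As a back-of-envelope check on the parameter claim, the following computes the learnable weights one insertion adds under the branch layout of Section 2; the channel width and group count are illustrative assumptions, not values from any cited paper:

```python
# Back-of-envelope parameter count for one IEMA insertion (illustrative sizes).
C, G = 256, 8                        # channel width and group count, hypothetical
cg = C // G                          # channels per group
dw = cg * (3 * 3 + 1 * 5 + 5 * 1)    # depthwise kernels: 3x3, 1x5, 5x1 per channel
pw = (4 * cg) * cg                   # 1x1 fusion conv: 4*(C/G) in -> C/G out
total = G * (dw + pw)
print(total)                         # 37632, i.e. ~0.04M, under the 0.05M figure
```

Because each convolution sees only a $C/G$-wide slice, the count scales with $C^2/G$ for the pointwise fusion rather than $C^2$, which is where the savings over a full-channel module come from.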

5. Empirical Gains and Application Scenarios

Detection and segmentation:

Classification and large-scale vision:

  • In studies using ResNet-50/101 and various mobile networks, IEMA routinely provides +1.6% to +4% Top-1 accuracy gains over baselines and other attention modules such as SE, CBAM, and CA (Ouyang et al., 2023).
  • On Atlas (Agrawal et al., 16 Mar 2025), multi-scale attention blocks deliver up to $4.3\times$ throughput gains and strong Top-1 accuracy at very high resolutions.

Efficient LLM inference:

  • The IAM-based IEMA mapping approach reduces compute ($-15\%$ prefill time) and KV cache ($-22.1\%$) with negligible accuracy loss (within a $0.01$ log-perplexity gap at a 30% head-mapping ratio) across heterogeneous model families (Zhao et al., 16 Jul 2025).

6. Design Optimizations and Generalization

IEMA variants include the following optimizations and extensions:

  • Grouped and depthwise convolutions: To limit cost, all parallel convolutions operate within the group rather than across the full channel axis.
  • Channel and spatial dimension normalization: Combining instance/batch normalization, SiLU/LeakyReLU activations, and per-axis softmax facilitates distributed attention without dense computation.
  • Attention mapping and cross-modal scalability: For LLMs and large-scale ViTs, IEMA principles extend to mapping attention scores from smaller to larger models, as well as cross-stage and cross-layer fusions (Zhao et al., 16 Jul 2025, Xie et al., 16 Oct 2025).
  • Ablation evidence: Disabling cross-spatial or multi-path branches consistently degrades accuracy by up to 2.5–3% absolute, underscoring the necessity of both local and global pathways.

7. Relationship to Adjacent Multi-Scale Attention Modules

IEMA is situated among several related modules:

  • EMA (Efficient Multi-scale Attention): The baseline, with two-branch local/global attention per group and cross-spatial learning (Ouyang et al., 2023).
  • MSCSA (Multi-Stage Cross-Scale Attention): Stage-level fusions over pooled features, cross-scale dot-product attention, and intra-stage feedforward design (Shang et al., 2023).
  • CFSAM, MSA, MSAM, IAM: Further extensions, which replace or augment local convs with transformers, spectral kernels, or cross-scale attention mapping.

These modules are generally plug-compatible, with IEMA representing a refined, parameter-efficient, and generalizable variant specifically validated for detection, segmentation, and LLM acceleration across hardware regimes (Agrawal et al., 16 Mar 2025, Shang et al., 2023, Xie et al., 16 Oct 2025, Zhao et al., 16 Jul 2025).

