IEMA: Efficient Multi-Scale Attention Module
- Studies show that IEMA significantly improves feature recalibration through multi-branch local and global attention, boosting small-object detection accuracy.
- IEMA’s architecture splits feature maps into groups and applies parallel depthwise convolutions for diverse receptive field extraction while minimizing computational cost.
- Empirical results demonstrate that IEMA enhances detection, segmentation, and classification, and accelerates LLM inference, with negligible parameter growth.
An Improved Efficient Multi-scale Attention Module (IEMA) is an architectural enhancement for neural networks, particularly in vision and object detection, designed to efficiently capture dependencies across multiple spatial and channel scales with minimal computational overhead. IEMA builds directly on the foundations of Efficient Multi-Scale Attention (EMA) modules, introducing multi-branch parallelism, cross-spatial/global attention mechanisms, and refined attention mapping between different model scales. Its goal is to amplify essential semantic features, especially for small object detection and recognition in challenging multimodal or long-range contexts, while controlling parameter growth and FLOP count. Several studies have independently converged on IEMA designs for use in detection backbones, segmentation, classification, and efficient inference in both vision and LLMs (Ouyang et al., 2023, Lu et al., 25 Apr 2025, Xie et al., 16 Oct 2025, Zhao et al., 16 Jul 2025, Agrawal et al., 16 Mar 2025, Shah et al., 23 Jun 2025, Shang et al., 2023).
1. Architectural Principles and Variants
IEMA generalizes the attention mechanism to act both within and across scales and feature groups. The canonical structure includes:
- Feature Grouping: The input tensor is split along the channel dimension into groups; each group is processed independently, enabling parallel, lightweight operations.
- Parallel Multi-scale Local Attention: Within each group, parallel depthwise separable convolutions with varying kernel sizes, plus an identity path, extract features across a diversity of receptive fields. The outputs are concatenated and fused by a pointwise convolution and a squashing function (often Sigmoid) to produce a per-group attention map.
- Cross-Spatial (Global) Attention: For each group, global context is modeled by channelwise averaging to produce a map. This is separated via softmax operations over rows and columns to yield spatial masks, which are recombined through matrix multiplications to produce global attention maps sensitive to object shape and position.
- Re-weighting and Aggregation: The local and global attention maps modulate each group’s feature slice, and outputs are summed. The re-weighted groups are concatenated to reconstruct the full channel dimension.
Variant IEMA structures include spectral-domain convolutions for hyperspectral segmentation (Shah et al., 23 Jun 2025), Transformer-based global modeling interfacing multi-scale feature sets (Xie et al., 16 Oct 2025), and attention-matrix mapping across model scales for LLM inference acceleration (Zhao et al., 16 Jul 2025). These variants inherit the core paradigm of leveraging multi-scale and cross-location context.
2. Mathematical Formulation and Algorithmic Flow
For a group-wise IEMA as presented in MASF-YOLO (Lu et al., 25 Apr 2025):
Given $X \in \mathbb{R}^{C \times H \times W}$, and letting $G$ be the number of groups, decompose $X$ into $X_g \in \mathbb{R}^{(C/G) \times H \times W}$ for $g = 1, \dots, G$.
Local attention (for each $X_g$):
$$A^{\mathrm{loc}}_g = \sigma\big(\mathrm{PWConv}\big(\mathrm{Concat}\big[\mathrm{DWConv}_{k_1}(X_g), \dots, \mathrm{DWConv}_{k_m}(X_g),\ X_g\big]\big)\big),$$
where $\sigma$ is a sigmoid function, $\mathrm{DWConv}_{k_i}$ is a depthwise convolution with kernel size $k_i$, and $\mathrm{PWConv}$ is the pointwise fusion convolution.
Global cross-spatial attention: with $M_g = \frac{G}{C}\sum_{c} X_g[c,:,:] \in \mathbb{R}^{H \times W}$ the channelwise average,
$$\mathbf{r}_g = \mathrm{Softmax}\Big(\tfrac{1}{W}\textstyle\sum_{w} M_g[:,w]\Big) \in \mathbb{R}^{H}, \qquad \mathbf{c}_g = \mathrm{Softmax}\Big(\tfrac{1}{H}\textstyle\sum_{h} M_g[h,:]\Big) \in \mathbb{R}^{W},$$
$$A^{\mathrm{glo}}_g = \mathbf{r}_g\,\mathbf{c}_g^{\top} \in \mathbb{R}^{H \times W}.$$
Re-weighted output:
$$\tilde{X}_g = A^{\mathrm{loc}}_g \odot X_g + A^{\mathrm{glo}}_g \odot X_g,$$
where $\odot$ denotes elementwise multiplication (the global map broadcast across the group's channels).
Final aggregation:
$$\tilde{X} = \mathrm{Concat}\big[\tilde{X}_1, \dots, \tilde{X}_G\big] \in \mathbb{R}^{C \times H \times W}.$$
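The group-wise flow of this section can be sketched in NumPy. This is a minimal illustration, not the MASF-YOLO implementation: the multi-kernel depthwise/pointwise convolutions are replaced by simple stand-in filters, and the outer-product global map follows the channelwise-average/softmax description above; all function names are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def iema_group(xg):
    """Local + global attention for one channel group xg of shape (c, H, W)."""
    # Local branch: stand-in for parallel depthwise convs fused by a
    # pointwise conv -- here a channel-mean filter plus identity, squashed.
    local = sigmoid(0.5 * xg + 0.5 * xg.mean(axis=0, keepdims=True))
    # Global branch: channelwise average, row/column softmax masks,
    # recombined by outer product into an (H, W) attention map.
    m = xg.mean(axis=0)              # (H, W) channelwise average
    r = softmax(m.mean(axis=1))      # (H,) row mask
    c = softmax(m.mean(axis=0))      # (W,) column mask
    glob = np.outer(r, c)            # (H, W), broadcast over channels
    # Re-weight via both pathways and sum.
    return local * xg + glob * xg

def iema(x, groups=4):
    """Split (C, H, W) into groups, attend per group, concatenate back."""
    chunks = np.split(x, groups, axis=0)
    return np.concatenate([iema_group(g) for g in chunks], axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8, 8))
y = iema(x)
print(y.shape)  # (16, 8, 8): full channel dimension reconstructed
```

Note that the output shape matches the input, which is what makes the module a drop-in recalibration block.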
Algorithmic flows for transformer-based IEMA (Xie et al., 16 Oct 2025) additionally include cross-layer attention score computation and partitioned self-attention for complexity reduction.
3. Integration in Vision and Detection Frameworks
IEMA modules are typically inserted at two key locations:
- Backbone blocks: Following feature extraction units to enhance early/mid-level representations, especially after multi-scale context aggregation modules or residual modules.
- Neck and fusion layers: Before upsampling/downsampling or feature concatenation, enabling the module to filter and recalibrate features before multi-scale fusion.
In MASF-YOLO (Lu et al., 25 Apr 2025), IEMA is used after every MFAM (Multi-scale Feature Aggregation Module) in the backbone and before every fusion in the neck, thereby establishing dense, scale-aware re-weighting at all hierarchy levels. In CFSAM for SSD300 (Xie et al., 16 Oct 2025), an analogous self-attention module operates across all pyramid levels immediately before the prediction heads, with an explicit transformer partition/fusion mechanism for cross-scale context modeling.
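The insertion points can be shown schematically; the function names below are illustrative stand-ins (not MASF-YOLO's API), and the trace simply records the order in which modules fire.

```python
# Structural sketch: IEMA-style attention follows each backbone
# aggregation module (MFAM) and precedes each neck fusion.
trace = []

def mfam(x):          # stand-in multi-scale feature aggregation module
    trace.append("MFAM")
    return x

def iema(x):          # stand-in attention recalibration (identity here)
    trace.append("IEMA")
    return x

def fuse(a, b):       # stand-in neck fusion (concatenation)
    trace.append("fuse")
    return a + b

# Backbone: IEMA after every MFAM.
p3 = iema(mfam([1, 2]))
p4 = iema(mfam([3, 4]))
# Neck: IEMA recalibrates each branch immediately before fusion.
neck = fuse(iema(p3), iema(p4))
print(trace)  # ['MFAM', 'IEMA', 'MFAM', 'IEMA', 'IEMA', 'IEMA', 'fuse']
```

The point of the ordering is that features are recalibrated both right after extraction and right before cross-scale mixing.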
4. Computational Efficiency and Complexity Analysis
IEMA’s design prioritizes minimal computational overhead:
- Parameter Growth: Adding IEMA typically incurs well under a million additional parameters per insertion; for reference, the vanilla EMA module it refines adds roughly $0.05$M (Lu et al., 25 Apr 2025, Ouyang et al., 2023).
- FLOPs: Empirically measured increases in GFLOPs per module are negligible, owing to the use of grouped and depthwise convolutions and the avoidance of large dense matrix multiplications.
- Scalability: When IEMA variants are used in high-resolution contexts with windowing or group splitting (e.g., Atlas (Agrawal et al., 16 Mar 2025)), the overall per-layer cost grows linearly in $N$ for a fixed window size, compared to $O(N^2)$ for global self-attention, with $N$ the number of tokens/spatial positions.
- Memory footprint: For efficient inference in LLMs, mapping attention heads between scales with IEMA-style techniques reduces KV cache usage by 22.1% and accelerates prefill by 15% (Zhao et al., 16 Jul 2025).
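The scalability claim reduces to simple arithmetic. The sketch below compares back-of-envelope operation counts for global versus windowed attention; the window size and the omission of head-count and feature-dimension constants are assumptions for illustration only.

```python
# Back-of-envelope attention cost: global self-attention scales as N^2,
# windowed/grouped attention as N * w for a fixed window size w.
# Constants (heads, feature dim) are deliberately dropped.
def global_attn_ops(n):
    return n * n

def windowed_attn_ops(n, w=256):
    return n * w

for n in (1_024, 16_384, 262_144):
    # The advantage of windowing grows linearly with N: n/w.
    print(n, global_attn_ops(n) / windowed_attn_ops(n))
```

At 262,144 positions (a 512x512 map), the windowed form is three orders of magnitude cheaper under this accounting, which is why the per-module cost stays negligible at high resolution.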
5. Empirical Gains and Application Scenarios
Detection and segmentation:
- IEMA in MASF-YOLO (Lu et al., 25 Apr 2025) yields a 0.5-point gain in mAP@0.5 (from 48.3 to 48.8) and a corresponding improvement in mAP@0.5:0.95, with negligible model-size increase.
- In UNet-MSAM (a spectral/1D variant), employing IEMA-like modules in skip connections improves both mean IoU and mF1, with only a marginal increase in parameters (Shah et al., 23 Jun 2025).
Classification and large-scale vision:
- In studies using ResNet-50/101 and various mobile networks, IEMA routinely provides +1.6% to +4% Top-1 accuracy bumps over baselines and other attention modules, outperforming SE, CBAM, CA, etc. (Ouyang et al., 2023).
- On Atlas (Agrawal et al., 16 Mar 2025), multi-scale attention blocks deliver substantial throughput gains and strong Top-1 accuracy at very high resolutions.
Efficient LLM inference:
- The IAM approach of mapping attention between model scales shows reduced prefill compute (about 15% faster prefill), a 22.1% KV cache cut, and negligible accuracy loss (within a $0.01$ log-perplexity gap at a 30% head-mapping ratio) across heterogeneous model families (Zhao et al., 16 Jul 2025).
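The memory saving can be reasoned about with a naive KV-cache size model. This is a hedged arithmetic sketch, not the IAM paper's accounting: the layer/head/dimension values are illustrative, and real savings (e.g., the reported 22.1%) depend on which layers and heads are mapped.

```python
# Naive KV-cache size model: keys + values stored for every
# layer, head, head dimension, and sequence position.
def kv_cache_bytes(layers, heads, head_dim, seq_len, dtype_bytes=2):
    return 2 * layers * heads * head_dim * seq_len * dtype_bytes

# Full cache for an illustrative 32-layer, 32-head model at fp16.
full = kv_cache_bytes(layers=32, heads=32, head_dim=128, seq_len=4096)
# Map 30% of heads to a smaller model's attention so their KV entries
# need not be materialized (naive per-head accounting).
mapped = kv_cache_bytes(layers=32, heads=int(32 * 0.7), head_dim=128,
                        seq_len=4096)
print(1 - mapped / full)  # 0.3125 savings under this naive accounting
```

Under this crude model, a 30% head-mapping ratio yields roughly 31% cache savings; the reported 22.1% is lower because mapping is applied selectively rather than uniformly.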
6. Design Optimizations and Generalization
IEMA variants include the following optimizations and extensions:
- Grouped and depthwise convolutions: To limit cost, all parallel convolutions operate within the group rather than across the full channel axis.
- Channel and spatial dimension normalization: Combining instance/batch normalization, SiLU/LeakyReLU activations, and per-axis softmax facilitates distributed attention without dense compute.
- Attention mapping and cross-modal scalability: For LLMs and large-scale ViTs, IEMA principles extend to mapping attention scores from smaller to larger models, as well as cross-stage and cross-layer fusions (Zhao et al., 16 Jul 2025, Xie et al., 16 Oct 2025).
- Ablation evidence: Disabling cross-spatial or multi-path branches consistently degrades accuracy by up to 2.5–3% absolute, underscoring the necessity of both local and global pathways.
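The cost argument for the grouped/depthwise design is simple parameter arithmetic; the channel and kernel values below are illustrative, not taken from any of the cited papers.

```python
# Parameter counts for a k x k convolution over C channels:
# dense (full channel mixing) vs depthwise (one filter per channel),
# showing why IEMA's parallel branches stay cheap.
def dense_conv_params(c_in, c_out, k):
    return c_in * c_out * k * k

def depthwise_conv_params(c, k):
    return c * k * k  # one k x k filter per channel, no channel mixing

c, k = 256, 3
print(dense_conv_params(c, c, k))   # 589824
print(depthwise_conv_params(c, k))  # 2304
```

A 256x improvement here is why several parallel depthwise branches plus one pointwise fusion remain cheaper than a single dense convolution.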
7. Relationship to Adjacent Multi-Scale Attention Modules
IEMA is situated among several related modules:
- EMA (Efficient Multi-scale Attention): The baseline, with two-branch local/global attention per group and cross-spatial learning (Ouyang et al., 2023).
- MSCSA (Multi-Stage Cross-Scale Attention): Stage-level fusions over pooled features, cross-scale dot-product attention, and intra-stage feedforward design (Shang et al., 2023).
- CFSAM, MSA, MSAM, IAM: Further extensions, which replace or augment local convs with transformers, spectral kernels, or cross-scale attention mapping.
These modules are generally plug-compatible, with IEMA representing a refined, parameter-efficient, and generalizable variant specifically validated for detection, segmentation, and LLM acceleration across hardware regimes (Agrawal et al., 16 Mar 2025, Shang et al., 2023, Xie et al., 16 Oct 2025, Zhao et al., 16 Jul 2025).
References:
- (Ouyang et al., 2023) Efficient Multi-Scale Attention Module with Cross-Spatial Learning
- (Lu et al., 25 Apr 2025) MASF-YOLO: An Improved YOLOv11 Network for Small Object Detection on Drone View
- (Zhao et al., 16 Jul 2025) IAM: Efficient Inference through Attention Mapping between Different-scale LLMs
- (Agrawal et al., 16 Mar 2025) Atlas: Multi-Scale Attention Improves Long Context Image Modeling
- (Shah et al., 23 Jun 2025) Multi-Scale Spectral Attention Module-based Hyperspectral Segmentation in Autonomous Driving Scenarios
- (Shang et al., 2023) Vision Backbone Enhancement via Multi-Stage Cross-Scale Attention
- (Xie et al., 16 Oct 2025) Cross-Layer Feature Self-Attention Module for Multi-Scale Object Detection