Local–Global Dual Attention

Updated 7 February 2026

Local–Global Dual Attention is a neural mechanism that combines spatially localized feature extraction with global context aggregation for improved recognition.
It employs parallel local convolutions and global self-attention to balance fine feature discrimination with holistic semantic integration.
It adapts to various architectures by fusing multi-scale features using learnable weights, enhancing accuracy with minimal computational cost.

Local–Global Dual Attention (LGA) refers to a class of neural attention mechanisms that explicitly integrate both fine-grained, spatially local information and broad, long-range contextual signals within a unified architectural module. Unlike attention or convolutional designs that emphasize either local or global dependencies alone, LGA modules are designed to simultaneously optimize for detailed discrimination and holistic scene understanding. These mechanisms have been instantiated across deep learning domains, including computer vision, language, speech, time series, and multi-modal recognition tasks.

1. Conceptual Basis and Motivation

Local–Global Dual Attention mechanisms are predicated on the observation that many pattern recognition tasks—particularly object detection, scene parsing, and recognition under challenging conditions—require models to reason about both spatially restricted, high-resolution features (for object boundaries, textures, or fine details) and global context (for semantic disambiguation, long-range dependencies, or contextual modulation).

Conventional architectures often suffer from a trade-off: convolutional or local windowed attention captures short-range interactions but misses broader context, while global self-attention is computationally expensive and can dilute discriminative local cues. LGA mechanisms mitigate this by parallelizing or fusing both modes of attention within each processing stage, balancing their influence through adaptive weighting schemes (Shao, 2024, Yu et al., 2024, Song et al., 2021, Lou et al., 2023).
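The cost side of this trade-off can be made concrete with a back-of-the-envelope count of query-key pairs. The sketch below compares global self-attention with attention restricted to non-overlapping windows; the function names and the 56×56 map with 7×7 windows are illustrative assumptions, not values taken from the cited works.

```python
def global_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs when every location attends to every location."""
    n = height * width
    return n * n


def windowed_attention_pairs(height: int, width: int, window: int) -> int:
    """Query-key pairs when attention stays inside non-overlapping
    window x window blocks (the map must tile evenly into windows)."""
    if height % window or width % window:
        raise ValueError("feature map must tile evenly into windows")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window ** 2


# Hypothetical sizes chosen only for illustration.
h, w, win = 56, 56, 7
global_pairs = global_attention_pairs(h, w)
local_pairs = windowed_attention_pairs(h, w, win)
print(global_pairs, local_pairs, global_pairs // local_pairs)
```

Under these assumed sizes the global variant evaluates 64 times as many pairs as the windowed one, and the ratio grows with the number of windows, which is why many designs pair a cheap local branch with an approximated global branch.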

2. Canonical Architectures and Mathematical Formulation

A prototypical Local–Global Dual Attention module takes as input a multi-channel feature map $\mathbf{X}\in\mathbb{R}^{B\times D\times H\times W}$ and produces a refined feature map with joint local–global context. The steps generally include:

Local Pathway: Generates multi-scale features using depthwise convolutions with small kernels ($k\in\{3,5,7\}$), possibly across several heads. Attention weights per scale are predicted (e.g., via $1\times1$ convolutions and softmax), yielding fused multi-scale representations. Optionally, positional encodings are injected to modulate the local representation spatially.
Global Pathway: Employs large-kernel convolution or global self-attention with an approximately full receptive field to aggregate contextual information, often via efficient approximations such as dynamic token mixers (Lou et al., 2023) or pooling/aggregation strategies (Song et al., 2021, Sheynin et al., 2021). Queries, keys, and values are computed for every spatial location, and global dot-product attention is performed, typically with bias terms or position encodings to preserve geometric awareness.
Adaptive Fusion: Local and global outputs are fused using learnable scalar weights or per-channel vectors ($\alpha_{\mathrm{local}}, \alpha_{\mathrm{global}}$) that are dynamically updated alongside network parameters (Shao, 2024). This allows data-driven adjustment of the importance of local versus global features on a per-task or per-dataset basis.
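The two pathways can be sketched in NumPy for a single feature map of shape (D, H, W). This is a minimal illustration under simplifying assumptions (one head, no bias terms, no positional encodings, random rather than trained weights); the function names `depthwise_conv`, `local_pathway`, and `global_pathway` are hypothetical and do not come from any cited implementation.

```python
import numpy as np


def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def depthwise_conv(x, kernels):
    """Same-padded depthwise convolution. x: (D, H, W); kernels: (D, k, k)."""
    _, h, w = x.shape
    k = kernels.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            # Each channel is filtered only by its own kernel.
            out += kernels[:, i, j, None, None] * xp[:, i:i + h, j:j + w]
    return out


def local_pathway(x, kernels_by_scale, scale_proj):
    """Multi-scale depthwise features fused by per-pixel softmax weights.

    scale_proj: (S, D) matrix acting as a 1x1 convolution that predicts
    one logit per scale at every spatial location.
    """
    feats = np.stack([depthwise_conv(x, k) for k in kernels_by_scale])  # (S, D, H, W)
    logits = np.einsum("sd,dhw->shw", scale_proj, x)                     # (S, H, W)
    weights = softmax(logits, axis=0)                                    # sums to 1 over scales
    return (weights[:, None] * feats).sum(axis=0)                        # (D, H, W)


def global_pathway(x, wq, wk, wv):
    """Dot-product self-attention over all H*W locations. wq, wk, wv: (D, D)."""
    d, h, w = x.shape
    tokens = x.reshape(d, h * w).T                    # (N, D), one token per location
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)     # (N, N), rows sum to 1
    return (attn @ v).T.reshape(d, h, w)


rng = np.random.default_rng(0)
D, H, W = 4, 8, 8  # hypothetical sizes
x = rng.standard_normal((D, H, W))
kernels = [0.1 * rng.standard_normal((D, k, k)) for k in (3, 5, 7)]
local_out = local_pathway(x, kernels, rng.standard_normal((3, D)))
global_out = global_pathway(x, *(rng.standard_normal((D, D)) for _ in range(3)))
print(local_out.shape, global_out.shape)
```

Both outputs keep the input shape, so they can be combined elementwise by whatever fusion rule a given module uses.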

Mathematically, fusion takes the form

$\mathbf{out} = \alpha_{\mathrm{local}}\cdot \mathrm{local\_out} + \alpha_{\mathrm{global}}\cdot \mathrm{global\_out}$

followed by a $1\times1$ convolution for channel compression.
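The fusion rule is simple enough to write out directly. The following NumPy sketch applies the weighted sum and then a $1\times1$ convolution expressed as a channel-mixing matrix; the function name `fuse`, the fixed example values of the weights, and the output width are assumptions for illustration, since in a trained model these would be learned parameters.

```python
import numpy as np


def fuse(local_out, global_out, alpha_local, alpha_global, w_out):
    """Weighted sum of the two branches followed by a 1x1 convolution.

    local_out, global_out: (B, D, H, W)
    alpha_local, alpha_global: scalars or per-channel vectors of shape (D,)
    w_out: (D_out, D) weights of the 1x1 convolution
    """
    # Reshape so scalars and per-channel vectors both broadcast over (B, D, H, W).
    a_l = np.asarray(alpha_local, dtype=float).reshape(1, -1, 1, 1)
    a_g = np.asarray(alpha_global, dtype=float).reshape(1, -1, 1, 1)
    mixed = a_l * local_out + a_g * global_out
    # A 1x1 convolution is a linear map over channels at every location.
    return np.einsum("od,bdhw->bohw", w_out, mixed)


rng = np.random.default_rng(0)
B, D, H, W, D_out = 2, 4, 8, 8, 2  # hypothetical sizes
local_branch = rng.standard_normal((B, D, H, W))
global_branch = rng.standard_normal((B, D, H, W))
fused = fuse(local_branch, global_branch, 0.6, 0.4, rng.standard_normal((D_out, D)))
print(fused.shape)
```

Passing a length-D vector instead of a scalar gives the per-channel variant without changing the code.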

For more structured signals (graphs, language), variants implement dual attention via parallel RNN/Transformer, GNN/self-attention, or windowed/global modules, with concatenation or learnable gating at fusion (Li et al., 2023, Wang et al., 18 Sep 2025, Niu et al., 2023).
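The two fusion styles named here, concatenation and learnable gating, can be sketched for token or node features of shape (N, D). This is a generic formulation assumed for illustration, not the specific rule of any cited model; `concat_fusion`, `gated_fusion`, and their weight arguments are hypothetical names.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def concat_fusion(local_feat, global_feat, w):
    """Concatenate along the feature axis, then project. w: (2 * D, D)."""
    return np.concatenate([local_feat, global_feat], axis=-1) @ w


def gated_fusion(local_feat, global_feat, w_gate, b_gate):
    """Per-feature gate g in (0, 1) computed from both streams:
    out = g * local + (1 - g) * global."""
    both = np.concatenate([local_feat, global_feat], axis=-1)
    g = sigmoid(both @ w_gate + b_gate)
    return g * local_feat + (1.0 - g) * global_feat


rng = np.random.default_rng(0)
N, D = 5, 4  # hypothetical sizes
local_feat = rng.standard_normal((N, D))
global_feat = rng.standard_normal((N, D))
gated = gated_fusion(local_feat, global_feat,
                     rng.standard_normal((2 * D, D)), np.zeros(D))
print(gated.shape)
```

Gating keeps the output a convex combination of the two streams at every feature, whereas concatenation followed by projection lets the model mix them freely.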

3. Instantiations Across Modalities and Networks

LGA designs are highly modular and have been adapted to diverse domains:

CNN Backbones: LGA modules are interleaved after major feature-extraction blocks (e.g., in MobileNetV3, ResNet18, YOLOv8) for detection and classification tasks, providing consistent gains in mAP and classification accuracy with negligible computational overhead (Shao, 2024, Song et al., 2022, Muhammad et al., 5 Jun 2025).
Transformer-based Backbones: Window-based (local) and global or cross-window (global) attention are composed sequentially or in parallel, as in DaViT (Ding et al., 2022), Focal Transformer (Yang et al., 2021), and Locally Shifted Attention (Sheynin et al., 2021). Channel and spatial attention complement patch-level and channel-wise mixing.
Graph Transformers: LGA appears as shallow–global/deep–local interleaving. The G2LFormer architecture demonstrates that a single global self-attention layer followed by several local GNN layers, with cross-layer fusion, outperforms parallel or local-to-global layouts (Wang et al., 18 Sep 2025).
Time Series, BCI, Multimodal: LGA is realized through modules such as Squeeze-Excitation Window Attention (local) and sparse or shifted self-attention (global) (Farahani et al., 2023), or in BCI via anatomical-region local encoders with regional/global transformers (Wang et al., 25 Aug 2025).
Specialized Tasks: Infrared small target detection (Zuo et al., 25 Sep 2025), RGB-D saliency (Yi et al., 3 Jan 2025), and fine-grained categorization (Zhu et al., 2022) incorporate LGA in dual-path, cross-modal, or cross-attention paradigms.

4. Comparative Effectiveness and Empirical Evidence

Reported benchmarks support the empirical utility of LGA. On object detection and image classification tasks, LGA outperforms local-only (LA) and global-only (GA) variants as well as popular attention baselines (MHSA, SE, CBAM, ECA) by 0.1–0.7 in mAP50 and mAP50–95 with almost no increase in FLOPs or parameters (Shao, 2024). In face recognition, feature-norm–weighted fusion of local (MHMS) and global (GFE) streams yields up to +0.65% accuracy on verification and +6.28% rank-1 on low-resolution retrieval benchmarks (Yu et al., 2024).
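One plausible reading of feature-norm-weighted fusion is that each stream's embedding is weighted by a softmax over the two L2 norms, so the stream with the larger norm contributes more. The sketch below implements only that reading; it is an assumption made for illustration and is not claimed to match the exact rule in Yu et al. (2024). The function name is hypothetical.

```python
import numpy as np


def norm_weighted_fusion(local_feat, global_feat):
    """Fuse two (B, D) embeddings with softmax weights over their L2 norms.

    Returns the fused embeddings and the (B, 2) weights that were used.
    """
    norms = np.stack([np.linalg.norm(local_feat, axis=-1),
                      np.linalg.norm(global_feat, axis=-1)], axis=-1)  # (B, 2)
    shifted = norms - norms.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(shifted)
    weights /= weights.sum(axis=-1, keepdims=True)
    fused = weights[:, :1] * local_feat + weights[:, 1:] * global_feat
    return fused, weights


rng = np.random.default_rng(0)
local_emb = rng.standard_normal((3, 8))      # hypothetical batch of 3 embeddings
global_emb = 2.0 * rng.standard_normal((3, 8))
fused_emb, fusion_weights = norm_weighted_fusion(local_emb, global_emb)
print(fused_emb.shape, fusion_weights.sum(axis=-1))
```

The weights are computed per sample, so a degraded input that produces a weak embedding in one stream is automatically down-weighted in favor of the other.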

Ablation studies consistently show that removing either the local or global stream appreciably degrades performance. For instance, in DENet (Zuo et al., 25 Sep 2025), disabling global self-attention reduces mIoU from 83.96% to 76.85%, while removing local self-attention reduces it to 78.41%, indicating that the two paths are complementary.
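The size of each drop follows directly from the DENet figures quoted above. The short calculation below uses only those three numbers; the variable names are illustrative.

```python
# mIoU values (percent) as quoted in the text for DENet.
full_model = 83.96
without_global = 76.85
without_local = 78.41

drop_without_global = round(full_model - without_global, 2)
drop_without_local = round(full_model - without_local, 2)

print(f"removing global attention costs {drop_without_global} points")
print(f"removing local attention costs {drop_without_local} points")
```

Removing the global branch costs 7.11 points and removing the local branch costs 5.55 points, so in this ablation the global path contributes somewhat more, but neither can be dropped cheaply.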

In time series classification (DA-Net (Farahani et al., 2023)), the combination of SEWA (local) and SSAW (sparse global) yields state-of-the-art average classification accuracy (72.4%) and the best mean per-class error (1.391), outperforming all single-branch and multi-scale baselines.

5. Structural Variants and Adaptive Mechanisms

LGA implementations differ in the form and sequencing of the dual pathways:

Parallel vs. Sequential: Some modules process local and global attention in parallel with subsequent additive or weighted fusion (Shao, 2024, Yi et al., 3 Jan 2025), while others arrange sequential local→global (or global→local (Wang et al., 18 Sep 2025)) processing with interleaved fusion.
Token/Channel Grouping: Architectures such as DaViT (Ding et al., 2022) and GLAM (Song et al., 2021) perform dual self-attention along spatial windows and channel groups, carefully partitioning tokens to maintain linear complexity while maximizing context integration.
Adaptive Weighting: Many variants use learnable scalar or vector fusion parameters, e.g., $\alpha_{\mathrm{local}}$, $\alpha_{\mathrm{global}}$, or attention-quality–derived softmax weights. These parameters are learned jointly with the rest of the network via backpropagation, so they need no manual tuning (Shao, 2024, Yu et al., 2024).
Positional Encoding: Explicit or learnable positional encodings are injected prior to attention computations, increasing discrimination between similar but spatially distinct features, and improving performance in tasks dependent on absolute position (Shao, 2024, Yang et al., 2021, Nguyen et al., 2024).
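The token/channel grouping idea can be illustrated with a simplified sketch: one function attends among tokens inside each window, the other attends among channels inside each channel group. It assumes a 1-D token sequence, identity query/key/value projections, and a scaling chosen for convenience, so it is written in the spirit of dual spatial/channel attention rather than as a reproduction of DaViT or GLAM. Function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def window_attention(tokens, window):
    """Self-attention among tokens within non-overlapping windows.

    tokens: (N, C) with N divisible by window. The attention matrix per
    window is (window, window), so cost is linear in N for fixed window.
    """
    n, c = tokens.shape
    t = tokens.reshape(n // window, window, c)
    attn = softmax(t @ t.transpose(0, 2, 1) / np.sqrt(c))  # (N/window, window, window)
    return (attn @ t).reshape(n, c)


def channel_group_attention(tokens, groups):
    """Self-attention among channels within each channel group.

    Every channel is described by its values over all N tokens, so each
    output mixes information from the whole sequence. The attention matrix
    per group is (C/groups, C/groups), so cost is again linear in N.
    """
    n, c = tokens.shape
    cg = c // groups
    t = tokens.reshape(n, groups, cg).transpose(1, 0, 2)   # (G, N, Cg)
    attn = softmax(t.transpose(0, 2, 1) @ t / np.sqrt(n))  # (G, Cg, Cg)
    out = t @ attn.transpose(0, 2, 1)                      # (G, N, Cg)
    return out.transpose(1, 0, 2).reshape(n, c)


rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 6))  # hypothetical: 8 tokens, 6 channels
spatial = window_attention(tokens, window=4)
channel = channel_group_attention(tokens, groups=2)
sequential = channel_group_attention(window_attention(tokens, 4), 2)
print(spatial.shape, channel.shape, sequential.shape)
```

The last line composes the two sequentially; a parallel design would instead compute both from the same input and fuse the results.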

6. Domain-Specific Principles and Interactions

Across tasks, LGA mechanisms adapt to domain structure:

Multi-modal fusion (RGB-D, BCI): Dual-attention blocks are conditioned to perform cross-modal mutual learning, recalibrating both modality-specific and shared representations by joint spatial and channel-wise attention (Yi et al., 3 Jan 2025, Wang et al., 25 Aug 2025).
Cross-Attention and Bidirectionality: In some networks (DENet (Zuo et al., 25 Sep 2025), DCAL (Zhu et al., 2022)), bidirectional or cross-attention operations allow separate streams (e.g., edge, semantic; local, global) to inform each other via mutual querying before fusion.
Hierarchical Integration: LGA is stacked at multiple depths. Early layers favor local detail, while deeper layers progressively increase the receptive field and rely more on global context (Sheynin et al., 2021, Fan et al., 27 Jul 2025).
Attention-Type Diversity: Systems like GLAM employ all four combinations of local/global and channel/spatial attention through independent but coordinated attention maps, each contributing distinct discrimination and robustness (Song et al., 2021).
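The mutual-querying pattern described under cross-attention and bidirectionality can be sketched as two streams that each attend to the other before fusion. The version below assumes identity projections and a residual connection, both simplifications chosen for illustration; it is not the DENet or DCAL implementation, and the function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, context):
    """Each query token gathers a weighted average of the context tokens.

    queries: (Nq, D); context: (Nc, D). Returns (Nq, D).
    """
    d = queries.shape[-1]
    attn = softmax(queries @ context.T / np.sqrt(d))  # (Nq, Nc), rows sum to 1
    return attn @ context


def bidirectional_cross_attention(stream_a, stream_b):
    """Both streams query each other; a residual keeps their own content."""
    a_new = stream_a + cross_attend(stream_a, stream_b)
    b_new = stream_b + cross_attend(stream_b, stream_a)
    return a_new, b_new


rng = np.random.default_rng(0)
local_tokens = rng.standard_normal((5, 8))   # hypothetical local stream
global_tokens = rng.standard_normal((3, 8))  # hypothetical global stream
local_new, global_new = bidirectional_cross_attention(local_tokens, global_tokens)
print(local_new.shape, global_new.shape)
```

The two streams may have different token counts, since each output inherits the length of the stream that supplies the queries.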

7. Practical Considerations and Impact

Across empirical tasks, LGA adds minimal computational burden (typically +0.01–0.02 million parameters, no increase in FLOPs (Shao, 2024)), with gains especially pronounced in multi-scale and fine-grained detection, degraded image recognition, and scenes with complex spatial or semantic structure. The modularity and “plug-and-play” nature of LGA enable direct integration into existing CNN/Transformer backbones.

Qualitative analysis of attention maps reveals that local branches focus on boundaries and fine structure (object edges, salient regions), while global branches suppress background clutter, enforce semantic coherence, and help distinguish objects that are ambiguous in local appearance alone (Zuo et al., 25 Sep 2025, Nguyen et al., 2024).

In summary, Local–Global Dual Attention has emerged as a unifying architectural paradigm that the cited studies report improves both expressive power and accuracy across a broad range of challenging pattern recognition tasks, which supports its prominent role in current deep learning models (Shao, 2024, Yu et al., 2024, Yang et al., 2021, Lou et al., 2023, Wang et al., 18 Sep 2025, Fan et al., 27 Jul 2025, Zuo et al., 25 Sep 2025, Yi et al., 3 Jan 2025, Ding et al., 2022, Song et al., 2021).