
Grouped Coordinate Attention Module

Updated 5 January 2026
  • Grouped Coordinate Attention (GCA) is an advanced attention mechanism that integrates coordinate-based spatial encoding with channel grouping to model global dependencies efficiently.
  • It generates per-group, axis-specific attention maps to enhance segmentation accuracy, especially for complex and fine-scale structures in medical imaging and computer vision.
  • By partitioning channels and combining average and max pooling, GCA achieves improved semantic delineation with minimal computational overhead compared to traditional methods.

Grouped Coordinate Attention (GCA) is an advanced neural attention mechanism that strategically combines coordinate-based spatial encoding with channel grouping. GCA builds on Coordinate Attention (CA) by embedding fine-grained, direction-aware positional information into spatially distributed channel groups, allowing convolutional backbones to model global dependencies while remaining computationally efficient. GCA produces per-group, axis-wise attention maps, enhancing representation diversity, sensitivity to semantic heterogeneity, and boundary fidelity in high-resolution, multi-organ data, with applications demonstrated in medical image segmentation and efficient computer vision architectures (Hou et al., 2021, Ding et al., 18 Nov 2025, Ding et al., 30 Dec 2025).

1. Coordinate Attention: Foundation and Limitations

Coordinate Attention (CA) extends channel attention by explicitly encoding spatial positional information along two orthogonal directions. Given a feature tensor $X\in\mathbb{R}^{C\times H\times W}$, CA factorizes global pooling into 1D direction-specific pools:

$$z^h_{c}(h) = \frac{1}{W} \sum_{i=1}^{W} X_c(h,i) \in \mathbb{R}^H, \qquad z^w_{c}(w) = \frac{1}{H} \sum_{j=1}^{H} X_c(j,w) \in \mathbb{R}^W.$$

These are concatenated and passed through a shared bottleneck followed by independent $1\times1$ convolutions, yielding axis-specific attention maps $A^h$ and $A^w$. The final output is scaled as

$$Y_c(h,w) = X_c(h,w) \times A^h_c(h) \times A^w_c(w).$$
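The pooling factorization and axis-wise reweighting above can be illustrated with a minimal NumPy sketch. The learned bottleneck and $1\times1$ convolutions are replaced by a bare sigmoid on the pooled features (an assumption made purely to show shapes and broadcasting), so this is not a trained CA module:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coord_attention_sketch(X):
    """Directional pooling and axis-wise reweighting of CA.

    X: (C, H, W). The learned bottleneck/1x1 convs are stubbed out by
    a bare sigmoid on the pooled features, purely to illustrate shape
    flow and broadcasting.
    """
    C, H, W = X.shape
    z_h = X.mean(axis=2)            # (C, H): pool along width
    z_w = X.mean(axis=1)            # (C, W): pool along height
    A_h = sigmoid(z_h)[:, :, None]  # (C, H, 1) attention map
    A_w = sigmoid(z_w)[:, None, :]  # (C, 1, W) attention map
    return X * A_h * A_w            # broadcasts back to (C, H, W)

X = np.random.randn(8, 4, 6)
Y = coord_attention_sketch(X)
print(Y.shape)  # (8, 4, 6)
```

Because both attention maps lie in $(0,1)$, the output is always an element-wise attenuation of the input, which is the intended recalibration behavior.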

CA delivers spatially selective attention at negligible extra cost ($\sim 3C^2/r$ parameters for reduction ratio $r$). However, CA operates uniformly across all channels, limiting its capacity to model heterogeneous semantic cues, which is especially problematic in contexts like multi-organ segmentation or fine-scale structure delineation (Hou et al., 2021).

2. Grouped Coordinate Attention: Mathematical Formulation

GCA addresses CA’s limitations by partitioning the feature tensor into $G$ disjoint channel groups, each processed independently. For $X\in\mathbb{R}^{B\times C\times H\times W}$, GCA splits channels into $G$ groups ($C_g = C/G$), so $X=[X_1,\ldots,X_G]$ with $X_g\in\mathbb{R}^{B\times C_g\times H\times W}$.

Within each group gg:

  • Directional pooling: Both average and max pooling are applied along height and width:

$$f^h_{\mathrm{avg}} = \frac{1}{W} \sum_{j=1}^{W} X_g[:,:,:,j], \qquad f^h_{\mathrm{max}} = \max_{j=1,\ldots,W} X_g[:,:,:,j]$$

$$f^w_{\mathrm{avg}} = \frac{1}{H} \sum_{i=1}^{H} X_g[:,:,i,:], \qquad f^w_{\mathrm{max}} = \max_{i=1,\ldots,H} X_g[:,:,i,:]$$

The results are summed: $f^h = f^h_{\mathrm{avg}} + f^h_{\mathrm{max}}$ and $f^w = f^w_{\mathrm{avg}} + f^w_{\mathrm{max}}$.

  • Bottleneck transformation: Concatenate $f^h$ and $f^w$ along the spatial dimension to form a tensor of shape $\mathbb{R}^{B\times C_g\times (H+W)\times 1}$, and apply a two-stage $1\times1$ convolutional MLP:

$$F = \delta(\mathrm{BN}(\mathrm{Conv}_{1\times1}([f^h; f^w]))) \in \mathbb{R}^{B\times (C_g/r)\times (H+W)\times 1}$$

$$Z = \sigma(\mathrm{BN}(\mathrm{Conv}_{1\times1}(F))) \in \mathbb{R}^{B\times C_g\times (H+W)\times 1}$$

where $\delta = \mathrm{ReLU}$ and $\sigma = \mathrm{Sigmoid}$.

  • Attention maps: Split $Z$ into $A^h \in \mathbb{R}^{B\times C_g\times H\times 1}$ and $A^w \in \mathbb{R}^{B\times C_g\times 1\times W}$.
  • Reweighting: Each group is recalibrated by axis-wise broadcasting:

$$Y_g = X_g \odot A^h \odot A^w$$

Finally, all groups are concatenated along the channel axis to yield $Y\in\mathbb{R}^{B\times C\times H\times W}$ (Hou et al., 2021, Ding et al., 18 Nov 2025, Ding et al., 30 Dec 2025).
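The per-group steps above (directional avg+max pooling, the two-stage bottleneck, the $A^h$/$A^w$ split, and axis-wise reweighting) can be sketched end to end in NumPy. The learned $1\times1$ convolutions are modeled here as random channel-mixing matrices and BN is omitted, so this is a shape-faithful illustration under those assumptions, not a trained module:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gca_sketch(X, G=4, r=4, seed=0):
    """Grouped Coordinate Attention forward pass (shape-level sketch).

    X: (B, C, H, W) with C divisible by G. The two 1x1 convs of the
    bottleneck are stand-in random matrices and BN is omitted -- only
    the data flow matches the formulation above.
    """
    rng = np.random.default_rng(seed)
    B, C, H, W = X.shape
    Cg = C // G
    outs = []
    for g in range(G):
        Xg = X[:, g * Cg:(g + 1) * Cg]                      # (B, Cg, H, W)
        # Directional avg + max pooling, summed
        f_h = Xg.mean(axis=3) + Xg.max(axis=3)              # (B, Cg, H)
        f_w = Xg.mean(axis=2) + Xg.max(axis=2)              # (B, Cg, W)
        f = np.concatenate([f_h, f_w], axis=2)              # (B, Cg, H+W)
        # Two-stage 1x1-conv MLP over channels: Cg -> Cg/r -> Cg
        W1 = rng.standard_normal((Cg // r, Cg)) / np.sqrt(Cg)
        W2 = rng.standard_normal((Cg, Cg // r)) / np.sqrt(Cg // r)
        F = np.maximum(np.einsum('oc,bcl->bol', W1, f), 0)  # ReLU
        Z = sigmoid(np.einsum('co,bol->bcl', W2, F))        # (B, Cg, H+W)
        # Split into axis-specific attention maps and reweight
        A_h = Z[:, :, :H][..., None]                        # (B, Cg, H, 1)
        A_w = Z[:, :, H:][:, :, None, :]                    # (B, Cg, 1, W)
        outs.append(Xg * A_h * A_w)
    return np.concatenate(outs, axis=1)                     # (B, C, H, W)

X = np.random.randn(2, 16, 8, 10)
Y = gca_sketch(X, G=4, r=4)
print(Y.shape)  # (2, 16, 8, 10)
```

Note that each group allocates its own bottleneck weights, which is what gives GCA its per-group attention maps and its $1/G$ parameter saving relative to ungrouped CA.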

3. Computational Characteristics and Scaling

GCA’s parameter and compute overhead depend on $G$ and the reduction ratio $r$. For each group, two $1\times1$ convolutions dominate the cost, each with $C_g\times(C_g/r)$ parameters. Across all groups:

$$\text{Total params} = 2G \times (C_g^2/r) = 2C^2/(Gr)$$

This is $1/G$ the cost of vanilla coordinate attention for fixed $C$ and $r$; FLOPs scale similarly. For $C=256$, $G=4$, $r=16$, a per-block GCA adds $\sim$8k parameters, with a total network overhead of $\lesssim$5%.

Pooling and transform operations are $O(B C_g (H+W))$ per group, as opposed to $O(B C (H+W))$ for non-grouped CA, and vastly less than the $O(B C (HW)^2)$ complexity of full self-attention on images (Hou et al., 2021, Ding et al., 18 Nov 2025, Ding et al., 30 Dec 2025).
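Under the assumption that only the weights of the two $1\times1$ convolutions are counted (biases and BN parameters omitted), the $2C^2/(Gr)$ formula can be checked directly:

```python
# Conv-weight count of the GCA bottleneck, following
# Total = 2G * (Cg^2 / r) = 2C^2 / (G * r).
def gca_bottleneck_params(C, G, r):
    Cg = C // G
    return 2 * G * Cg * (Cg // r)  # two 1x1 convs per group

C, r = 256, 16
print(gca_bottleneck_params(C, G=1, r=r))  # ungrouped (CA-style): 2C^2/r  = 8192
print(gca_bottleneck_params(C, G=4, r=r))  # grouped:              2C^2/(Gr) = 2048
```

The $G{=}4$ figure is exactly a quarter of the ungrouped count, illustrating the $1/G$ scaling stated above.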

4. Empirical Results and Group Size Effects

Ablation studies on multi-organ medical image segmentation benchmarks such as Synapse and ACDC demonstrate that GCA offers superior trade-offs compared to Squeeze-and-Excitation (SE), CBAM, and ungrouped coordinate attention:

  • Synapse (multi-organ): baseline ResNet-UNet 81.08% DSC; +SE 82.4%; +CBAM 83.1%; CoordAtt ($G=1$) 84.3%; GCA ($G=4$) 86.1%.
  • ACDC (cardiac): U-Net 89.68%; GCA-ResUNet 92.64%.

Optimal group size depends on the dataset. For Synapse, $G=4$ yields the highest Dice, but larger $G$ can slightly degrade or plateau segmentation quality, suggesting a trade-off between locality and capacity for cross-channel modeling. GCA consistently improves small-structure recall and boundary delineation beyond previous lightweight attention schemes (Ding et al., 18 Nov 2025, Ding et al., 30 Dec 2025).

5. Integration in Network Architectures

In convolutional backbones, GCA is typically inserted after the last $1\times1$ convolution and batch normalization in a residual or bottleneck block, immediately before the residual addition. This preserves residual learning while infusing per-group, direction-aware enhancement. In practice, the implementation leverages grouped convolutions for efficiency, and hyperparameters ($G$, $r$) are set based on hardware budget and target representational capacity (e.g., $G=2$ or $4$, $r=2$ to $16$) (Hou et al., 2021, Ding et al., 18 Nov 2025, Ding et al., 30 Dec 2025).
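The placement described above can be sketched schematically. The convolutions and batch normalization are stubbed out as identity maps (a deliberate simplification), so only the ordering of the attention module relative to the residual addition is meaningful here:

```python
import numpy as np

def attention_stub(x):
    """Stand-in for a GCA module; a real module would apply per-group
    axis-wise attention maps to recalibrate x."""
    return x * 1.0

def bottleneck_block(x, attention=attention_stub):
    """Placement sketch: attention sits after the final 1x1 conv + BN,
    immediately before the residual addition.  Convs/BN are identity
    stubs here, so the block reduces to out = attention(x) + x."""
    out = x               # conv1x1 -> conv3x3 -> conv1x1 -> BN (elided)
    out = attention(out)  # GCA inserted at this point
    return out + x        # residual addition after recalibration

x = np.random.randn(1, 16, 8, 8)
y = bottleneck_block(x)
print(y.shape)  # (1, 16, 8, 8)
```

With the identity stubs the block simply doubles its input; the point is that any learned attention slots in without disturbing the skip connection.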

GCA is agnostic to convolutional backbone choice and can be ported to ResNet, ResNeXt, DenseNet, and U-Net variants. The module is compatible with standard training protocols and does not require pretraining or extensive data augmentation to realize its benefits. Empirically, adding GCA increases parameters and FLOPs by only 1–5%, with negligible inference speed reduction (e.g., 32 fps baseline vs. 30 fps for GCA networks at $224\times224$ resolution) (Ding et al., 18 Nov 2025, Ding et al., 30 Dec 2025).

6. Theoretical and Practical Implications

GCA’s decoupling of channel-wise context modeling through explicit group decomposition enhances semantic diversity and reduces detrimental interference among heterogeneous anatomical or texture features, a common limitation of unified attention. By combining average and max pooling, GCA captures both coarse and salient local statistics. Its axis-aware encoding mechanism preserves structured horizontal and vertical dependencies, which is critical in tasks requiring precision for small or elongated regions.

Relative to Transformer-based global self-attention, GCA maintains efficiency and avoids quadratic scaling in spatial dimensions, making it practical for high-resolution or resource-constrained deployment. The module’s flexibility and modest parameter footprint enable integration into both encoder and decoder blocks for dense prediction. In scenarios with multi-organ, low-contrast, or boundary-driven targets, GCA has established new benchmarks for segmentation accuracy, particularly excelling in delineating complex or small structures (Ding et al., 18 Nov 2025, Ding et al., 30 Dec 2025).

7. Summary Table: GCA vs. Alternatives

Module                     | Global Context | Parameter Overhead | Best-Reported Synapse DSC
SE (Squeeze-Excitation)    | No             | $2C^2/r$           | 82.4%
CA (Coordinate Attention)  | Yes (unified)  | $3C^2/r$           | 84.3%
GCA ($G=4$)                | Yes (grouped)  | $2C^2/(Gr)$        | 86.1%
Self-attention (image)     | Yes (full)     | $4C^2$             | n/a

SE and CA: as in (Hou et al., 2021); GCA: (Ding et al., 18 Nov 2025, Ding et al., 30 Dec 2025).

GCA establishes a principled approach to fusing fine-grained coordinate encoding with explicit channel grouping, enabling efficient, global, and semantically disentangled attention for high-resolution vision applications. This suggests a promising direction for further research into structured, lightweight attention modules for dense prediction and edge computing deployments.
