
Grouped Coordinate Attention Module

Updated 5 January 2026
  • Grouped Coordinate Attention (GCA) is an advanced attention mechanism that integrates coordinate-based spatial encoding with channel grouping to model global dependencies efficiently.
  • It generates per-group, axis-specific attention maps to enhance segmentation accuracy, especially for complex and fine-scale structures in medical imaging and computer vision.
  • By partitioning channels and combining average and max pooling, GCA achieves improved semantic delineation with minimal computational overhead compared to traditional methods.

Grouped Coordinate Attention (GCA) is an advanced neural attention mechanism that strategically combines coordinate-based spatial encoding with channel grouping. GCA builds on Coordinate Attention (CA) by embedding fine-grained, direction-aware positional information into spatially distributed channel groups, allowing convolutional backbones to model global dependencies while remaining computationally efficient. GCA produces per-group, axis-wise attention maps, enhancing representation diversity, sensitivity to semantic heterogeneity, and boundary fidelity in high-resolution, multi-organ data, with applications demonstrated in medical image segmentation and efficient computer vision architectures (Hou et al., 2021, Ding et al., 18 Nov 2025, Ding et al., 30 Dec 2025).

1. Coordinate Attention: Foundation and Limitations

Coordinate Attention (CA) extends channel attention by explicitly encoding spatial positional information along two orthogonal directions. Given a feature tensor $X\in\mathbb{R}^{C\times H\times W}$, CA factorizes global pooling into 1D direction-specific pools:

$$z^h_{c}(h) = \frac{1}{W} \sum_{i=1}^{W} X_c(h,i) \in \mathbb{R}^H, \qquad z^w_{c}(w) = \frac{1}{H} \sum_{j=1}^{H} X_c(j,w) \in \mathbb{R}^W.$$

These are concatenated and passed through a shared bottleneck followed by independent $1\times1$ convolutions, yielding axis-specific attention maps $A^h$ and $A^w$. The final output is scaled as

$$Y_c(h,w) = X_c(h,w) \times A^h_c(h) \times A^w_c(w).$$
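The pooling factorization and axis-wise reweighting above can be illustrated with a minimal NumPy sketch. The learned bottleneck and $1\times1$ convolutions are replaced by a bare sigmoid on the pooled features (an assumption made purely to show shapes and broadcasting), so this is not a trained CA module:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coord_attention_sketch(X):
    """Directional pooling and axis-wise reweighting of CA.

    X: (C, H, W). The learned bottleneck/1x1 convs are stubbed out by
    a bare sigmoid on the pooled features, purely to illustrate shape
    flow and broadcasting.
    """
    C, H, W = X.shape
    z_h = X.mean(axis=2)            # (C, H): pool along width
    z_w = X.mean(axis=1)            # (C, W): pool along height
    A_h = sigmoid(z_h)[:, :, None]  # (C, H, 1) attention map
    A_w = sigmoid(z_w)[:, None, :]  # (C, 1, W) attention map
    return X * A_h * A_w            # broadcasts back to (C, H, W)

X = np.random.randn(8, 4, 6)
Y = coord_attention_sketch(X)
print(Y.shape)  # (8, 4, 6)
```

Because both attention maps lie in $(0,1)$, the output is always an element-wise attenuation of the input, which is the intended recalibration behavior.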

CA delivers spatially selective attention at negligible extra cost ($\sim 3C^2/r$ parameters for reduction ratio $r$). However, CA operates uniformly across all channels, limiting its capacity to model heterogeneous semantic cues, which is especially problematic in contexts like multi-organ segmentation or fine-scale structure delineation (Hou et al., 2021).

2. Grouped Coordinate Attention: Mathematical Formulation

GCA addresses CA’s limitations by partitioning the feature tensor into $G$ disjoint channel groups, each processed independently. For $X\in\mathbb{R}^{B\times C\times H\times W}$, GCA splits channels into $G$ groups ($C_g = C/G$), so $X=[X_1,\ldots,X_G]$ with $X_g\in\mathbb{R}^{B\times C_g\times H\times W}$.

Within each group gg:

  • Directional pooling: Both average and max pooling are applied along height and width:

$$f^h_{\mathrm{avg}} = \frac{1}{W} \sum_{j=1}^{W} X_g[:,:,:,j], \qquad f^h_{\mathrm{max}} = \max_{j=1,\ldots,W} X_g[:,:,:,j]$$

$$f^w_{\mathrm{avg}} = \frac{1}{H} \sum_{i=1}^{H} X_g[:,:,i,:], \qquad f^w_{\mathrm{max}} = \max_{i=1,\ldots,H} X_g[:,:,i,:]$$

The results are summed: $f^h = f^h_{\mathrm{avg}} + f^h_{\mathrm{max}}$ and $f^w = f^w_{\mathrm{avg}} + f^w_{\mathrm{max}}$.

  • Bottleneck transformation: Concatenate $f^h$ and $f^w$ along the spatial dimension to form a tensor of shape $\mathbb{R}^{B\times C_g\times (H+W)\times 1}$, and apply a two-stage $1\times1$ convolutional MLP:

$$F = \delta(\mathrm{BN}(\mathrm{Conv}_{1\times1}([f^h; f^w]))) \in \mathbb{R}^{B\times (C_g/r)\times (H+W)\times 1}$$

$$Z = \sigma(\mathrm{BN}(\mathrm{Conv}_{1\times1}(F))) \in \mathbb{R}^{B\times C_g\times (H+W)\times 1}$$

where $\delta = \mathrm{ReLU}$ and $\sigma = \mathrm{Sigmoid}$.

  • Attention maps: Split $Z$ into $A^h \in \mathbb{R}^{B\times C_g\times H\times 1}$ and $A^w \in \mathbb{R}^{B\times C_g\times 1\times W}$.
  • Reweighting: Each group is recalibrated by axis-wise broadcasting:

$$Y_g = X_g \odot A^h \odot A^w$$

Finally, all groups are concatenated along the channel axis to yield $Y\in\mathbb{R}^{B\times C\times H\times W}$ (Hou et al., 2021, Ding et al., 18 Nov 2025, Ding et al., 30 Dec 2025).
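The per-group steps above (directional avg+max pooling, the two-stage bottleneck, the $A^h$/$A^w$ split, and axis-wise reweighting) can be sketched end to end in NumPy. The learned $1\times1$ convolutions are modeled here as random channel-mixing matrices and BN is omitted, so this is a shape-faithful illustration under those assumptions, not a trained module:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gca_sketch(X, G=4, r=4, seed=0):
    """Grouped Coordinate Attention forward pass (shape-level sketch).

    X: (B, C, H, W) with C divisible by G. The two 1x1 convs of the
    bottleneck are stand-in random matrices and BN is omitted -- only
    the data flow matches the formulation above.
    """
    rng = np.random.default_rng(seed)
    B, C, H, W = X.shape
    Cg = C // G
    outs = []
    for g in range(G):
        Xg = X[:, g * Cg:(g + 1) * Cg]                      # (B, Cg, H, W)
        # Directional avg + max pooling, summed
        f_h = Xg.mean(axis=3) + Xg.max(axis=3)              # (B, Cg, H)
        f_w = Xg.mean(axis=2) + Xg.max(axis=2)              # (B, Cg, W)
        f = np.concatenate([f_h, f_w], axis=2)              # (B, Cg, H+W)
        # Two-stage 1x1-conv MLP over channels: Cg -> Cg/r -> Cg
        W1 = rng.standard_normal((Cg // r, Cg)) / np.sqrt(Cg)
        W2 = rng.standard_normal((Cg, Cg // r)) / np.sqrt(Cg // r)
        F = np.maximum(np.einsum('oc,bcl->bol', W1, f), 0)  # ReLU
        Z = sigmoid(np.einsum('co,bol->bcl', W2, F))        # (B, Cg, H+W)
        # Split into axis-specific attention maps and reweight
        A_h = Z[:, :, :H][..., None]                        # (B, Cg, H, 1)
        A_w = Z[:, :, H:][:, :, None, :]                    # (B, Cg, 1, W)
        outs.append(Xg * A_h * A_w)
    return np.concatenate(outs, axis=1)                     # (B, C, H, W)

X = np.random.randn(2, 16, 8, 10)
Y = gca_sketch(X, G=4, r=4)
print(Y.shape)  # (2, 16, 8, 10)
```

Note that each group allocates its own bottleneck weights, which is what gives GCA its per-group attention maps and its $1/G$ parameter saving relative to ungrouped CA.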

3. Computational Characteristics and Scaling

GCA’s parameter and compute overhead depend on $G$ and the reduction ratio $r$. For each group, two $1\times1$ convolutions dominate the cost, each with $C_g\times(C_g/r)$ parameters. Across all groups:

$$\text{Total params} = 2G \times (C_g^2/r) = 2C^2/(Gr)$$

This is $1/G$ the cost of vanilla coordinate attention for fixed $C$ and $r$; FLOPs scale similarly. For $C=256$, $G=4$, $r=16$, a per-block GCA adds $\sim$8k parameters, with a total network overhead of $\lesssim$5%.

Pooling and transform operations are $O(B C_g (H+W))$ per group, as opposed to $O(B C (H+W))$ for non-grouped CA, and vastly less than the $O(B C (HW)^2)$ complexity of full self-attention on images (Hou et al., 2021, Ding et al., 18 Nov 2025, Ding et al., 30 Dec 2025).
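Under the assumption that only the weights of the two $1\times1$ convolutions are counted (biases and BN parameters omitted), the $2C^2/(Gr)$ formula can be checked directly:

```python
# Conv-weight count of the GCA bottleneck, following
# Total = 2G * (Cg^2 / r) = 2C^2 / (G * r).
def gca_bottleneck_params(C, G, r):
    Cg = C // G
    return 2 * G * Cg * (Cg // r)  # two 1x1 convs per group

C, r = 256, 16
print(gca_bottleneck_params(C, G=1, r=r))  # ungrouped (CA-style): 2C^2/r  = 8192
print(gca_bottleneck_params(C, G=4, r=r))  # grouped:              2C^2/(Gr) = 2048
```

The $G{=}4$ figure is exactly a quarter of the ungrouped count, illustrating the $1/G$ scaling stated above.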

4. Empirical Results and Group Size Effects

Ablation studies on multi-organ medical image segmentation benchmarks such as Synapse and ACDC demonstrate that GCA offers superior trade-offs compared to Squeeze-and-Excitation (SE), CBAM, and ungrouped coordinate attention:

  • Synapse (multi-organ): baseline ResNet-UNet 81.08% DSC; +SE 82.4%; +CBAM 83.1%; CoordAtt ($G=1$) 84.3%; GCA ($G=4$) 86.1%.
  • ACDC (cardiac): U-Net 89.68%; GCA-ResUNet 92.64%.

Optimal group size depends on the dataset. For Synapse, $G=4$ yields the highest Dice, but larger $G$ can slightly degrade or plateau segmentation quality, suggesting a trade-off between locality and capacity for cross-channel modeling. GCA consistently improves small-structure recall and boundary delineation beyond previous lightweight attention schemes (Ding et al., 18 Nov 2025, Ding et al., 30 Dec 2025).

5. Integration in Network Architectures

In convolutional backbones, GCA is typically inserted after the last $1\times1$ convolution and batch normalization in a residual or bottleneck block, immediately before the residual addition. This preserves residual learning while infusing per-group, direction-aware enhancement. In practice, the implementation leverages grouped convolutions for efficiency, and hyperparameters ($G$, $r$) are set based on hardware budget and target representational capacity (e.g., $G=2$ or $4$, $r=2$ to $16$) (Hou et al., 2021, Ding et al., 18 Nov 2025, Ding et al., 30 Dec 2025).
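The placement described above can be sketched schematically. The convolutions and batch normalization are stubbed out as identity maps (a deliberate simplification), so only the ordering of the attention module relative to the residual addition is meaningful here:

```python
import numpy as np

def attention_stub(x):
    """Stand-in for a GCA module; a real module would apply per-group
    axis-wise attention maps to recalibrate x."""
    return x * 1.0

def bottleneck_block(x, attention=attention_stub):
    """Placement sketch: attention sits after the final 1x1 conv + BN,
    immediately before the residual addition.  Convs/BN are identity
    stubs here, so the block reduces to out = attention(x) + x."""
    out = x               # conv1x1 -> conv3x3 -> conv1x1 -> BN (elided)
    out = attention(out)  # GCA inserted at this point
    return out + x        # residual addition after recalibration

x = np.random.randn(1, 16, 8, 8)
y = bottleneck_block(x)
print(y.shape)  # (1, 16, 8, 8)
```

With the identity stubs the block simply doubles its input; the point is that any learned attention slots in without disturbing the skip connection.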

GCA is agnostic to convolutional backbone choice and can be ported to ResNet, ResNeXt, DenseNet, and U-Net variants. The module is compatible with standard training protocols and does not require pretraining or extensive data augmentation to realize its benefits. Empirically, adding GCA increases parameters and FLOPs by only 1–5%, with negligible inference speed reduction (e.g., 32 fps baseline vs. 30 fps for GCA networks at $224\times224$ resolution) (Ding et al., 18 Nov 2025, Ding et al., 30 Dec 2025).

6. Theoretical and Practical Implications

GCA’s decoupling of channel-wise context modeling through explicit group decomposition enhances semantic diversity and reduces detrimental interference among heterogeneous anatomical or texture features, a common limitation of unified attention. By combining average and max pooling, GCA captures both coarse and salient local statistics. Its axis-aware encoding mechanism preserves structured horizontal and vertical dependencies, which is critical in tasks requiring precision for small or elongated regions.

Relative to Transformer-based global self-attention, GCA maintains efficiency and avoids quadratic scaling in spatial dimensions, making it practical for high-resolution or resource-constrained deployment. The module’s flexibility and modest parameter footprint enable integration into both encoder and decoder blocks for dense prediction. In scenarios with multi-organ, low-contrast, or boundary-driven targets, GCA has established new benchmarks for segmentation accuracy, particularly excelling in delineating complex or small structures (Ding et al., 18 Nov 2025, Ding et al., 30 Dec 2025).

7. Summary Table: GCA vs. Alternatives

Module                     | Global Context | Parameter Overhead | Best-Reported Synapse DSC
SE (Squeeze-Excitation)    | No             | $2C^2/r$           | 82.4%
CA (Coordinate Attention)  | Yes (unified)  | $3C^2/r$           | 84.3%
GCA ($G=4$)                | Yes (grouped)  | $2C^2/(Gr)$        | 86.1%
Self-attention (image)     | Yes (full)     | $4C^2$             | n/a

SE and CA: as in (Hou et al., 2021); GCA: (Ding et al., 18 Nov 2025, Ding et al., 30 Dec 2025).

GCA establishes a principled approach to fusing fine-grained coordinate encoding with explicit channel grouping, enabling efficient, global, and semantically disentangled attention for high-resolution vision applications. This suggests a promising direction for further research into structured, lightweight attention modules for dense prediction and edge computing deployments.
