LawinASPP: Multi-scale Attention for Segmentation
- LawinASPP architecture is a multi-scale decoder for semantic segmentation that combines large window attention with spatial pyramid pooling to capture both local and global context.
- It fuses multi-resolution features from hierarchical Vision Transformers to enhance segmentation accuracy while reducing computational cost.
- The design achieves state-of-the-art mIoU results (e.g., 84.4% on Cityscapes) through adaptive pooling and efficient multi-head attention mechanisms.
LawinASPP architecture is a multi-scale decoder head for semantic segmentation Vision Transformers (ViTs) that integrates a novel large window attention mechanism with spatial pyramid pooling. Developed to efficiently capture multi-scale contextual information while maintaining computational economy, LawinASPP serves as the decoder in the Lawin Transformer framework, which establishes new state-of-the-art performance across standard semantic segmentation benchmarks (Yan et al., 2022).
1. Architectural Overview
LawinASPP operates as the decoder within a semantic segmentation pipeline whose encoder is a hierarchical ViT (e.g., MiT or Swin). The encoder provides four multi-resolution feature maps $\{F_1, F_2, F_3, F_4\}$, where $F_i$ has shape $C_i \times \frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}}$, i.e., spatial strides of 4, 8, 16, and 32 relative to the $H \times W$ input.
The features $F_2$, $F_3$, and $F_4$ are upsampled to the spatial resolution of $F_2$ (stride 8), concatenated along the channel axis, and projected via a $1{\times}1$ convolution to produce $X \in \mathbb{R}^{C \times \frac{H}{8} \times \frac{W}{8}}$. This tensor is processed by LawinASPP, yielding $Y \in \mathbb{R}^{C \times \frac{H}{8} \times \frac{W}{8}}$. Subsequently, $Y$ is upsampled to match $F_1$'s resolution (stride 4), concatenated with $F_1$, passed through another projection, and then through the final classifier.
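To make the resolutions concrete, the following sketch traces the spatial shapes through the pipeline. The $512 \times 512$ input size is an illustrative assumption, not something mandated by the architecture.

```python
# Trace spatial resolutions through the Lawin decoder pipeline.
# The 512x512 input size is an illustrative assumption.
H, W = 512, 512
strides = [4, 8, 16, 32]                       # encoder stages F1..F4
feat = [(H // s, W // s) for s in strides]

# F2, F3, F4 are upsampled to F2's resolution (stride 8) and fused into X.
X_hw = feat[1]
# LawinASPP preserves X's resolution; its output is then upsampled to F1's.
out_hw = feat[0]

print(feat)    # [(128, 128), (64, 64), (32, 32), (16, 16)]
print(X_hw)    # (64, 64)
print(out_hw)  # (128, 128)
```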
2. Large Window Attention Mechanism
A key innovation in LawinASPP is large window attention, which enables each local patch in $X$ to attend to a substantially broader contextual window at low overhead. The input $X$ is partitioned into non-overlapping “query patches” of size $P \times P$.
For each patch:
- The query is the flattened patch, $Q \in \mathbb{R}^{P^2 \times C}$.
- The context is a surrounding region of spatial size $RP \times RP$ centered on the query patch, with ratio $R > 1$.
To manage computational growth, the context is downsampled by average pooling with factor $R$, producing $\hat{C} \in \mathbb{R}^{P^2 \times C}$ once flattened. Multi-head attention is employed with $h$ heads. Each head applies a position-mixing MLP (with a residual connection) across the spatial patch dimension: $\hat{C}_i \leftarrow \mathrm{MLP}_i(\hat{C}_i) + \hat{C}_i$, where $\hat{C}_i \in \mathbb{R}^{P^2 \times C/h}$ is the $i$-th head's slice of $\hat{C}$.
The processed per-head features are re-concatenated and projected. Per-head queries, keys, and values are computed as $Q_i = Q\,W_i^{Q}$, $K_i = \hat{C}\,W_i^{K}$, $V_i = \hat{C}\,W_i^{V}$, and attention per head is assembled as $A_i = \mathrm{softmax}\!\bigl(Q_i K_i^{\top}/\sqrt{C/h}\bigr)V_i$. The aggregated output is $\mathrm{MHA} = \mathrm{Concat}_i[A_i]\, W_{\mathrm{mha}} \in \mathbb{R}^{P^2\times C}$.
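A minimal numpy sketch of large window attention for a single query patch follows. The sizes ($P$, $R$, $C$, $h$) are illustrative, the per-head $W^Q/W^K/W^V$ projections and the output projection $W_{\mathrm{mha}}$ are omitted (identity), and `W_mlp` stands in for each head's position-mixing MLP.

```python
import numpy as np

rng = np.random.default_rng(0)
P, R, C, h = 8, 4, 64, 4          # illustrative sizes, not the paper's config
d = C // h                        # per-head channel dimension

def avg_pool(x, k):
    # x: (R*P, R*P, C) -> (P, P, C), non-overlapping k x k average pooling
    HP, WP, Cc = x.shape
    return x.reshape(HP // k, k, WP // k, k, Cc).mean(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

Q_patch = rng.standard_normal((P * P, C))       # flattened P x P query patch
ctx = rng.standard_normal((R * P, R * P, C))    # RP x RP context window
ctx_p = avg_pool(ctx, R).reshape(P * P, C)      # pooled to P^2 context tokens

out_heads = []
for i in range(h):
    Ci = ctx_p[:, i * d:(i + 1) * d]            # per-head context slice
    W_mlp = rng.standard_normal((P * P, P * P)) * 0.01
    Ci = W_mlp @ Ci + Ci                        # position-mixing MLP + residual
    Qi = Q_patch[:, i * d:(i + 1) * d]
    A = softmax(Qi @ Ci.T / np.sqrt(d))         # (P^2, P^2) attention weights
    out_heads.append(A @ Ci)
out = np.concatenate(out_heads, axis=-1)        # (P^2, C) patch output
print(out.shape)                                # (64, 64)
```

Note how the pooled context keeps the key/value sequence length at $P^2$ regardless of $R$, which is what keeps the attention cost flat as the window grows.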
3. Multi-Scale Contextual Representation
Multi-scale context extraction is achieved by repeating large window attention for $R \in \{2, 4, 8\}$. This produces feature maps with effective receptive fields of $2P \times 2P$, $4P \times 4P$, and $8P \times 8P$. This multi-scale approach allows the representation of both the local and global context crucial for precise semantic segmentation.
4. Spatial Pyramid Pooling Integration
LawinASPP integrates the multi-scale large window attention outputs via a spatial pyramid pooling design. Its branches are:
- Shortcut: identity mapping of $X$.
- LWA with $R = 2$.
- LWA with $R = 4$.
- LWA with $R = 8$.
- Global pooling: global average pooling of $X$, followed by a $1{\times}1$ convolution and upsampling to match $X$'s spatial size.
Each branch produces a feature map of shape $C \times H' \times W'$; these are concatenated to form a $5C \times H' \times W'$ tensor. A $1{\times}1$ convolution reduces the channel dimension to $C$; BatchNorm and a GELU activation then produce the output $Y$: $Y = \sigma\bigl(\mathrm{BN}(\mathrm{Conv}_{1\times1}(\mathrm{Concat}[X, L_2(X), L_4(X), L_8(X), \mathrm{Upsample}(\overline{X})]))\bigr)$ where $\sigma$ is the GELU activation function and $\overline{X}$ is the globally pooled feature.
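The five-branch fusion can be sketched in numpy as below. The sizes are illustrative, `fake_lwa` is a shape-preserving placeholder for the attention branches, and the $1{\times}1$ convolution is written as a channel-wise matrix multiply (BatchNorm/GELU and the conv after global pooling are omitted for brevity).

```python
import numpy as np

rng = np.random.default_rng(1)
C, Hf, Wf = 32, 16, 16                # illustrative sizes
X = rng.standard_normal((C, Hf, Wf))

def fake_lwa(x, r):
    # Placeholder for large window attention at ratio r (shape-preserving).
    return x

gap = X.mean(axis=(1, 2), keepdims=True)      # global average pooling (C,1,1)
gap_up = np.broadcast_to(gap, (C, Hf, Wf))    # "upsample" back to H' x W'

branches = [X, fake_lwa(X, 2), fake_lwa(X, 4), fake_lwa(X, 8), gap_up]
cat = np.concatenate(branches, axis=0)        # (5C, H', W')

W = rng.standard_normal((C, 5 * C)) * 0.01    # 1x1 conv == matmul over channels
Y = np.einsum('oc,chw->ohw', W, cat)          # (C, H', W')
print(cat.shape, Y.shape)                     # (160, 16, 16) (32, 16, 16)
```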
5. Hyperparameters and Implementation
Selected hyperparameters are:
- Patch size $P = 8$ for the query patches.
- Ratios $R \in \{2, 4, 8\}$.
- Number of attention heads $h$, fixed per model configuration.
- Channels: the decoder width $C$ depends on the backbone, with MiT‑B5 and Swin‑L using a wider $C$ than MiT‑B3.
Pseudocode for LawinASPP and large window attention is:
```
function LawinASPP(X):
    B = []
    B.append(X)                                # shortcut branch
    for R in [2, 4, 8]:
        B.append(LargeWindowAttn(X, R))
    U = GAP(X)                                 # global average pooling
    U = Conv1x1(U)
    U = Upsample(U, size=X.shape[2:])
    B.append(U)
    Y = Concat(B, dim=1)                       # (5C) x H' x W'
    Y = Conv1x1(Y, C)
    Y = BN(Y)
    Y = GELU(Y)
    return Y

function LargeWindowAttn(X, R):
    for each patch location:
        Q = slice_patch(X, P)                  # P x P query patch
        C_full = slice_patch(X, R*P)           # R*P x R*P context window
        C_pooled = AvgPool(C_full, k=R, s=R)   # pooled back to P x P
        C_hat = reshape(C_pooled, (h, C/h, P^2))
        for i in range(h):
            C_hat[i] = MLP_i(C_hat[i]) + C_hat[i]   # position-mixing MLP
        C_P = reshape(concat_heads(C_hat), (P^2, C))
        A = multihead_attention(Q, C_P, C_P)
        place_output(A)
    return reconstructed_feature_map
```
6. Computational Complexity and Empirical Results
For feature maps of size $H' \times W'$ with $C$ channels and window size $P$, standard local window attention incurs: $\Omega(\mathrm{WA}) = 4H'W'C^2 + 2P^2H'W'C$. Lawin attention adds the position-mixing MLPs, modifying this to: $\Omega(\mathrm{LawinAttn}) = 4H'W'C^2 + 2P^2H'W'C + P^2H'W'C$. Since $P^2 \ll 4C$, the increase is marginal.
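A quick arithmetic check of the relative overhead, assuming the complexity model $4H'W'C^2 + 2P^2H'W'C$ for window attention plus an extra $P^2H'W'C$ position-mixing term for Lawin; the values $P = 8$, $C = 512$ on a $64 \times 64$ feature map are illustrative assumptions.

```python
# Relative overhead of the position-mixing term under the assumed cost model.
P, C, Hf, Wf = 8, 512, 64, 64      # illustrative assumptions
wa = 4 * Hf * Wf * C**2 + 2 * P**2 * Hf * Wf * C
lawin = wa + P**2 * Hf * Wf * C    # extra position-mixing MLP term
overhead = lawin / wa - 1
print(f"{overhead:.1%}")           # prints 2.9%
```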
Empirically, with a MiT‑B3 backbone:
- SegFormer‑B3 w/ ASPP: ~79.0 GFlops
- Lawin‑B3: 61.7 GFlops

This yields a ~22% FLOP reduction, with mIoU on ADE20K improving from 48.7% to 49.9%.
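The quoted reduction follows directly from the two GFlop figures above:

```python
# Sanity-check the quoted FLOP reduction from the figures in the text.
segformer_b3_aspp = 79.0           # GFlops
lawin_b3 = 61.7                    # GFlops
reduction = 1 - lawin_b3 / segformer_b3_aspp
print(f"{reduction:.0%}")          # prints 22%
```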
7. Benchmark Performance and Applications
LawinASPP as part of Lawin Transformer achieves state-of-the-art mIoU:
- 84.4% on Cityscapes
- 56.2% on ADE20K
- Competitive results on COCO-Stuff
This demonstrates the architecture’s capacity to enhance both segmentation accuracy and efficiency by leveraging large window attention within a multi-scale spatial pyramid pooling framework (Yan et al., 2022).