
LawinASPP: Multi-scale Attention for Segmentation

Updated 23 January 2026
  • LawinASPP is a multi-scale decoder head for semantic segmentation that combines large window attention with spatial pyramid pooling to capture both local and global context.
  • It fuses multi-resolution features from hierarchical Vision Transformers to enhance segmentation accuracy while reducing computational cost.
  • The design achieves state-of-the-art mIoU results (e.g., 84.4% on Cityscapes) through adaptive pooling and efficient multi-head attention mechanisms.

The LawinASPP architecture is a multi-scale decoder head for semantic segmentation Vision Transformers (ViTs) that integrates a novel large window attention mechanism with spatial pyramid pooling. Designed to capture multi-scale contextual information at low computational cost, LawinASPP serves as the decoder in the Lawin Transformer framework, which establishes new state-of-the-art performance across standard semantic segmentation benchmarks (Yan et al., 2022).

1. Architectural Overview

LawinASPP operates as the decoder within a semantic segmentation pipeline whose encoder is a hierarchical ViT (e.g., MiT or Swin). The encoder provides four multi-resolution feature maps, $F_1\,(OS=4)$, $F_2\,(OS=8)$, $F_3\,(OS=16)$, and $F_4\,(OS=32)$, where $OS$ denotes the output stride. Each $F_i$ has shape $C\times H'\times W'$, with $H'=H/OS$ and $W'=W/OS$.

The features $\{F_2, F_3, F_4\}$ are upsampled to the spatial resolution of $F_2$, concatenated along the channel axis, and projected via a $1\times1$ convolution to produce $X\in\mathbb{R}^{C\times (H/8)\times (W/8)}$. This tensor $X$ is processed by LawinASPP, yielding $Y\in\mathbb{R}^{C\times (H/8)\times (W/8)}$. Subsequently, $Y$ is upsampled to match $F_1$'s resolution, concatenated with $F_1$, passed through another $1\times1$ projection, and then through the final classifier.
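As an illustration of this feature aggregation, the following NumPy sketch mimics the upsample–concatenate–project step with toy sizes. The dimensions and the nearest-neighbour upsampling are simplifying assumptions for the sketch (real implementations typically use bilinear upsampling, and hierarchical backbones vary channel width per stage):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 64x64 input image, embed dim C=32 (the paper uses larger).
H = W = 64
C = 32

# Encoder outputs at output strides 8, 16, 32 (channels unified to C here).
F2 = rng.normal(size=(C, H // 8,  W // 8))
F3 = rng.normal(size=(C, H // 16, W // 16))
F4 = rng.normal(size=(C, H // 32, W // 32))

def upsample_nn(F, factor):
    """Nearest-neighbour upsampling along both spatial axes."""
    return F.repeat(factor, axis=1).repeat(factor, axis=2)

# Bring F3 and F4 to F2's resolution and concatenate along channels.
stacked = np.concatenate([F2, upsample_nn(F3, 2), upsample_nn(F4, 4)], axis=0)

# A 1x1 convolution is a per-pixel linear projection over channels.
W_proj = rng.normal(size=(C, 3 * C)) / np.sqrt(3 * C)
X = np.einsum('oc,chw->ohw', W_proj, stacked)
print(X.shape)  # (32, 8, 8), i.e. C x H/8 x W/8
```

The resulting $X$ is the tensor that LawinASPP consumes.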

2. Large Window Attention Mechanism

A key innovation in LawinASPP is large window attention, which enables each local patch in $X$ to attend to a substantially broader contextual window at low overhead. The input $X$ is partitioned into non-overlapping patches (“query patches”) of size $P\times P$, with $P=8$.

For each patch:

  • The query is $Q\in\mathbb{R}^{P^2\times C}$.
  • The context $C$ is a surrounding region of spatial size $RP\times RP$, $C\in\mathbb{R}^{(RP)^2\times C}$, with ratio $r=R^2$.

To manage computational growth, the context is average-pooled with downsampling factor $R$, producing $C_{\mathrm{pooled}}\in\mathbb{R}^{P^2\times C}$. Multi-head attention is employed with $h=R^2$ heads. Each head applies a position-mixing MLP, with a residual connection, across the spatial patch dimension: $\widehat{C}_i \mapsto \mathrm{MLP}_i(\widehat{C}_i)+\widehat{C}_i \in \mathbb{R}^{(C/h)\times P^2}$

The processed features are re-concatenated and projected. Per-head queries, keys, and values are computed, and multi-head attention per patch is assembled as: $A_i = \mathrm{softmax}\!\left(\frac{W_q^i Q\,(W_k^i C^P)^\top}{\sqrt{D_h}}\right)(W_v^i C^P)$ The aggregated output is: $\mathrm{MHA} = \mathrm{Concat}_i[A_i]\,W_{\mathrm{mha}} \in \mathbb{R}^{P^2\times C}$
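The per-patch computation above can be sketched in NumPy as follows. Toy sizes ($P=4$, $R=2$, $C=16$) are assumptions for brevity, the position-mixing MLP is reduced to a single linear layer, and all weights are random, so this illustrates shapes and data flow rather than a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

P, R, C = 4, 2, 16     # toy sizes (the paper uses P=8 and larger C)
h = R * R              # number of heads, h = R^2
Dh = C // h            # per-head dimension

Q_in = rng.normal(size=(P * P, C))          # query patch, P^2 x C
ctx  = rng.normal(size=(R * P, R * P, C))   # RP x RP context window

# Average-pool the context by factor R -> P x P x C, flatten to P^2 x C.
pooled = ctx.reshape(P, R, P, R, C).mean(axis=(1, 3)).reshape(P * P, C)

# Per-head position-mixing "MLP" (one linear layer here) with a residual,
# applied across the P^2 spatial positions of each head's C/h channels.
C_hat = pooled.reshape(P * P, h, Dh).transpose(1, 2, 0)   # h x Dh x P^2
W_mix = rng.normal(size=(h, P * P, P * P)) / (P * P)
C_hat = np.einsum('hpq,hdq->hdp', W_mix, C_hat) + C_hat
C_P = C_hat.transpose(2, 0, 1).reshape(P * P, C)          # back to P^2 x C

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Multi-head attention: queries from the patch, keys/values from context.
Wq, Wk, Wv = (rng.normal(size=(h, Dh, C)) / np.sqrt(C) for _ in range(3))
heads = []
for i in range(h):
    q, k, v = Q_in @ Wq[i].T, C_P @ Wk[i].T, C_P @ Wv[i].T
    heads.append(softmax(q @ k.T / np.sqrt(Dh)) @ v)      # P^2 x Dh
W_mha = rng.normal(size=(C, C)) / np.sqrt(C)
out = np.concatenate(heads, axis=1) @ W_mha               # P^2 x C
print(out.shape)  # (16, 16)
```

Note that after pooling, each query position attends over only $P^2$ context tokens regardless of $R$, which is what keeps the cost of the larger window low.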

3. Multi-Scale Contextual Representation

Multi-scale context extraction is achieved by repeating large window attention for $R\in\{2,4,8\}$. This produces feature maps with effective receptive fields of $16\times16$, $32\times32$, and $64\times64$ (i.e., $RP\times RP$ with $P=8$). This multi-scale approach captures both the local and global context crucial for precise semantic segmentation.

4. Spatial Pyramid Pooling Integration

LawinASPP integrates the multi-scale large window attention outputs via a spatial pyramid pooling design. Its branches are:

  • Shortcut: identity mapping of $X$.
  • LWA with $R=2$.
  • LWA with $R=4$.
  • LWA with $R=8$.
  • Global pooling: global average pooling of $X$, followed by a $1\times1$ convolution and upsampling to match the spatial size.

Each branch produces a feature map of shape $C\times H'\times W'$; these are concatenated to form a $5C\times H'\times W'$ tensor. A $1\times1$ convolution reduces the channel dimension to $C$, followed by BatchNorm and a GELU activation, producing the output $Y$: $Y = \sigma\bigl(\mathrm{BN}(\mathrm{Conv}_{1\times1}(\mathrm{Concat}[X, L_2(X), L_4(X), L_8(X), \mathrm{Upsample}(\overline{X})]))\bigr)$ where $\sigma$ is the activation function, $L_R$ denotes large window attention with ratio $R$, and $\overline{X}$ is the globally pooled feature.
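A minimal NumPy sketch of this five-branch assembly is shown below. The attention branches are stubbed out as identity maps and BatchNorm is reduced to per-channel standardisation, both simplifications for brevity; the sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
C, Hp, Wp = 8, 16, 16
X = rng.normal(size=(C, Hp, Wp))

def lwa_stub(X, R):
    """Placeholder for the large window attention branch (identity here)."""
    return X

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Five branches: shortcut, LWA at R=2,4,8, and the image-level branch.
branches = [X] + [lwa_stub(X, R) for R in (2, 4, 8)]
g = X.mean(axis=(1, 2), keepdims=True)            # global average pooling
branches.append(np.broadcast_to(g, X.shape))      # upsample back to H' x W'

cat = np.concatenate(branches, axis=0)            # 5C x H' x W'
W1 = rng.normal(size=(C, 5 * C)) / np.sqrt(5 * C) # 1x1 conv as a matmul
Y = np.einsum('oc,chw->ohw', W1, cat)
# BatchNorm reduced to per-channel standardisation for this sketch.
Y = (Y - Y.mean(axis=(1, 2), keepdims=True)) / Y.std(axis=(1, 2), keepdims=True)
Y = gelu(Y)
print(Y.shape)  # (8, 16, 16)
```

The output $Y$ has the same shape as $X$, ready for the final fusion with $F_1$ described in Section 1.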

5. Hyperparameters and Implementation

Selected hyperparameters are:

  • Patch size $P=8$.
  • Ratios $R\in\{2,4,8\}$.
  • Number of heads $h=R^2$.
  • Channels: for MiT‑B3, $C=768$; for MiT‑B5 or Swin‑L, $C=1024$.

Pseudocode for LawinASPP and large window attention is:

function LawinASPP(X):                # X: C x H' x W'
  B = []
  B.append(X)                         # shortcut branch
  for R in [2, 4, 8]:
    B.append(LargeWindowAttn(X, R))   # multi-scale LWA branches
  U = GAP(X)                          # global average pooling
  U = Conv1x1(U)
  U = Upsample(U, size=X.shape[2:])
  B.append(U)                         # image-level branch
  Y = Concat(B, dim=1)                # (5C) x H' x W'
  Y = Conv1x1(Y, C)                   # reduce channels to C
  Y = BN(Y); Y = GELU(Y)
  return Y

function LargeWindowAttn(X, R):
  h = R * R                                   # number of heads
  for each PxP patch location:
    Q = slice_patch(X, P)                     # query, P^2 x C
    C_full = slice_context(X, R*P)            # context, (RP)^2 x C
    C_pooled = AvgPool(C_full, k=R, s=R)      # pooled context, P^2 x C
    C_hat = reshape(C_pooled, (h, C/h, P^2))  # split into h heads
    for i in range(h):
      C_hat[i] = MLP_i(C_hat[i]) + C_hat[i]   # position-mixing MLP, residual
    C_P = reshape(concat_h(C_hat), (P^2, C))
    A = MultiHeadAttn(Q, C_P, C_P, heads=h)
    place_output(A)                           # write A back at patch location
  return reconstructed_feature_map

6. Computational Complexity and Empirical Results

For a feature map $X$ of size $H'\times W'$, standard local window attention incurs: $\Omega(\mathrm{Local}) = 4H'W'C^2 + 2H'W'P^2C$ Lawin attention modifies this to: $\Omega(\mathrm{Lawin}) = 4H'W'C^2 + 3H'W'P^2C$ Since $P^2\ll C$, the increase is marginal.
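As a sanity check on these expressions, the following snippet plugs in illustrative values ($C=512$ and output stride 8 are assumptions made for the arithmetic; actual decoder widths depend on the backbone):

```python
# FLOP comparison of the two attention complexities above, for a
# 512x512 input at output stride 8 with assumed C=512, P=8.
Hp = Wp = 512 // 8
C, P = 512, 8

local = 4 * Hp * Wp * C**2 + 2 * Hp * Wp * P**2 * C
lawin = 4 * Hp * Wp * C**2 + 3 * Hp * Wp * P**2 * C
print(f"local: {local / 1e9:.2f} GFlops, lawin: {lawin / 1e9:.2f} GFlops")
print(f"relative increase: {100 * (lawin - local) / local:.1f}%")
# local: 4.56 GFlops, lawin: 4.70 GFlops
# relative increase: 2.9%
```

The extra term $H'W'P^2C$ is dwarfed by the $4H'W'C^2$ projection cost, which is why the larger window comes almost for free.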

Empirically, on $512\times512$ inputs with an MiT‑B3 backbone:

  • SegFormer‑B3 with ASPP: ~79.0 GFlops
  • Lawin‑B3: 61.7 GFlops

This yields a ~22% FLOP reduction, while mIoU on ADE20K improves from 48.7% to 49.9%.

7. Benchmark Performance and Applications

LawinASPP as part of Lawin Transformer achieves state-of-the-art mIoU:

  • 84.4% on Cityscapes
  • 56.2% on ADE20K
  • Competitive results on COCO-Stuff

This demonstrates the architecture’s capacity to enhance both segmentation accuracy and efficiency by leveraging large window attention within a multi-scale spatial pyramid pooling framework (Yan et al., 2022).

References

  • Yan, H., Zhang, C., & Wu, M. (2022). Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention. arXiv preprint.
