
Swin UNETR Encoder

Updated 14 January 2026
  • The encoder is a hierarchical backbone that uses multi-stage Swin Transformer blocks with windowed and shifted self-attention to capture rich 3D contextual features.
  • It implements patch embedding, tokenization, and patch merging to progressively reduce spatial resolution while increasing channel dimensions for volumetric data.
  • Skip connections between encoder and decoder layers enable the preservation of fine spatial details, enhancing segmentation and classification accuracy.

The Swin UNETR encoder is a hierarchical sequence-to-sequence backbone for volumetric representation learning, designed to extract multi-scale features from 3D medical imaging data and other dense volumetric sources. It utilizes a multi-stage Swin Transformer architecture with windowed and shifted-window self-attention, patch embedding, hierarchical downsampling, and positional bias mechanisms, and is typically coupled to a fully convolutional decoder via skip connections for segmentation or classification tasks (Hatamizadeh et al., 2022, Tang et al., 2021, Jiang et al., 2024, Kakavand et al., 2023, Bengtsson et al., 7 Jan 2026).

1. Patch Embedding and Tokenization

The Swin UNETR encoder begins with a patch partitioning module that divides the input volume (or multi-channel 3D image) into non-overlapping cubic patches. In the canonical implementation for 3D MRI or CT, the patch size is $(2,2,2)$ voxels. Each local patch is flattened (e.g., from $2\times2\times2\times C_0$ voxels to a vector of dimension $8 C_0$, where $C_0$ is the input channel count) and then projected into an embedding space using a learnable linear transformation or 3D convolution, producing tokens of dimension $C$ (e.g., $C=48$ or $C=96$). For an input of size $H\times W\times D\times C_0$, this step yields a token grid of shape $(H/2)\times(W/2)\times(D/2)\times C$ (Jiang et al., 2024, Hatamizadeh et al., 2022, Kakavand et al., 2023). This token grid serves as input to the successive Swin Transformer stages.
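As a concrete illustration, the patch partitioning and embedding step can be sketched in NumPy. The projection matrix `W_embed` is a random stand-in for the learned parameters; the shapes are the point here, not the values:

```python
import numpy as np

def patch_embed_3d(volume, patch=2, embed_dim=48, rng=None):
    """Partition a 3D volume into non-overlapping cubic patches and
    linearly project each flattened patch to the embedding dimension.
    W_embed is a random stand-in for the learned projection."""
    if rng is None:
        rng = np.random.default_rng(0)
    H, W, D, C0 = volume.shape
    # Group voxels into (patch, patch, patch) blocks: one token per block.
    x = volume.reshape(H // patch, patch,
                       W // patch, patch,
                       D // patch, patch, C0)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    x = x.reshape(H // patch, W // patch, D // patch, patch ** 3 * C0)
    W_embed = rng.standard_normal((patch ** 3 * C0, embed_dim))
    return x @ W_embed  # learnable linear map in the real model

# A 96^3 single-channel volume yields a 48^3 grid of 48-dim tokens.
vol = np.zeros((96, 96, 96, 1), dtype=np.float32)
emb = patch_embed_3d(vol, patch=2, embed_dim=48)
print(emb.shape)  # (48, 48, 48, 48)
```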

2. Hierarchical Swin Transformer Stages and Patch Merging

The encoder is structured into four or five sequential “stages”, each comprising several Swin Transformer blocks. At each stage $i$, the token grid has shape $(H_i, W_i, D_i, C_i)$, with spatial resolution and channel dimension evolving as:

| Stage | Resolution | Channel Dim ($C_i$) | Number of Blocks | Num. Attention Heads |
|---|---|---|---|---|
| 0 | $H/2 \times W/2 \times D/2$ | 48 (or 96) | — | — |
| 1 | $H/4 \times W/4 \times D/4$ | 96 (or 192) | 2 | 6 (or 12) |
| 2 | $H/8 \times W/8 \times D/8$ | 192 (or 384) | 2 | 12 (or 24) |
| 3 | $H/16 \times W/16 \times D/16$ | 384 (or 768) | 2 | 24 |
| 4 | $H/32 \times W/32 \times D/32$ | 768 | 2 | 48 |

The patch-merging operation between stages concatenates each non-overlapping group of $2\times2\times2$ tokens (eight neighbors) and applies a linear transformation that doubles the channel dimension (i.e., $C_i \to 2C_i$) and halves each spatial dimension. The number of blocks and attention heads per stage typically follow [2, 2, 2, 2] blocks and [3, 6, 12, 24] heads, respectively, with window size $M=7$ (Hatamizadeh et al., 2022, Kakavand et al., 2023, Jiang et al., 2024).
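One patch-merging step can be sketched in NumPy as follows; `W_merge` is a random stand-in for the learned $8C \to 2C$ linear reduction:

```python
import numpy as np

def patch_merge_3d(x, rng=None):
    """Concatenate each non-overlapping 2x2x2 group of tokens (8*C
    features) and linearly project to 2*C, halving each spatial dim
    and doubling channels. W_merge stands in for learned weights."""
    if rng is None:
        rng = np.random.default_rng(0)
    H, W, D, C = x.shape
    x = x.reshape(H // 2, 2, W // 2, 2, D // 2, 2, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6).reshape(H // 2, W // 2, D // 2, 8 * C)
    W_merge = rng.standard_normal((8 * C, 2 * C))
    return x @ W_merge

stage0 = np.zeros((48, 48, 48, 48), dtype=np.float32)  # H/2 grid, C=48
stage1 = patch_merge_3d(stage0)
print(stage1.shape)  # (24, 24, 24, 96)
```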

3. Swin Transformer Block: Windowed and Shifted Self-Attention

Each Swin Transformer block consists of two sublayers applied with pre-normalization and additive residual connections:

  • W-MSA (Windowed Multi-Head Self-Attention): Computes self-attention within non-overlapping cubic windows of size $M^3$ (often $M=7$) for computational tractability and locality induction.
  • SW-MSA (Shifted Window MSA): In alternate blocks, a cyclic shift by $\lfloor M/2\rfloor$ voxels along each spatial axis ensures cross-window information propagation and enlarges the receptive field without quadratic cost escalation.
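The cyclic shift underlying SW-MSA amounts to a circular roll of the token grid, sketched below with `np.roll`. Note that the real implementation additionally masks attention between tokens that only become neighbors through the wrap-around:

```python
import numpy as np

M = 7  # window size
x = np.arange(14 * 14 * 14).reshape(14, 14, 14, 1)

# SW-MSA: cyclically shift the token grid by floor(M/2) along each
# spatial axis, so the subsequent window partition mixes tokens that
# belonged to adjacent windows in the unshifted layout.
shift = M // 2
x_shifted = np.roll(x, shift=(-shift, -shift, -shift), axis=(0, 1, 2))

# Reversing the shift after attention restores the original layout.
x_restored = np.roll(x_shifted, shift=(shift, shift, shift), axis=(0, 1, 2))
print(np.array_equal(x, x_restored))  # True
```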

In both variants, for each window, input tokens $X\in\mathbb{R}^{M^3\times C}$ are projected to queries, keys, and values:

$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$

Self-attention is evaluated as $\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^\top}{\sqrt{d}} + B\right)V$, where $B$ is a learnable relative positional bias matrix of size $M^3\times M^3$ shared across all windows (Hatamizadeh et al., 2022, Tang et al., 2021, Kakavand et al., 2023, Jiang et al., 2024). The MLP is a two-layer feed-forward block with hidden size $4C$ and GELU activation.
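The attention computation for a single window can be sketched in NumPy. The projections are random stand-ins for learned weights, and a single zero-initialized bias `B` is used for simplicity (in the real model the bias is learned, indexed by relative position):

```python
import numpy as np

def window_attention(X, B, num_heads=3, rng=None):
    """Multi-head self-attention over the M^3 tokens of one window,
    with a relative positional bias B added to the logits.
    W_q/W_k/W_v are random stand-ins for learned projections."""
    if rng is None:
        rng = np.random.default_rng(0)
    N, C = X.shape                 # N = M^3 tokens, C channels
    d = C // num_heads             # per-head dimension
    Wq, Wk, Wv = [rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3)]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Split channels into heads: shape (heads, N, d)
    Q, K, V = [m.reshape(N, num_heads, d).transpose(1, 0, 2) for m in (Q, K, V)]
    logits = Q @ K.transpose(0, 2, 1) / np.sqrt(d) + B  # B broadcasts over heads
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)            # softmax over keys
    out = attn @ V                                      # (heads, N, d)
    return out.transpose(1, 0, 2).reshape(N, C)         # merge heads

M, C = 7, 48
X = np.random.default_rng(1).standard_normal((M ** 3, C))
B = np.zeros((M ** 3, M ** 3))  # learned bias table in practice
out = window_attention(X, B, num_heads=3)
print(out.shape)  # (343, 48)
```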

4. Multi-Resolution Feature Extraction and Skip Connections

After each stage, the encoded tokens are reshaped back into 3D feature maps of size $H_i \times W_i \times D_i \times C_i$. These multi-scale representations are retained for skip connections. In the full Swin UNETR model, five levels of encoder outputs (including the initial post-embedding grid) are passed via dedicated skip pathways to corresponding decoder layers, maintaining spatial fidelity and facilitating U-shaped reconstructions for semantic segmentation (Hatamizadeh et al., 2022, Tang et al., 2021, Jiang et al., 2024). Each skip tensor may be post-processed by a small residual convolutional block before concatenation with upsampled decoder activations.
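One decoder fusion step along these lines might look like the sketch below, where nearest-neighbor upsampling stands in for the transposed convolution used in practice:

```python
import numpy as np

def upsample_and_concat(deep, skip):
    """Decoder step sketch: upsample the deeper feature map by 2x in
    each spatial dim (nearest neighbor here; a transposed convolution
    in practice), then concatenate the skip tensor along channels."""
    up = deep.repeat(2, axis=0).repeat(2, axis=1).repeat(2, axis=2)
    assert up.shape[:3] == skip.shape[:3], "spatial grids must match"
    return np.concatenate([up, skip], axis=-1)

deep = np.zeros((6, 6, 6, 768))     # stage-4 output at H/32
skip = np.zeros((12, 12, 12, 384))  # stage-3 skip tensor at H/16
fused = upsample_and_concat(deep, skip)
print(fused.shape)  # (12, 12, 12, 1152)
```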

5. Positional Encoding and Normalization Strategy

The Swin UNETR encoder employs purely relative positional encodings via the bias matrix $B$ in the self-attention modules—no absolute positional embeddings are present. Layer normalization (LN) is applied before both the attention and the MLP sublayers (“pre-norm”). Drop-path (stochastic depth) may be ramped from zero in shallow blocks up to $0.1$ at depth, but dropout rates within MLP and attention blocks are typically set to zero (Jiang et al., 2024, Hatamizadeh et al., 2022, Kakavand et al., 2023).

6. Integration into Downstream Tasks: Segmentation and Classification

In segmentation pipelines (e.g., brain tumor, bone, cartilage, or mouse organ), the complete Swin UNETR encoder is coupled to a decoder, which utilizes skip connections from all encoder stages to restore fine spatial detail via upsampling and concatenation (Hatamizadeh et al., 2022, Tang et al., 2021, Jiang et al., 2024). For volumetric classification tasks, the deepest Swin encoder stage frequently undergoes global average pooling to yield a dense feature vector, which is processed by a shallow MLP head for final prediction (Bengtsson et al., 7 Jan 2026).
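The classification path can be sketched as global average pooling over the deepest stage followed by a small MLP; the weights below are random stand-ins for the learned head:

```python
import numpy as np

def classify(deep_feat, num_classes=2, rng=None):
    """Classification head sketch: global average pooling over the
    deepest stage's spatial grid, then a shallow MLP with ReLU.
    W1/W2 are random stand-ins for the learned head weights."""
    if rng is None:
        rng = np.random.default_rng(0)
    pooled = deep_feat.mean(axis=(0, 1, 2))  # (C,) dense feature vector
    C = pooled.shape[0]
    W1 = rng.standard_normal((C, C // 4))
    W2 = rng.standard_normal((C // 4, num_classes))
    hidden = np.maximum(pooled @ W1, 0)      # ReLU
    return hidden @ W2                       # class logits

deep = np.random.default_rng(1).standard_normal((4, 4, 4, 768))
logits = classify(deep, num_classes=2)
print(logits.shape)  # (2,)
```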

7. Training Hyperparameters and Implementation Details

Key settings for Swin UNETR encoders include:

  • Input patch size: $(2,2,2)$
  • Window size: $7$
  • Embedding dimensions: progression $[48, 96, 192, 384, 768]$ or similar; heads per stage $[3, 6, 12, 24]$; blocks per stage $2$
  • Optimizer: Adam, typical learning rate $1\times 10^{-4}$, batch size $8$
  • Loss: Dice loss and binary cross-entropy for segmentation; binary cross-entropy for classification
  • Intensity normalization of input to $[0,1]$; random cropping and augmentation (Kakavand et al., 2023, Bengtsson et al., 7 Jan 2026)
  • Decoder: 3D transposed convolution and residual blocks with InstanceNorm+ReLU
  • Inference: sliding windows, e.g., $128^3$ voxels with overlap (Jiang et al., 2024)
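Sliding-window inference with overlap averaging can be sketched as follows; the toy `predict` function stands in for a forward pass through the trained network:

```python
import numpy as np

def sliding_window_infer(volume, predict, win=128, overlap=0.5):
    """Run `predict` on overlapping cubic windows of the volume and
    average the predictions where windows overlap."""
    stride = int(win * (1 - overlap))
    H, W, D = volume.shape
    out = np.zeros((H, W, D))
    count = np.zeros((H, W, D))
    # Window start positions along one axis, clipped so every window fits.
    starts = lambda L: sorted({min(s, L - win) for s in range(0, L, stride)})
    for i in starts(H):
        for j in starts(W):
            for k in starts(D):
                patch = volume[i:i+win, j:j+win, k:k+win]
                out[i:i+win, j:j+win, k:k+win] += predict(patch)
                count[i:i+win, j:j+win, k:k+win] += 1
    return out / count  # average overlapping predictions

vol = np.ones((160, 160, 160))
pred = sliding_window_infer(vol, predict=lambda p: p * 2, win=128, overlap=0.5)
print(pred.shape)  # (160, 160, 160)
```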

Skip connections, patch merging, windowed/shifted self-attention, and multi-resolution extraction form the backbone of the Swin UNETR encoder, providing state-of-the-art results across a range of volumetric segmentation and classification benchmarks (Hatamizadeh et al., 2022, Tang et al., 2021, Jiang et al., 2024, Kakavand et al., 2023, Bengtsson et al., 7 Jan 2026).
