
Swin UNETR Encoder

Updated 14 January 2026
  • The encoder is a hierarchical backbone that uses multi-stage Swin Transformer blocks with windowed and shifted self-attention to capture rich 3D contextual features.
  • It implements patch embedding, tokenization, and patch merging to progressively reduce spatial resolution while increasing channel dimensions for volumetric data.
  • Skip connections between encoder and decoder layers enable the preservation of fine spatial details, enhancing segmentation and classification accuracy.

The Swin UNETR encoder is a hierarchical sequence-to-sequence backbone for volumetric representation learning, designed to extract multi-scale features from 3D medical imaging data and other dense volumetric sources. It utilizes a multi-stage Swin Transformer architecture with windowed and shifted-window self-attention, patch embedding, hierarchical downsampling, and positional bias mechanisms, and is typically coupled to a fully convolutional decoder via skip connections for segmentation or classification tasks (Hatamizadeh et al., 2022, Tang et al., 2021, Jiang et al., 2024, Kakavand et al., 2023, Bengtsson et al., 7 Jan 2026).

1. Patch Embedding and Tokenization

The Swin UNETR encoder begins with a patch partitioning module that divides the input volume (or multi-channel 3D image) into non-overlapping cubic patches. In the canonical implementation for 3D MRI or CT, the patch size is $(2,2,2)$ voxels. Each local patch is flattened (e.g., from $2\times2\times2\times C_0$ voxels to a vector of dimension $8 C_0$, where $C_0$ is the input channel count) and then projected into an embedding space using a learnable linear transformation or 3D convolution, producing tokens of dimension $C$ (e.g., $C=48$ or $C=96$). For an input of size $H\times W\times D\times C_0$, this step yields a token grid of shape $(H/2)\times(W/2)\times(D/2)\times C$ (Jiang et al., 2024, Hatamizadeh et al., 2022, Kakavand et al., 2023). This token grid serves as input to the successive Swin Transformer stages.
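As a concrete illustration, the patch partitioning and embedding step can be sketched in NumPy. The projection matrix `W_embed` is a random stand-in for the learned parameters; the shapes are the point here, not the values:

```python
import numpy as np

def patch_embed_3d(volume, patch=2, embed_dim=48, rng=None):
    """Partition a 3D volume into non-overlapping cubic patches and
    linearly project each flattened patch to the embedding dimension.
    W_embed is a random stand-in for the learned projection."""
    if rng is None:
        rng = np.random.default_rng(0)
    H, W, D, C0 = volume.shape
    # Group voxels into (patch, patch, patch) blocks: one token per block.
    x = volume.reshape(H // patch, patch,
                       W // patch, patch,
                       D // patch, patch, C0)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    x = x.reshape(H // patch, W // patch, D // patch, patch ** 3 * C0)
    W_embed = rng.standard_normal((patch ** 3 * C0, embed_dim))
    return x @ W_embed  # learnable linear map in the real model

# A 96^3 single-channel volume yields a 48^3 grid of 48-dim tokens.
vol = np.zeros((96, 96, 96, 1), dtype=np.float32)
emb = patch_embed_3d(vol, patch=2, embed_dim=48)
print(emb.shape)  # (48, 48, 48, 48)
```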

2. Hierarchical Swin Transformer Stages and Patch Merging

The encoder is structured into four or five sequential “stages”, each comprising several Swin Transformer blocks. At each stage $i$, the token grid has shape $(H_i, W_i, D_i, C_i)$, with spatial resolution and channel dimension evolving as:

| Stage | Resolution | Channel Dim ($C_i$) | Number of Blocks | Num. Attention Heads |
|---|---|---|---|---|
| 0 | $H/2 \times W/2 \times D/2$ | 48 (or 96) | — | — |
| 1 | $H/4 \times W/4 \times D/4$ | 96 (or 192) | 2 | 6 (or 12) |
| 2 | $H/8 \times W/8 \times D/8$ | 192 (or 384) | 2 | 12 (or 24) |
| 3 | $H/16 \times W/16 \times D/16$ | 384 (or 768) | 2 | 24 |
| 4 | $H/32 \times W/32 \times D/32$ | 768 | 2 | 48 |

The patch-merging operation between stages concatenates each non-overlapping group of $2\times2\times2$ tokens (eight neighbors) and applies a linear transformation that doubles the channel dimension (i.e., $C_i \to 2C_i$) and halves each spatial dimension. The number of blocks and attention heads per stage typically follow [2, 2, 2, 2] blocks and [3, 6, 12, 24] heads, respectively, with window size $M=7$ (Hatamizadeh et al., 2022, Kakavand et al., 2023, Jiang et al., 2024).
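One patch-merging step can be sketched in NumPy as follows; `W_merge` is a random stand-in for the learned $8C \to 2C$ linear reduction:

```python
import numpy as np

def patch_merge_3d(x, rng=None):
    """Concatenate each non-overlapping 2x2x2 group of tokens (8*C
    features) and linearly project to 2*C, halving each spatial dim
    and doubling channels. W_merge stands in for learned weights."""
    if rng is None:
        rng = np.random.default_rng(0)
    H, W, D, C = x.shape
    x = x.reshape(H // 2, 2, W // 2, 2, D // 2, 2, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6).reshape(H // 2, W // 2, D // 2, 8 * C)
    W_merge = rng.standard_normal((8 * C, 2 * C))
    return x @ W_merge

stage0 = np.zeros((48, 48, 48, 48), dtype=np.float32)  # H/2 grid, C=48
stage1 = patch_merge_3d(stage0)
print(stage1.shape)  # (24, 24, 24, 96)
```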

3. Swin Transformer Block: Windowed and Shifted Self-Attention

Each Swin Transformer block consists of two sublayers applied with pre-normalization and additive residual connections:

  • W-MSA (Windowed Multi-Head Self-Attention): Computes self-attention within non-overlapping cubic windows of size $M^3$ (often $M=7$) for computational tractability and locality induction.
  • SW-MSA (Shifted Window MSA): In alternate blocks, a cyclic shift by $\lfloor M/2\rfloor$ voxels along each spatial axis ensures cross-window information propagation and enlarges the receptive field without quadratic cost escalation.
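The cyclic shift underlying SW-MSA amounts to a circular roll of the token grid, sketched below with `np.roll`. Note that the real implementation additionally masks attention between tokens that only become neighbors through the wrap-around:

```python
import numpy as np

M = 7  # window size
x = np.arange(14 * 14 * 14).reshape(14, 14, 14, 1)

# SW-MSA: cyclically shift the token grid by floor(M/2) along each
# spatial axis, so the subsequent window partition mixes tokens that
# belonged to adjacent windows in the unshifted layout.
shift = M // 2
x_shifted = np.roll(x, shift=(-shift, -shift, -shift), axis=(0, 1, 2))

# Reversing the shift after attention restores the original layout.
x_restored = np.roll(x_shifted, shift=(shift, shift, shift), axis=(0, 1, 2))
print(np.array_equal(x, x_restored))  # True
```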

In both variants, for each window, input tokens $X\in\mathbb{R}^{M^3\times C}$ are projected to queries, keys, and values:

$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$

Self-attention is evaluated as $\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^\top}{\sqrt{d}} + B\right)V$, where $B$ is a learnable relative positional bias matrix of size $M^3\times M^3$ shared across all windows (Hatamizadeh et al., 2022, Tang et al., 2021, Kakavand et al., 2023, Jiang et al., 2024). The MLP is a two-layer feed-forward block with hidden size $4C$ and GELU activation.
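The attention computation for a single window can be sketched in NumPy. The projections are random stand-ins for learned weights, and a single zero-initialized bias `B` is used for simplicity (in the real model the bias is learned, indexed by relative position):

```python
import numpy as np

def window_attention(X, B, num_heads=3, rng=None):
    """Multi-head self-attention over the M^3 tokens of one window,
    with a relative positional bias B added to the logits.
    W_q/W_k/W_v are random stand-ins for learned projections."""
    if rng is None:
        rng = np.random.default_rng(0)
    N, C = X.shape                 # N = M^3 tokens, C channels
    d = C // num_heads             # per-head dimension
    Wq, Wk, Wv = [rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3)]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Split channels into heads: shape (heads, N, d)
    Q, K, V = [m.reshape(N, num_heads, d).transpose(1, 0, 2) for m in (Q, K, V)]
    logits = Q @ K.transpose(0, 2, 1) / np.sqrt(d) + B  # B broadcasts over heads
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)            # softmax over keys
    out = attn @ V                                      # (heads, N, d)
    return out.transpose(1, 0, 2).reshape(N, C)         # merge heads

M, C = 7, 48
X = np.random.default_rng(1).standard_normal((M ** 3, C))
B = np.zeros((M ** 3, M ** 3))  # learned bias table in practice
out = window_attention(X, B, num_heads=3)
print(out.shape)  # (343, 48)
```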

4. Multi-Resolution Feature Extraction and Skip Connections

After each stage, the encoded tokens are reshaped back into 3D feature maps of size $H_i \times W_i \times D_i \times C_i$. These multi-scale representations are retained for skip connections. In the full Swin UNETR model, five levels of encoder outputs (including the initial post-embedding grid) are passed via dedicated skip pathways to corresponding decoder layers, maintaining spatial fidelity and facilitating U-shaped reconstructions for semantic segmentation (Hatamizadeh et al., 2022, Tang et al., 2021, Jiang et al., 2024). Each skip tensor may be post-processed by a small residual convolutional block before concatenation with upsampled decoder activations.
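One decoder fusion step along these lines might look like the sketch below, where nearest-neighbor upsampling stands in for the transposed convolution used in practice:

```python
import numpy as np

def upsample_and_concat(deep, skip):
    """Decoder step sketch: upsample the deeper feature map by 2x in
    each spatial dim (nearest neighbor here; a transposed convolution
    in practice), then concatenate the skip tensor along channels."""
    up = deep.repeat(2, axis=0).repeat(2, axis=1).repeat(2, axis=2)
    assert up.shape[:3] == skip.shape[:3], "spatial grids must match"
    return np.concatenate([up, skip], axis=-1)

deep = np.zeros((6, 6, 6, 768))     # stage-4 output at H/32
skip = np.zeros((12, 12, 12, 384))  # stage-3 skip tensor at H/16
fused = upsample_and_concat(deep, skip)
print(fused.shape)  # (12, 12, 12, 1152)
```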

5. Positional Encoding and Normalization Strategy

The Swin UNETR encoder employs purely relative positional encodings via the bias matrix $B$ in the self-attention modules—no absolute positional embeddings are present. Layer normalization (LN) is applied before both the attention and the MLP sublayers (“pre-norm”). Drop-path (stochastic depth) may be ramped from zero in shallow blocks up to $0.1$ at depth, but dropout rates within MLP and attention blocks are typically set to zero (Jiang et al., 2024, Hatamizadeh et al., 2022, Kakavand et al., 2023).

6. Integration into Downstream Tasks: Segmentation and Classification

In segmentation pipelines (e.g., brain tumor, bone, cartilage, or mouse organ), the complete Swin UNETR encoder is coupled to a decoder, which utilizes skip connections from all encoder stages to restore fine spatial detail via upsampling and concatenation (Hatamizadeh et al., 2022, Tang et al., 2021, Jiang et al., 2024). For volumetric classification tasks, the deepest Swin encoder stage frequently undergoes global average pooling to yield a dense feature vector, which is processed by a shallow MLP head for final prediction (Bengtsson et al., 7 Jan 2026).
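The classification path can be sketched as global average pooling over the deepest stage followed by a small MLP; the weights below are random stand-ins for the learned head:

```python
import numpy as np

def classify(deep_feat, num_classes=2, rng=None):
    """Classification head sketch: global average pooling over the
    deepest stage's spatial grid, then a shallow MLP with ReLU.
    W1/W2 are random stand-ins for the learned head weights."""
    if rng is None:
        rng = np.random.default_rng(0)
    pooled = deep_feat.mean(axis=(0, 1, 2))  # (C,) dense feature vector
    C = pooled.shape[0]
    W1 = rng.standard_normal((C, C // 4))
    W2 = rng.standard_normal((C // 4, num_classes))
    hidden = np.maximum(pooled @ W1, 0)      # ReLU
    return hidden @ W2                       # class logits

deep = np.random.default_rng(1).standard_normal((4, 4, 4, 768))
logits = classify(deep, num_classes=2)
print(logits.shape)  # (2,)
```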

7. Training Hyperparameters and Implementation Details

Key settings for Swin UNETR encoders include:

  • Input patch size: $(2,2,2)$
  • Window size: $7$
  • Embedding dimensions: progression $[48, 96, 192, 384, 768]$ or similar; heads per stage $[3, 6, 12, 24]$; blocks per stage $2$
  • Optimizer: Adam, typical learning rate $1\times 10^{-4}$, batch size $8$
  • Loss: Dice loss and binary cross-entropy for segmentation; binary cross-entropy for classification
  • Intensity normalization of input to $[0,1]$; random cropping and augmentation (Kakavand et al., 2023, Bengtsson et al., 7 Jan 2026)
  • Decoder: 3D transposed convolution and residual blocks with InstanceNorm+ReLU
  • Inference: sliding windows, e.g., $128^3$ voxels with overlap (Jiang et al., 2024)
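Sliding-window inference with overlap averaging can be sketched as follows; the toy `predict` function stands in for a forward pass through the trained network:

```python
import numpy as np

def sliding_window_infer(volume, predict, win=128, overlap=0.5):
    """Run `predict` on overlapping cubic windows of the volume and
    average the predictions where windows overlap."""
    stride = int(win * (1 - overlap))
    H, W, D = volume.shape
    out = np.zeros((H, W, D))
    count = np.zeros((H, W, D))
    # Window start positions along one axis, clipped so every window fits.
    starts = lambda L: sorted({min(s, L - win) for s in range(0, L, stride)})
    for i in starts(H):
        for j in starts(W):
            for k in starts(D):
                patch = volume[i:i+win, j:j+win, k:k+win]
                out[i:i+win, j:j+win, k:k+win] += predict(patch)
                count[i:i+win, j:j+win, k:k+win] += 1
    return out / count  # average overlapping predictions

vol = np.ones((160, 160, 160))
pred = sliding_window_infer(vol, predict=lambda p: p * 2, win=128, overlap=0.5)
print(pred.shape)  # (160, 160, 160)
```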

Skip connections, patch merging, windowed/shifted self-attention, and multi-resolution extraction form the backbone of the Swin UNETR encoder, providing state-of-the-art results across a range of volumetric segmentation and classification benchmarks (Hatamizadeh et al., 2022, Tang et al., 2021, Jiang et al., 2024, Kakavand et al., 2023, Bengtsson et al., 7 Jan 2026).
