Swin-UNETR: Transformer U-Net for 3D Segmentation

Updated 22 January 2026

Swin-UNETR is a hybrid architecture that integrates a hierarchical Swin Transformer encoder with a U-shaped CNN decoder for accurate 3D segmentation.
It utilizes windowed and shifted-window self-attention to capture both multi-scale features and long-range dependencies for precise boundary localization.
The framework supports self-supervised pre-training and transfer learning, achieving state-of-the-art performance across diverse biomedical and geoscience applications.

Swin-UNETR is a hierarchical, windowed-attention Transformer architecture integrated into a U-shaped encoder–decoder segmentation framework. It leverages the Swin Transformer as its encoder, capturing both long-range dependencies and multi-scale features, while connecting to a CNN-based decoder via skip connections for dense prediction tasks in 3D volumes. Originally proposed to address the limitations of fully convolutional networks—namely, their restricted receptive field and reduced capacity for modeling global context—Swin-UNETR has become a state-of-the-art backbone for volumetric segmentation, denoising, super-resolution, and dense error map estimation across diverse biomedical and geoscience domains (Hatamizadeh et al., 2022, Tang et al., 2021, Sadikov et al., 2023, Kakavand et al., 2023, Kakavand et al., 2024).

1. Architectural Principles: Hierarchical Swin Transformer Encoder

Swin-UNETR deploys a 3D Swin Transformer encoder that processes input volumes partitioned into non-overlapping local patches (typically 2×2×2 to 4×4×4 voxels). Each patch is linearly embedded into a high-dimensional token vector, forming the input sequence for the encoder (Hatamizadeh et al., 2022, Tang et al., 2021). The Swin Transformer organizes computation into sequential stages, each composed of multiple blocks that alternate between window-based multi-head self-attention (W-MSA) and shifted-window multi-head self-attention (SW-MSA):

$\begin{aligned} \hat z^{l} &= \mathrm{W\text{-}MSA}(\mathrm{LN}(z^{l-1})) + z^{l-1}, \\ z^{l} &= \mathrm{MLP}(\mathrm{LN}(\hat z^{l})) + \hat z^{l}, \\ \hat z^{l+1} &= \mathrm{SW\text{-}MSA}(\mathrm{LN}(z^{l})) + z^{l}, \\ z^{l+1} &= \mathrm{MLP}(\mathrm{LN}(\hat z^{l+1})) + \hat z^{l+1}. \end{aligned}$

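The window partitioning restricts attention to local neighborhoods. With windows of $M$ tokens per side, each token attends only to the $M^3$ tokens in its own 3D window, so the attention cost per layer is $O(N M^3 C)$ (with $N$ tokens and $C$ channels). This is linear in $N$, rather than quadratic as in global self-attention, and improves scalability to large volumetric data. The shifted-window scheme changes the window boundaries between consecutive blocks, creating cross-window connections that support non-local context aggregation (Tang et al., 2021, Hatamizadeh et al., 2022, Jiang et al., 2024).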
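The effect of windowing and shifting can be illustrated with a minimal sketch. It is not taken from any Swin-UNETR implementation: the function names, the 56³ token grid, and the one-dimensional shift demonstration are assumptions chosen for illustration. The cost estimate counts only the two attention matrix products and ignores the linear projections.

```python
def attention_cost(num_tokens, keys_per_token, channels):
    """Rough multiply-accumulate count for the QK^T and attention-times-V products.

    Each token attends to `keys_per_token` keys, and each of the two
    products costs `channels` multiply-accumulates per query-key pair.
    """
    return 2 * num_tokens * keys_per_token * channels


def window_ids(length, window, shift=0):
    """Window index of every position along one axis after a cyclic shift."""
    return [((i - shift) % length) // window for i in range(length)]


# Hypothetical sizes: a 56^3 token grid, 48 channels, 7x7x7 windows.
num_tokens = 56 ** 3
channels = 48
tokens_per_window = 7 ** 3

global_cost = attention_cost(num_tokens, num_tokens, channels)
windowed_cost = attention_cost(num_tokens, tokens_per_window, channels)
ratio = global_cost // windowed_cost  # equals num_tokens / tokens_per_window

# One axis of 8 tokens with windows of 4, shifted by half a window.
regular = window_ids(8, 4)           # positions 3 and 4 sit in different windows
shifted = window_ids(8, 4, shift=2)  # positions 3 and 4 now share a window
```

In the shifted layout, the cyclic wrap also places the first and last positions in one window. Swin-style implementations typically mask attention between such wrapped regions, a detail this sketch omits.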

Patch-merging layers between stages halve the spatial resolution and double the channel dimension, resulting in multi-scale hierarchical feature maps. Common configurations use 4 encoder stages, with token dimensions growing from 48 up to 384 or 768 (application-dependent) and window sizes typically fixed at 7×7×7 for volumetric inputs (Hatamizadeh et al., 2022, Garcia et al., 2023, Kakavand et al., 2024).
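The shape bookkeeping implied by patch embedding and patch merging can be sketched as follows. The 96³ input volume and the 2×2×2 patch size are assumed example values, and the helper names are hypothetical. Where merging sits relative to each stage differs between implementations, so the stage boundaries here are one possible convention.

```python
from math import prod


def token_grid(volume_shape, patch_size):
    """Per-axis token counts for non-overlapping patches."""
    if any(dim % p != 0 for dim, p in zip(volume_shape, patch_size)):
        raise ValueError("volume must be divisible by the patch size (pad first)")
    return tuple(dim // p for dim, p in zip(volume_shape, patch_size))


def encoder_stage_shapes(grid, embed_dim, num_stages=4):
    """List (token grid, channels) for each stage of a hierarchical encoder."""
    shapes = []
    channels = embed_dim
    for _ in range(num_stages):
        shapes.append((tuple(grid), channels))
        # Patch merging before the next stage: halve each axis, double channels.
        grid = tuple(g // 2 for g in grid)
        channels *= 2
    return shapes


grid = token_grid((96, 96, 96), (2, 2, 2))         # assumed input and patch size
num_tokens = prod(grid)                            # tokens entering stage 1
stages = encoder_stage_shapes(grid, embed_dim=48)  # embed_dim 48 as in the text
for stage_index, (stage_grid, stage_channels) in enumerate(stages, start=1):
    print(f"stage {stage_index}: grid={stage_grid}, channels={stage_channels}")
```

With these assumptions the channel width runs 48, 96, 192, 384 over four stages, matching the lower end of the range quoted above.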

2. Decoder Design and Multi-Scale Feature Fusion

The decoder in Swin-UNETR employs a U-shaped, CNN-based structure that mirrors the Swin encoder hierarchy. It uses transposed convolutions to progressively restore spatial resolution. At each stage, upsampled decoder features are concatenated with the corresponding encoder features from the same scale via skip connections, followed by a residual block of 3D convolutions and normalizations:

$\text{Decoder}_{i} = \mathrm{Conv}_{3\times3\times3}(\mathrm{Concat}(\mathrm{Upsample}(\text{Decoder}_{i+1}),\text{Encoder}_{i}))$

This skip connection mechanism enables fine-grained spatial details from early encoder stages to be integrated at each decoding level, critical for accurate boundary localization (Hatamizadeh et al., 2022, Maurya et al., 2022, Yang et al., 2024, Sadikov et al., 2023, Jiang et al., 2024).
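The decoder recursion above can be traced at the level of tensor shapes. This sketch assumes that each transposed convolution doubles the spatial grid and projects to the channel count of the matching skip connection, so that concatenation yields twice the skip channels. That projection rule, the function name, and the example encoder shapes are assumptions, not details confirmed by the cited papers.

```python
def decoder_plan(encoder_shapes):
    """Walk the decoder from the deepest encoder level back to the shallowest.

    `encoder_shapes` lists (grid, channels) per encoder level, shallowest first.
    """
    plan = []
    grid, channels = encoder_shapes[-1]  # deepest features start the decoder
    for skip_grid, skip_channels in reversed(encoder_shapes[:-1]):
        up_grid = tuple(2 * g for g in grid)  # transposed conv doubles each axis
        if up_grid != skip_grid:
            raise ValueError("upsampled grid does not match the skip connection")
        plan.append({
            "in_channels": channels,
            "grid": skip_grid,
            # Assumption: upsampled features are projected to skip_channels
            # before concatenation with the encoder features.
            "concat_channels": 2 * skip_channels,
            "out_channels": skip_channels,  # after the convolutional block
        })
        grid, channels = skip_grid, skip_channels
    return plan


# Hypothetical encoder pyramid (grid, channels), shallowest level first.
encoder_shapes = [
    ((48, 48, 48), 48),
    ((24, 24, 24), 96),
    ((12, 12, 12), 192),
    ((6, 6, 6), 384),
]
plan = decoder_plan(encoder_shapes)
```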

Recent extensions such as Swin DER (Yang et al., 2024) enhance the decoder with learnable interpolation (Onsampling), spatial–channel parallel attention gates (SCP AG), and deformable convolution blocks, closing the encoder–decoder capacity gap and yielding measurable improvements in segmentation accuracy and boundary precision.

3. Self-Supervised Pre-Training and Transfer Learning

An important property of Swin-UNETR is its compatibility with self-supervised pre-training, enabling transfer of anatomical or structural priors to downstream tasks (Tang et al., 2021, Kumar, 2023, Zhang et al., 2023). Typical pre-training strategies employ proxy tasks such as masked volume inpainting, 3D rotation prediction, and contrastive coding. The total pre-training objective is

$\mathcal L_{\rm tot} = \mathcal L_{\rm inpaint} + \mathcal L_{\rm contrast} + \mathcal L_{\rm rot}$

where $\mathcal L_{\rm inpaint}$ is an $L_1$ reconstruction loss, $\mathcal L_{\rm rot}$ is cross-entropy on discrete spatial rotations, and $\mathcal L_{\rm contrast}$ is an InfoNCE contrastive loss. This enhances generalizability to label-limited scenarios and improves fine-tuning performance on target datasets (e.g., BTCV, MSD) (Tang et al., 2021).
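A minimal sketch of this three-term objective on plain Python lists follows. It uses a generic InfoNCE form with cosine similarity and a temperature of 0.1. The temperature, the equal weighting, and all function names are assumptions for illustration, and the pre-training code of the cited work may differ in detail.

```python
import math


def l1_loss(pred, target):
    """Mean absolute error between reconstructed and original voxels."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def cross_entropy(logits, label):
    """Cross-entropy of one sample, via a numerically stable log-sum-exp."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[label]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE: classify the positive view among the negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    return cross_entropy(logits, 0)


def pretraining_loss(recon, target, view_a, view_b, negatives, rot_logits, rot_label):
    """Unweighted sum of the inpainting, contrastive, and rotation terms."""
    inpaint = l1_loss(recon, target)
    contrast = info_nce(view_a, view_b, negatives)
    rot = cross_entropy(rot_logits, rot_label)
    return inpaint + contrast + rot


# Toy values only; real inputs would be network outputs on 3D sub-volumes.
total = pretraining_loss(
    recon=[0.2, 0.8], target=[0.0, 1.0],
    view_a=[1.0, 0.0], view_b=[0.9, 0.1], negatives=[[0.0, 1.0], [-1.0, 0.0]],
    rot_logits=[2.0, 0.1, 0.1, 0.1], rot_label=0,
)
```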

Spatial and temporal transfer learning has also been validated for geoscience domains (precipitation nowcasting): after pre-training on data-rich regions, Swin-UNETR can be efficiently fine-tuned on new geographies or time periods with only modest drops in performance (Kumar, 2023).

4. Applications and Quantitative Performance

Swin-UNETR has demonstrated utility across multiple domains:

Medical segmentation: Achieves state-of-the-art Dice coefficients (0.913 on BraTS2021 brain tumor (Hatamizadeh et al., 2022), 0.918 on BTCV multi-organ CT (Tang et al., 2021), 84.36% multi-level Dice in PARSE pulmonary artery segmentation (Maurya et al., 2022), 0.873 Dice for blood segmentation in SAH patients (Garcia et al., 2023), >98% DSC for knee bone segmentation (Kakavand et al., 2023)).
Biomechanical modeling: Automated segmentations enable finite-element models with indistinguishable stress/strain results versus manual ground truths (Kakavand et al., 2023, Kakavand et al., 2024).
Preclinical workflows: Outperforms nnU-Net and AIMOS on mouse micro-CT datasets, especially under cross-scanner or protocol variations (Jiang et al., 2024).
Image denoising and super-resolution: Sets new benchmarks in diffusion MRI fidelity, generalization, and scan-time reduction (FA MAE = 0.0496, ninefold scan-time speed-up) (Sadikov et al., 2023).
Dense error map estimation: Produces fine-grained, real-valued spatial maps for registration quality, surpassing categorical or averaged error metrics (Salari et al., 2023).

Performance statistics for select tasks are summarized below.

Application	Dice (%)	Hausdorff (mm)	Other Metrics	Reference
Brain tumor segmentation (BraTS2021)	91.3	5.83	Ensemble, 10 models	(Hatamizadeh et al., 2022)
Multi-organ CT (BTCV)	91.8	—	SOTA on leaderboard	(Tang et al., 2021)
Pulmonary artery segmentation	84.36	—	Multi-level Dice	(Maurya et al., 2022)
Blood segmentation (SAH)	87.3	1.866	IoU, VSI, SASD	(Garcia et al., 2023)
Knee bone segmentation	>98	1.66–1.65	SSM-corrected FE	(Kakavand et al., 2023)
Mouse micro-CT organ segmentation	82.7–91.2	0.25–1.28	Robust under domain shift	(Jiang et al., 2024)
Diffusion MRI denoising	—	—	FA MAE 0.0496	(Sadikov et al., 2023)
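For reference, the two overlap and boundary metrics in the table can be computed as in the sketch below, which represents each segmentation as a set of voxel coordinates. It implements the plain (maximum) Hausdorff distance by brute force. Many studies report a 95th-percentile variant instead, so the numbers in the table are not necessarily computed this way. The function names and toy masks are illustrative.

```python
import math


def dice_coefficient(pred, truth):
    """Dice overlap between two sets of voxel coordinates."""
    pred, truth = set(pred), set(truth)
    if not pred and not truth:
        return 1.0  # two empty masks agree perfectly
    return 2 * len(pred & truth) / (len(pred) + len(truth))


def hausdorff_distance(a, b, spacing=(1.0, 1.0, 1.0)):
    """Maximum Hausdorff distance in physical units (both sets non-empty)."""
    def dist(p, q):
        return math.sqrt(sum(((pi - qi) * s) ** 2 for pi, qi, s in zip(p, q, spacing)))

    def directed(src, dst):
        # Farthest point of src from its nearest neighbour in dst.
        return max(min(dist(p, q) for q in dst) for p in src)

    return max(directed(a, b), directed(b, a))


# Toy masks: three voxels each, two of them shared.
pred = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
truth = {(0, 0, 0), (0, 0, 1), (1, 0, 0)}

dice = dice_coefficient(pred, truth)
hd_voxels = hausdorff_distance(pred, truth)
hd_mm = hausdorff_distance(pred, truth, spacing=(2.0, 1.0, 1.0))  # anisotropic voxels
```

The brute-force distance is quadratic in the number of voxels, so it suits small examples. Practical pipelines usually restrict the computation to surface voxels or use distance transforms.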

5. Loss Functions, Training Protocols, and Computation

Swin-UNETR supports a spectrum of loss functions:

Dice loss: Quantifies volumetric overlap for segmentations.
Focal loss: Addresses class imbalance, especially for small structures (Kakavand et al., 2024, Kakavand et al., 2023).
Mean squared error: For regression tasks (dense error map, denoising) (Salari et al., 2023, Sadikov et al., 2023).
Combined objectives: Many implementations use a weighted sum of Dice and cross-entropy (Heiliger et al., 2022, Garcia et al., 2023, Yang et al., 2024).
Deep supervision: Losses computed at all decoder scales, with exponentially decaying weights for deeper outputs (Yang et al., 2024).
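The losses listed above can be sketched for the binary case on flat lists of voxel probabilities. The smoothing constants, the focusing parameter gamma = 2, the equal Dice and cross-entropy weights, and the decay factor of 0.5 for deep supervision are assumed defaults, not values reported by the cited studies.

```python
import math


def soft_dice_loss(probs, targets, eps=1e-6):
    """1 - soft Dice overlap between predicted probabilities and binary labels."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def binary_cross_entropy(probs, targets, eps=1e-7):
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Cross-entropy down-weighted for voxels that are already well classified."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        p_true = p if t == 1 else 1.0 - p  # probability assigned to the true class
        total -= (1.0 - p_true) ** gamma * math.log(p_true)
    return total / len(probs)


def dice_ce_loss(probs, targets, dice_weight=1.0, ce_weight=1.0):
    """Weighted sum of soft Dice and cross-entropy."""
    return (dice_weight * soft_dice_loss(probs, targets)
            + ce_weight * binary_cross_entropy(probs, targets))


def deep_supervision_weights(num_scales, decay=0.5):
    """Normalized, exponentially decaying weights, full-resolution output first."""
    raw = [decay ** i for i in range(num_scales)]
    total = sum(raw)
    return [w / total for w in raw]


probs = [0.9, 0.2, 0.7, 0.1]
targets = [1, 0, 1, 0]
combined = dice_ce_loss(probs, targets)
scale_weights = deep_supervision_weights(3)
```

A deep-supervision loss would then be the sum of the per-scale losses multiplied by `scale_weights`, with ground truth resampled to each decoder resolution.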

Optimizers are typically AdamW with cosine (or linear) annealing schedules. Batch sizes are 1–8 (memory-constrained), and patch sizes vary from 64³ up to 128³ depending on modality and hardware (Hatamizadeh et al., 2022, Kakavand et al., 2023, Jiang et al., 2024, Yang et al., 2024). Inference is commonly performed with a sliding window and overlap averaging. Model sizes approach 62M parameters, with FLOPs per scan in the range of 393G–395G (Garcia et al., 2023, Tang et al., 2021).
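Sliding-window inference with overlap averaging can be sketched along a single axis. A 3D version applies the same start positions per axis. The 50% overlap, the function names, and the `predict` callback standing in for the network are assumptions. Production implementations often add Gaussian weighting of window centers, which this sketch omits.

```python
def window_starts(length, window, overlap=0.5):
    """Start indices of windows that cover `length`, the last one flush with the end."""
    if window >= length:
        return [0]
    step = max(1, int(window * (1 - overlap)))
    starts = list(range(0, length - window + 1, step))
    if starts[-1] != length - window:
        starts.append(length - window)  # make sure the tail is covered
    return starts


def sliding_window_average(signal, window, predict, overlap=0.5):
    """Run `predict` on each window and average predictions where windows overlap."""
    total = [0.0] * len(signal)
    count = [0] * len(signal)
    for start in window_starts(len(signal), window, overlap):
        output = predict(signal[start:start + window])
        for offset, value in enumerate(output):
            total[start + offset] += value
            count[start + offset] += 1
    return [t / c for t, c in zip(total, count)]


# With an identity "model", averaging must return the input unchanged.
signal = [float(i) for i in range(10)]
restored = sliding_window_average(signal, window=4, predict=lambda chunk: chunk)
```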

6. Ensemble Learning, Domain Adaptation, and Robustness

Swin-UNETR integrates naturally into ensemble pipelines, often combined with CNN-based networks (e.g., nnU-Net), with late fusion boosting segmentation metrics (mean Dice improvement of 2.4 percentage points in AutoPET) (Heiliger et al., 2022). It is robust against domain shift owing to hierarchical self-attention and effective transfer learning; its performance degrades less than that of CNNs under new imaging protocols or scanners (DSC drop <8% vs. 15% for nnU-Net) (Jiang et al., 2024, Kumar, 2023, Sadikov et al., 2023).
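Late fusion can be as simple as averaging per-voxel foreground probabilities from the individual models before thresholding. The sketch below shows that variant. The cited ensemble work does not necessarily use plain averaging, and the function name, weights, and threshold are assumptions.

```python
def late_fusion(prob_maps, weights=None, threshold=0.5):
    """Fuse per-voxel probabilities from several models and binarize the result."""
    num_models = len(prob_maps)
    if weights is None:
        weights = [1.0 / num_models] * num_models  # plain average
    num_voxels = len(prob_maps[0])
    fused = [
        sum(w * probs[i] for w, probs in zip(weights, prob_maps))
        for i in range(num_voxels)
    ]
    mask = [1 if p >= threshold else 0 for p in fused]
    return fused, mask


# Hypothetical foreground probabilities for three voxels from two models.
transformer_probs = [0.9, 0.4, 0.2]
cnn_probs = [0.7, 0.8, 0.1]
fused, mask = late_fusion([transformer_probs, cnn_probs])
```

Here the second voxel is below threshold for one model but is recovered after fusion, which is the kind of complementary behavior ensembles rely on.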

Domain adaptation via single-sample fine-tuning yields additional gains on out-of-distribution datasets (e.g., diffusion MRI cohorts) (Sadikov et al., 2023). Statistical shape modeling and geometry filtering further refine segmentations for biomechanical mesh generation, ensuring high-fidelity and watertight surfaces for FE modeling (Kakavand et al., 2024, Kakavand et al., 2023).

7. Limitations and Ongoing Developments

Swin-UNETR is constrained by GPU memory and batch size; scaling beyond 128³ patches is nontrivial. Fixed window geometry may miss long-range dependencies, and research into dynamic windows and deformable attention is ongoing (Kumar, 2023, Yang et al., 2024). Decoders in the original designs are less optimized than the Transformer encoders, driving developments such as Onsampling, attention gating, and deformable convolution (Yang et al., 2024). Further limitations include dependence on full volume/crop availability and the requirement for precise intensity normalization and pre-registration steps in medical pipelines (Sadikov et al., 2023, Garcia et al., 2023).

References

(Hatamizadeh et al., 2022) Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images
(Tang et al., 2021) Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis
(Sadikov et al., 2023) Generative AI for Rapid Diffusion MRI with Improved Image Quality, Reliability and Generalizability
(Kakavand et al., 2023) Integration of Swin UNETR and statistical shape modeling for a semi-automated segmentation of the knee and biomechanical modeling of articular cartilage
(Kakavand et al., 2024) Swin UNETR segmentation with automated geometry filtering for biomechanical modeling of knee joint cartilage
(Garcia et al., 2023) A Fully Automated Pipeline Using Swin Transformers for Deep Learning-Based Blood Segmentation on Head CT Scans After Aneurysmal Subarachnoid Hemorrhage
(Kumar, 2023) Precipitation Nowcasting With Spatial And Temporal Transfer Learning Using Swin-UNETR
(Yang et al., 2024) Optimizing Medical Image Segmentation with Advanced Decoder Design
(Jiang et al., 2024) Exploring Automated Contouring Across Institutional Boundaries: A Deep Learning Approach with Mouse Micro-CT Datasets
(Maurya et al., 2022) PARSE challenge 2022: Pulmonary Arteries Segmentation using Swin U-Net Transformer(Swin UNETR) and U-Net
(Salari et al., 2023) Dense Error Map Estimation for MRI-Ultrasound Registration in Brain Tumor Surgery Using Swin UNETR
(Heiliger et al., 2022) AutoPET Challenge: Combining nn-Unet with Swin UNETR Augmented by Maximum Intensity Projection Classifier