Shifted Window Scheme in Transformers

Updated 8 February 2026
  • Shifted Window Scheme is a strategy that partitions feature maps into fixed windows and then shifts them to allow efficient local self-attention and cross-window communication.
  • It reduces quadratic complexity to linear by confining self-attention to smaller windows and cyclically shifting these windows to expand the receptive field.
  • This method underpins models like the Swin Transformer, significantly advancing performance in image recognition, medical imaging, and sequence modeling.

The shifted window scheme is a general architectural principle and computational strategy, first developed for hierarchical vision transformers, that enables efficient local self-attention while propagating information across spatial or sequential boundaries through intentional shifts of the window grid. The method achieves linear complexity with respect to input size by restricting self-attention to small, non-overlapping windows and then alternately shifting the window partitions in subsequent layers so that each token's receptive field grows across the network. This design has been foundational in a broad range of transformers for image, video, volumetric, and sequence data, has enabled scalable long-range modeling in various modalities, and has inspired variants for both self-attention and MLP-dominated models. It is typified by the Swin Transformer family and its derivatives (Liu et al., 2021, Huang et al., 12 Apr 2025, Li et al., 2023, Wang et al., 2024, Go et al., 2022, Li et al., 2023, Gu et al., 29 Jul 2025, Chen et al., 2024, Yu et al., 2022, Cai et al., 2024, Khadka et al., 10 Sep 2025, Smith et al., 2023, Bojesomo et al., 2022, Rowshan et al., 2020).

1. Formal Definition and Mathematical Structure

The shifted window scheme, as realized in Swin Transformer and its descendants, operates via a two-phase partitioning of feature maps (or sequences):

  1. Partition into Local Windows: Given a tensor (e.g., $Z \in \mathbb{R}^{H \times W \times C}$ for images), partition the feature map into non-overlapping $M \times M$ windows; in 3D, this generalizes to $M \times M \times M$ cubes, and in 1D to length-$W$ windows (Liu et al., 2021, Li et al., 2023, Li et al., 2023, Smith et al., 2023).
  2. Window-based Self-Attention (W-MSA): Within each window, compute standard multi-head self-attention independently:

Q=XWQ,K=XWK,V=XWVQ = XW_Q,\quad K = XW_K,\quad V = XW_V

Attention(Q,K,V)=Softmax(QKTd+B)V\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{Q K^T}{\sqrt{d}} + B\right)V

where $B$ is a learned relative positional bias. The outputs of all windows are then merged to rebuild the full map.

  3. Shifted Window Partition (SW-MSA): In alternating layers, the feature map is cyclically shifted by $(\lfloor M/2 \rfloor, \lfloor M/2 \rfloor)$ (or corresponding offsets in higher/lower dimensions), then the partitioning and W-MSA are performed, after which the shift is reversed (Liu et al., 2021, Li et al., 2023).
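The per-window attention defined in step 2 can be sketched in NumPy. The projection matrices and the bias $B$ below are random or zero placeholders, not trained parameters:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def window_self_attention(X, W_Q, W_K, W_V, B):
    # X: (M*M, C) tokens of one window; B: (M*M, M*M) relative position bias
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + B  # (M*M, M*M) attention logits
    return softmax(scores) @ V         # weighted sum of values

# usage on one 4x4 window with 8 channels
rng = np.random.default_rng(0)
M, C = 4, 8
X = rng.normal(size=(M * M, C))
W_Q, W_K, W_V = (rng.normal(size=(C, C)) for _ in range(3))
B = np.zeros((M * M, M * M))
out = window_self_attention(X, W_Q, W_K, W_V, B)
```

Because each window contains only $M^2$ tokens, the score matrix stays small regardless of the full feature-map size.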

Implementation includes masks to block attention across shifted window boundaries, ensuring that attention remains within the intended token groups (Liu et al., 2021, Gu et al., 29 Jul 2025).
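A minimal sketch of that masking, following the region-labeling construction common in Swin implementations: each pixel of the shifted map is labeled by the contiguous region it came from, and token pairs with different labels receive $-\infty$ logits (here $s$ is the shift, typically $\lfloor M/2 \rfloor$):

```python
import numpy as np

def shifted_window_mask(H, W, M, s):
    # Label each pixel by the pre-shift region it belongs to; tokens from
    # different regions must not attend to each other (assumes 0 < s < M).
    img = np.zeros((H, W), dtype=int)
    cnt = 0
    for rows in (slice(0, -M), slice(-M, -s), slice(-s, None)):
        for cols in (slice(0, -M), slice(-M, -s), slice(-s, None)):
            img[rows, cols] = cnt
            cnt += 1
    # Partition the label map into M x M windows, then compare labels pairwise
    win = img.reshape(H // M, M, W // M, M).transpose(0, 2, 1, 3).reshape(-1, M * M)
    # (num_windows, M*M, M*M) additive mask for the attention logits
    return np.where(win[:, :, None] != win[:, None, :], -np.inf, 0.0)
```

After the softmax, the $-\infty$ entries zero out attention across the seams introduced by the cyclic shift, so no padding or extra windows are needed.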

This process yields a sequence of blocks in which information is locally mixed, while the shifted partition in every other layer ensures that tokens near window boundaries in one layer attend to their neighbors in the next, efficiently linking local neighborhoods across the network.

2. Theoretical Properties and Computational Complexity

The primary theoretical advantage of the shifted window scheme is the reduction from quadratic to linear computational and memory complexity with respect to the input size.

For an $H \times W$ feature map and window size $M$, global self-attention complexity is $O((HW)^2 C)$. Window-based (W-MSA) and shifted-window (SW-MSA) layers each cost $O(HW M^2 C)$, as only $M^2$ tokens interact within each window and the number of windows is $HW/M^2$ (Liu et al., 2021, Boulaabi et al., 20 Apr 2025).

By alternately shifting the window grid, the method ensures every token can access information from neighbors in up to a $2M \times 2M$ region over two layers and, through hierarchical stacking, enables global propagation in $O(NM)$ time, where $N = HW$ (Liu et al., 2021, Boulaabi et al., 20 Apr 2025, Li et al., 2023, Li et al., 2023).

Complexity remains $O(N M^2 C)$ for fixed $M$, enabling scalable attention even for high-dimensional signals (images, video, 3D medical images, genomics) (Liu et al., 2021, Imran et al., 2024, Bojesomo et al., 2022, Li et al., 2023).
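These counts can be compared directly in code. The two expressions below follow the complexity formulas reported in the Swin analysis, where $4HWC^2$ is the shared projection cost of both layer types:

```python
def global_msa_cost(H, W, C):
    # Omega(MSA) = 4*hw*C^2 + 2*(hw)^2*C : quadratic in the number of tokens
    hw = H * W
    return 4 * hw * C**2 + 2 * hw**2 * C

def window_msa_cost(H, W, C, M):
    # Omega(W-MSA) = 4*hw*C^2 + 2*M^2*hw*C : linear in hw for fixed M
    hw = H * W
    return 4 * hw * C**2 + 2 * M**2 * hw * C

# e.g. a 56x56 feature map with C=96 channels and 7x7 windows
ratio = global_msa_cost(56, 56, 96) / window_msa_cost(56, 56, 96, 7)
```

For this typical first-stage configuration the windowed cost is more than an order of magnitude lower, and the gap widens as the token count grows.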

3. Practical Implementations: Algorithms and Variants

The following NumPy sketch captures the canonical Swin-style block (the attention masks across shifted window boundaries are omitted for brevity):

import numpy as np

def shifted_window_attention(X, M, shift, attend):
    # X: (H, W, C) feature map; attend: callable mapping (M*M, C) -> (M*M, C)
    H, W, C = X.shape
    s = M // 2 if shift else 0
    # (1) Optionally shift the feature map so windows straddle old boundaries
    if shift:
        X = np.roll(X, shift=(-s, -s), axis=(0, 1))
    # (2) Partition into non-overlapping M x M windows
    wins = X.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    wins = wins.reshape(-1, M * M, C)
    # (3) Self-attention independently within each window
    wins = np.stack([attend(w) for w in wins])
    # (4) Merge windows back into the full map and reverse the shift
    X_out = wins.reshape(H // M, W // M, M, M, C).transpose(0, 2, 1, 3, 4)
    X_out = X_out.reshape(H, W, C)
    if shift:
        X_out = np.roll(X_out, shift=(s, s), axis=(0, 1))
    return X_out

This pattern generalizes to 3D (e.g., $M \times M \times M$ cubes, with cyclic shifts across the D, H, and W axes) for volumetric data (Li et al., 2023, Imran et al., 2024, Bojesomo et al., 2022).
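The 3D case reduces to the same roll-and-reshape pattern; a minimal sketch with hypothetical helpers, assuming a (D, H, W, C) volume whose spatial dimensions are divisible by $M$:

```python
import numpy as np

def partition_3d(X, M):
    # X: (D, H, W, C) -> (num_cubes, M**3, C) non-overlapping M x M x M cubes
    D, H, W, C = X.shape
    cubes = X.reshape(D // M, M, H // M, M, W // M, M, C)
    cubes = cubes.transpose(0, 2, 4, 1, 3, 5, 6)
    return cubes.reshape(-1, M ** 3, C)

def shift_3d(X, M):
    # Half-window cyclic shift along depth, height, and width
    s = M // 2
    return np.roll(X, shift=(-s, -s, -s), axis=(0, 1, 2))
```

Attention then runs per cube exactly as in the 2D case, and the shift is reversed afterwards with the opposite roll.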

Many variants exist:

  • Group Shifted Window Attention (GSWA) in AgileIR splits attention heads into groups to further reduce memory in backpropagation while maintaining shifted connection logic (Cai et al., 2024).
  • 1D-Shifted Windows for long sequences use cyclic shifts and non-overlapping windows on 1D token arrays (e.g., in genomics or frequency estimation) (Li et al., 2023, Smith et al., 2023).
  • Gated MLPs (gSwin) and other non-attention mechanisms can also use the shifted window backbone for parameter-efficient local/global mixing (Go et al., 2022).
  • Padding-Free Shifting adjusts the boundary treatment to maximize parallelization and minimize overhead (Go et al., 2022).
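The head-grouping idea behind GSWA can be illustrated abstractly. The sketch below processes attention heads group by group so that only one group's score matrix is alive at a time; it uses identity projections for brevity and is not AgileIR's exact implementation:

```python
import numpy as np

def grouped_window_attention(X, heads, num_groups, head_dim):
    # X: (N, C) tokens of one window, with C == heads * head_dim.
    # Heads are split into num_groups groups and attended sequentially,
    # trading a little latency for a smaller peak activation footprint.
    N, C = X.shape
    Xh = X.reshape(N, heads, head_dim)
    hs = heads // num_groups
    outs = []
    for g in range(num_groups):
        Q = K = V = Xh[:, g * hs:(g + 1) * hs, :]  # identity "projections"
        scores = np.einsum('nhd,mhd->hnm', Q, K) / np.sqrt(head_dim)
        e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn = e / e.sum(axis=-1, keepdims=True)
        outs.append(np.einsum('hnm,mhd->nhd', attn, V))
    return np.concatenate(outs, axis=1).reshape(N, C)
```

Because each head's softmax is independent, grouping changes memory usage but not the result, which is what makes it attractive for training-time savings.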

4. Empirical Results and Applications

The shifted window scheme has been validated across diverse data modalities and application domains:

  • Computer Vision: Swin Transformer achieves a top-1 ImageNet-1K accuracy of 87.3%, significantly surpasses baselines in COCO detection and ADE20K segmentation, and forms the backbone for semantic segmentation, object detection, and shadow detection, outperforming non-shifted or global attention models in both accuracy and efficiency (Liu et al., 2021, Wang et al., 2024, Gu et al., 29 Jul 2025).
  • Medical Imaging: Multi-view, multi-modal fundus fusion benefits from SW-MSA in retinopathy diagnosis (82.53% accuracy), and shifted windows are used effectively in OCTA vessel segmentation and in 3D aorta/blood vessel segmentation in CT (Huang et al., 12 Apr 2025, Chen et al., 2024, Imran et al., 2024).
  • Genomics/Sequence Modeling: 1D-Swin enables modeling of 17K-bp DNA sequences, capturing both local and distal regulatory dependencies more efficiently than quadratic self-attention (Li et al., 2023).
  • Signal Processing: Shifted windows in frequency estimation yield state-of-the-art performance over both neural and traditional spectral estimators, particularly in low SNR regimes (Smith et al., 2023).
  • Vision Transformers for Small-Scale Data: Convolutional enhancements to shifted window blocks (CoSwin) yield improvements over both pure convolutional and transformer models in small-data settings (Khadka et al., 10 Sep 2025).
  • List Decoding of Polar Codes: An unrelated but nomenclaturally similar “shifted window” pruning scheme generalizes path-metric filtering during list decoding, yielding up to 0.5 dB gain in error correction at essentially conventional complexity (Rowshan et al., 2020).

The following table summarizes domains and outcomes:

| Domain | Shifted Window Variant | Empirical Gains/Benefits |
| --- | --- | --- |
| Image/video recognition | Standard SW-MSA | ↑ Accuracy, ↓ complexity vs. ViT/CNN (Liu et al., 2021) |
| Medical image fusion | Multi-view/3D SW-MSA | ↑ Diagnostic accuracy, efficient multi-scale fusion (Huang et al., 12 Apr 2025, Li et al., 2023) |
| Shadow/vessel detection | SW-MSA + custom modules | Improved disambiguation, Dice, BER (Wang et al., 2024, Chen et al., 2024) |
| Genomics, frequency estimation | 1D/complex SW-MSA | Efficient long-range modeling (Li et al., 2023, Smith et al., 2023) |
| Efficient image restoration | GSWA (grouped SW-MSA) | ↑ Memory efficiency, minimal PSNR loss (Cai et al., 2024) |
| Coding theory | Shifted pruning (SCL decoding) | +0.25–0.5 dB FER gain at constant complexity (Rowshan et al., 2020) |

5. Impact, Limitations, and Evolving Perspectives

The shifted window scheme has become a foundational mechanism for scalable attention and lightweight long-range modeling. Its practical impact is observed in the wide adoption of Swin Transformer backbones in vision, multi-dimensional data, and time-series signal modeling (Liu et al., 2021, Li et al., 2023, Imran et al., 2024, Huang et al., 12 Apr 2025, Li et al., 2023, Smith et al., 2023).

However, ablation and comparative studies point to limits on its necessity:

  • Win Transformer demonstrates that a well-designed depthwise convolution can substitute for shifted windows in vision domains, yielding equal or higher accuracy at the same complexity while further simplifying implementation (Yu et al., 2022). This suggests the shift may become less critical when cross-window mixing is achieved by other local architectural primitives.
  • Efficient Implementations: Grouping, low-dimensional projections, and padding-free schemes further reduce resource requirements without loss of accuracy (Cai et al., 2024, Go et al., 2022).
  • Theoretical Considerations: Window shifting is a highly generic mechanism for scalable context mixing and can be used in non-attention-based layers, strongly suggesting the underlying principle—localized operations alternated with spatial shifts—exhibits wide applicability.

6. Extensions, Variants, and Domain-Specific Adaptations

The shifted window scheme has been generalized and adapted in a wide range of architectural settings:

  • 3D and Volumetric Data: SW-MSA extends naturally by shifting along depth, height, and width; feature maps are partitioned into 3D blocks and shifted along all three axes (Li et al., 2023, Imran et al., 2024, Bojesomo et al., 2022).
  • 1D and Sequence Data: For genomic and spectral analysis, rolling (cyclically shifting) the 1D sequence by half-window achieves the same cross-window communication at linear complexity (Li et al., 2023, Smith et al., 2023).
  • Non-attention Blocks: The core principle has been adopted in all-MLP and gated-MLP blocks, using shifted spatial grouping or gating instead of self-attention for each window (Go et al., 2022).
  • Hybrid/Context-Infused Schemes: Context-aware shifted windows in medical image segmentation integrate patch merging for extra global context at bottlenecks (Imran et al., 2024).
  • Resource-Efficient Adaptations: GSWA and related techniques split attention heads into groups, reduce Q/K dimensionalities, and introduce late residual sharing between groups for further memory savings (Cai et al., 2024).
  • Applications Beyond Attention: In list decoding, “shifted window” refers to dynamic adjustment of the candidate window in SCL decoding to prevent premature correct-path pruning (Rowshan et al., 2020).
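For the 1D and sequence setting above, the mechanism reduces to a roll plus a reshape; a minimal sketch assuming the sequence length is divisible by the window size W:

```python
import numpy as np

def shifted_windows_1d(seq, W, shift):
    # seq: (L, C) token array; returns (L // W, W, C) non-overlapping windows.
    # A half-window cyclic shift pairs tokens that straddled the previous
    # layer's window boundaries.
    s = W // 2 if shift else 0
    rolled = np.roll(seq, -s, axis=0)
    return rolled.reshape(-1, W, seq.shape[-1])
```

With W = 4, an unshifted layer groups tokens [0, 1, 2, 3] while the shifted layer groups [2, 3, 4, 5], so every adjacent pair shares a window within two layers.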

7. Significance and Future Directions

The shifted window scheme has had a marked influence across deep learning research, becoming synonymous with scalable, local-plus-global modeling across multiple data types and modalities. Its hierarchical inductive bias and empirically validated linear complexity have precipitated widespread adoption in settings previously limited by the quadratic cost of global attention (Liu et al., 2021, Li et al., 2023, Imran et al., 2024, Li et al., 2023).

Key identified future directions include further exploration of alternative efficient cross-window communication strategies (including convolutional and grouped mechanisms), direct application to non-attention blocks, and domain-specific adaptations for multi-modal, multi-scale, and long-sequence scenarios.

The existence of high-performing alternatives without explicit shifting—such as depthwise convolutional mixing—has reignited discussion of the minimal necessary ingredients for scalable transformer-like behaviors, suggesting a productive research direction in hybrid or simpler local-global information mixing layers (Yu et al., 2022). Nonetheless, the shifted window principle remains a central technical tool for constructing scalable, high-performing architectures in modern machine learning.


Selected references: (Liu et al., 2021, Li et al., 2023, Huang et al., 12 Apr 2025, Wang et al., 2024, Go et al., 2022, Li et al., 2023, Gu et al., 29 Jul 2025, Chen et al., 2024, Yu et al., 2022, Cai et al., 2024, Khadka et al., 10 Sep 2025, Smith et al., 2023, Bojesomo et al., 2022, Rowshan et al., 2020, Imran et al., 2024, Boulaabi et al., 20 Apr 2025).
