
Swin-Res-Net: Hybrid CNN-Transformer

Updated 8 February 2026
  • Swin-Res-Net is a hybrid neural architecture that integrates Swin Transformer blocks with residual convolutional networks to capture both local details and global context.
  • It employs a dual-path encoder with advanced fusion and skip connection techniques to align and merge features from convolutional and self-attention branches.
  • The architecture achieves state-of-the-art segmentation performance on tasks like VHR aerial road extraction and retinal vessel delineation, boosting metrics such as IoU and F1-score.

Swin-Res-Net refers to a class of deep neural architectures characterized by the integration of Swin Transformer blocks and convolutional networks, specifically residual connections (ResNet or Res2Net), within a unified encoder-decoder or contextual encoder framework. These architectures leverage the complementary strengths of CNNs—robust local feature extraction—and self-attention-based Transformers—global context modeling—via structured fusion and multi-path design. Swin-Res-Net models have demonstrated leading performance in pixel-wise segmentation tasks such as very high-resolution (VHR) aerial road extraction and retinal vascular structure delineation, achieving state-of-the-art metrics on standardized datasets by virtue of precisely engineered block- and connection-level innovations (Chen et al., 2022, Yang et al., 2024).

1. Architectural Foundations and Dual-Path Design

Swin-Res-Net architectures consistently adopt a dual-branch or two-path encoding concept, combining a convolutional path (ResNet, Res2Net) with a Swin Transformer path. Each input tensor of size $H \times W \times C$ is processed in parallel by both branches, and their outputs are subject to feature alignment and fusion before being forwarded to the next stage or decoding step.

  • Convolutional branch (ResNet/Res2Net): Implements either stacked standard residual blocks (each with 3×3 conv, batch normalization, ReLU, and identity/projection shortcuts) as in ConSwin (Chen et al., 2022), or the Res2Net module for multi-scale channel-wise partitioning and aggregated receptive fields (Yang et al., 2024). For Res2Net, the input is split into $s$ subsets, and successive subsets are processed via 3×3 convolutions recursively:

$$Y_1 = U_1,\quad Y_i = \begin{cases} \mathrm{Conv}_{3\times3}(U_i), & i = 2,\\ \mathrm{Conv}_{3\times3}(U_i + Y_{i-1}), & i > 2. \end{cases}$$

Outputs are concatenated channel-wise and projected back to $C$ channels via 1×1 conv.

  • Swin Transformer branch: Utilizes patch partitioning (e.g., 4×4 or 2×2) followed by linear embedding to tokens. The core is composed of Swin blocks, which alternate Window-based Multi-Head Self-Attention (W-MSA) and Shifted Window MSA (SW-MSA). SW-MSA cyclically shifts the window partition to enable cross-window interactions:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^T}{\sqrt{d}} + B\right)V,$$

where $d$ is the per-head dimension and $B$ is a learnable positional bias.

  • Feature Fusion: Outputs from the branches are aligned in scale (e.g., tanh compression in ConSwin) and summed:

$$Z(i, j, k) = X'(i, j, k) + \tanh\big(Y'(i, j, k)\big),\quad (i, j) \in [H] \times [W],\; k \in [1..C].$$

In advanced designs (Yang et al., 2024), Fu-blocks employ CBR layers (3×3 conv, BN, ReLU), concatenation, and channel-reduction projections, sometimes with high-order mixing (HorBlock).
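The Res2Net recurrence and the tanh-compressed sum above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: features are flat lists of floats rather than $H \times W \times C$ tensors, `conv` is any callable standing in for a 3×3 convolution, and the names `res2net_outputs`, `tanh_fuse`, and `double` are hypothetical. Which branch plays the role of $X'$ and which of $Y'$ is not stated here, so the arguments are named neutrally.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def res2net_outputs(groups: Sequence[Vector],
                    conv: Callable[[Vector], Vector]) -> List[Vector]:
    """Y_1 = U_1, Y_2 = conv(U_2), Y_i = conv(U_i + Y_{i-1}) for i > 2."""
    outputs = [list(groups[0])]              # Y_1 passes through unchanged
    for i in range(1, len(groups)):
        if i == 1:
            y = conv(list(groups[i]))        # Y_2 sees only its own subset
        else:
            mixed = [u + p for u, p in zip(groups[i], outputs[-1])]
            y = conv(mixed)                  # Y_i also sees the previous output
        outputs.append(y)
    return outputs


def tanh_fuse(x_feat: Vector, y_feat: Vector) -> Vector:
    """Z = X' + tanh(Y'): squash one path into (-1, 1), then add element-wise."""
    return [x + math.tanh(y) for x, y in zip(x_feat, y_feat)]


def double(v: Vector) -> Vector:
    """Toy stand-in for a 3x3 convolution (assumption): doubles every value."""
    return [2.0 * a for a in v]


ys = res2net_outputs([[1.0], [2.0], [3.0], [4.0]], double)
fused = tanh_fuse([0.5, -0.5], [0.0, 100.0])
```

With the toy `double` operator, later subsets accumulate contributions from earlier ones, which is how the recurrence widens the effective receptive field. In `tanh_fuse`, even a very large value on the compressed path contributes at most 1 to the sum.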

2. Encoder-Decoder Topologies and Skip Connections

Swin-Res-Net is implemented within hourglass-style encoder-decoder, U-Net-like, or autoencoding networks with ordered staged processing.

  • Encoder: Four stages with downscaling via patch merging or pooling; channels typically increase per stage (e.g., $[32, 64, 128, 256]$ in ConSwin, $[C, 2C, 4C, 8C]$ in the retinal model). Each stage repeats the dual-path fusion paradigm; block counts vary per design.
  • Decoder: Three or more upsampling stages, each using (typically 2×) transpose convolution to recover spatial resolution and halve channel depth. At each stage, decoding features are combined with encoder features through skip connections.
  • Skip connections: Implemented using feature enhancement (FeatConn, via channel pooling and 3×3 conv in ConSwin), direct concatenation, or redundancy-eliminating modules. Shape-augmented connections in ConSwin inject Sobel gradient-derived cues from the encoder bottleneck to promote boundary preservation.

| Stage | Encoder Block | Decoder Operation | Skip/Fusion Method |
|---|---|---|---|
| 1 | Conv3x3, ConSwin/Res2Net+Swin | (n/a: input) | |
| 2–4 | Patch-merge, ConSwin/FuBlock | Transpose Conv | FeatConn/RIE-processed concat |

All architectural specifics are directly grounded in (Chen et al., 2022) and (Yang et al., 2024).
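To make the stage bookkeeping concrete, the following sketch tracks feature-map shapes through a four-stage encoder and the matching decoder. It assumes every stage after the first halves the spatial resolution and doubles the channels, and that each decoder step doubles the resolution and halves the channels. The function names and the `stem_stride` parameter are hypothetical, and the 256×256 input is an arbitrary example.

```python
from typing import List, Tuple

Shape = Tuple[int, int, int]  # (height, width, channels)


def encoder_shapes(height: int, width: int, base_channels: int,
                   num_stages: int = 4, stem_stride: int = 1) -> List[Shape]:
    """Shapes after each encoder stage (assumed 2x downscale per later stage)."""
    h, w, c = height // stem_stride, width // stem_stride, base_channels
    shapes = []
    for stage in range(num_stages):
        if stage > 0:
            h, w, c = h // 2, w // 2, c * 2   # patch merging or pooling
        shapes.append((h, w, c))
    return shapes


def decoder_shapes(encoder: List[Shape]) -> List[Shape]:
    """Shapes after each 2x transpose-convolution step, starting at the bottleneck."""
    h, w, c = encoder[-1]
    shapes = []
    for _ in range(len(encoder) - 1):
        h, w, c = h * 2, w * 2, c // 2        # recover resolution, halve channels
        shapes.append((h, w, c))
    return shapes


enc = encoder_shapes(256, 256, 32)
dec = decoder_shapes(enc)
```

Under these assumptions each decoder output has the same shape as the encoder stage it is paired with, which is what lets a skip connection add or concatenate the two without further resizing.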

3. Specialized Modules and Information Processing

Innovative modules are deployed to maximize intra-stage and cross-stage information flow, reduce information loss, and suppress redundancy:

  • Shape-Augmented Connection (shapConn): In ConSwin, a 1×1 conv + sigmoid predicts a coarse segmentation at the bottleneck, followed by Sobel filtering to extract edge cues. These are projected and added to the bottleneck representations.
  • Redundant Information Elimination (RIE): Introduced in (Yang et al., 2024), it computes the stagewise absolute difference between processed encoder features and upsampled features from the subsequent deeper stage:

$$R^l = \left|\,A - \mathrm{Up}(B)\,\right|, \quad \text{with recursive refinement: } R^l_{(k)} = \left|\,\mathrm{CBR}\big(R^l_{(k-1)}\big) - \mathrm{Up}\big[\mathrm{CBR}\big(R^{l+1}_{(k-1)}\big)\big]\,\right|$$

This operation attenuates non-salient or redundant signals along the skip path.

  • Fu-Blocks: Interactive fusion modules that aggregate Swin output, Res2Net output, and (for deeper stages) prior Fu output, using a sequence of CBR and projection operations.
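The first-order RIE term, $R^l = |A - \mathrm{Up}(B)|$, can be illustrated on single-channel grids. This sketch assumes nearest-neighbor 2× upsampling for $\mathrm{Up}$ (the interpolation is not specified here) and omits the CBR layers and the recursive refinement. The helper names are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def upsample2x(grid: Grid) -> Grid:
    """Nearest-neighbor 2x upsampling (assumed choice for Up)."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def rie_difference(shallow: Grid, deep: Grid) -> Grid:
    """Element-wise |A - Up(B)| between a stage and the next deeper stage."""
    up = upsample2x(deep)
    return [[abs(a - b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(shallow, up)]


shallow = [[1.0, 2.0], [3.0, 4.0]]   # toy encoder feature A at level l
deep = [[2.0]]                       # toy feature B at level l + 1
residual = rie_difference(shallow, deep)
```

Positions where the shallow feature already agrees with the upsampled deeper feature produce values near zero, so only the detail that the deeper stage does not carry is passed along the skip path.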

4. Formulation of Attention and Fusion Mechanisms

The attention and fusion operations used by Swin-Res-Net are formulated as follows:

  • Attention Formulation: Within each $M \times M$ window or shifted window, queries $Q$, keys $K$, and values $V$ are obtained via learned linear projections. The attention computation is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^T}{\sqrt{d}} + B\right)V$$

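Because the window partition is shifted between consecutive blocks, tokens that fall in different windows in one block can share a window in the next, so context propagates across window boundaries as depth increases.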

  • Feedforward Layer: After self-attention, the two-layer MLP is formulated as:

$$\mathrm{MLP}(x) = W_2\,\mathrm{GELU}(W_1 x) + b_2$$

  • Feature Fusion Across Paths: Compression functions (e.g., tanh) or channel-aligned concatenation followed by 1×1 conv are employed to address magnitude and statistical distribution disparities between convolutional and Transformer paths.
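The attention formula above can be traced for a single head inside one window with plain Python lists. In this minimal sketch the bias $B$ is passed in as a full token-by-token matrix (in Swin it is looked up from a learned relative-position table), and the learned projections that produce $Q$, $K$, and $V$ are omitted. The function names and toy values are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def window_attention(q: Matrix, k: Matrix, v: Matrix, bias: Matrix) -> Matrix:
    """Softmax(Q K^T / sqrt(d) + B) V for the tokens of a single window."""
    d = len(q[0])                                  # per-head dimension
    scale = 1.0 / math.sqrt(d)
    out = []
    for i, q_i in enumerate(q):
        logits = [scale * sum(a * b for a, b in zip(q_i, k_j)) + bias[i][j]
                  for j, k_j in enumerate(k)]
        weights = softmax(logits)
        out.append([sum(w * v_j[c] for w, v_j in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


q = [[0.0, 0.0], [0.0, 0.0]]          # zero queries give uniform weights
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 0.0], [3.0, 4.0]]
no_bias = [[0.0, 0.0], [0.0, 0.0]]
masked = [[0.0, -1e9], [0.0, -1e9]]   # a large negative bias hides token 2

uniform_out = window_attention(q, k, v, no_bias)
masked_out = window_attention(q, k, v, masked)
```

The second call shows how an additive term inside the softmax steers attention: a large negative entry removes a token from consideration, while a learned bias applies the same mechanism with finite values.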

5. Layer-Wise Configuration and Hyperparameters

Explicit configuration parameters, as reported in the principal sources, are summarized below:

| Component | ConSwin (Chen et al., 2022) | Retinal model (Yang et al., 2024) |
|---|---|---|
| Stages (Encoder) | 4 | 4 |
| Channel dims | [32, 64, 128, 256] | [C, 2C, 4C, 8C] |
| Swin window size | 7 | (not specified; typical 7 or 8) |
| Attention heads | [4, 8, 16, 32] | (set per dim, not specified) |
| Patch partition | Initial 3×3 conv, 2×2 merge | 4×4 partition |
| Dual blocks/stage | [1, 2, 2, 2] (ConSwin) | [2, 2, 6, 2] (Swin), [4, 6, 9, 2] (Res2Net) |
| Optimizer | Adam, lr = $2\times10^{-4}$ | Adam, lr = $1\times10^{-4}$ |
| Loss | Weighted cross-entropy + smooth-L1 + BCE (with learnable uncertainties) | Binary cross-entropy |

Total parameter count for ConSwin is approximately 47 million (Chen et al., 2022).
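The ConSwin column can be captured as a small configuration object, which also allows a quick consistency check on the listed values. The class and field names below are hypothetical and do not come from either paper's code. The per-head dimension is computed as channels divided by heads, which assumes the stage channel width is the attention embedding width. The retinal model is not instantiated because several of its values are not specified.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class HybridConfig:
    """Hypothetical container for the hyperparameters listed above."""
    name: str
    encoder_stages: int
    channel_dims: Tuple[int, ...]
    window_size: Optional[int]
    attention_heads: Optional[Tuple[int, ...]]
    blocks_per_stage: Tuple[int, ...]
    learning_rate: float
    optimizer: str = "Adam"

    def head_dims(self) -> Optional[Tuple[int, ...]]:
        """Per-head dimension per stage, assuming channels split evenly over heads."""
        if self.attention_heads is None:
            return None
        return tuple(c // h for c, h in
                     zip(self.channel_dims, self.attention_heads))


CONSWIN = HybridConfig(
    name="ConSwin",
    encoder_stages=4,
    channel_dims=(32, 64, 128, 256),
    window_size=7,
    attention_heads=(4, 8, 16, 32),
    blocks_per_stage=(1, 2, 2, 2),
    learning_rate=2e-4,
)

consw_head_dims = CONSWIN.head_dims()
```

Under that assumption the listed channel widths and head counts give the same per-head dimension at every stage, because the number of heads doubles together with the channel width.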

6. Empirical Performance and Applications

Swin-Res-Net has been reported to exceed prior baselines in VHR road extraction and retinal vessel segmentation. On the Massachusetts and CHN6-CUG road datasets, ConSwin achieves superior overall accuracy, IoU, and F1-score. In retinal vessel segmentation, Swin-Res-Net provides AUCs of 0.9956/0.9931/0.9946 and F1-scores up to 0.8665 on CHASE-DB1, DRIVE, and STARE, with gains of 3–10 points in IoU and F1 over previous state-of-the-art models (U-Net++, CS-Net, RV-GAN, FR-U-Net) (Yang et al., 2024).
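For reference, the IoU and F1 metrics cited above are computed from pixel-level confusion counts using their standard definitions, as sketched below. The counts in the example are made up for illustration and are not results from either paper.

```python
def iou(tp: int, fp: int, fn: int) -> float:
    """Intersection over union: TP / (TP + FP + FN)."""
    denom = tp + fp + fn
    if denom == 0:
        raise ValueError("IoU is undefined when there are no positive pixels")
    return tp / denom


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 (Dice) score: 2 TP / (2 TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    if denom == 0:
        raise ValueError("F1 is undefined when there are no positive pixels")
    return 2 * tp / denom


# Hypothetical counts for one binary mask (not taken from the cited results).
tp, fp, fn = 80, 10, 10
example_iou = iou(tp, fp, fn)
example_f1 = f1_score(tp, fp, fn)
```

The two metrics are monotonically related through F1 = 2·IoU / (1 + IoU), so F1 is never lower than IoU for the same prediction, and an equal gain in points on both metrics is not expected in general.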

The observed improvements are attributed to:

  • Swin’s global context modeling through windowed self-attention
  • Enhanced multi-scale aggregation from Res2Net blocks
  • Effective dual-path and interactive fusion strategies
  • Redundancy suppression modules enhancing effective information throughput

7. Significance and Future Directions

Swin-Res-Net architectures represent an established direction in hybrid computer vision encoders for dense prediction. By bridging localized convolutions and hierarchical self-attention via systematic dual-path engineering, Swin-Res-Net realizes both fine-grained detail and global structural reasoning, yielding state-of-the-art results in domains where segmentation of foreground objects is limited by occlusion, class similarity, or subtle topology.

Plausible future directions include further generalizing multi-path fusion strategies, adopting adaptive or learned redundancy suppression between encoder and decoder, and extending the Swin-Res-Net paradigm to broader semantic segmentation and medical imaging domains.

Principal references: (Chen et al., 2022, Yang et al., 2024)
