Swin-Res-Net: Hybrid CNN-Transformer

Updated 8 February 2026

Swin-Res-Net is a hybrid neural architecture that integrates Swin Transformer blocks with residual convolutional networks to capture both local details and global context.
It employs a dual-path encoder with advanced fusion and skip connection techniques to align and merge features from convolutional and self-attention branches.
The architecture achieves state-of-the-art segmentation performance on tasks like VHR aerial road extraction and retinal vessel delineation, boosting metrics such as IoU and F1-score.

Swin-Res-Net refers to a class of deep neural architectures characterized by the integration of Swin Transformer blocks and convolutional networks, specifically residual connections (ResNet or Res2Net), within a unified encoder-decoder or contextual encoder framework. These architectures leverage the complementary strengths of CNNs—robust local feature extraction—and self-attention-based Transformers—global context modeling—via structured fusion and multi-path design. Swin-Res-Net models have demonstrated leading performance in pixel-wise segmentation tasks such as very high-resolution (VHR) aerial road extraction and retinal vascular structure delineation, achieving state-of-the-art metrics on standardized datasets by virtue of precisely engineered block- and connection-level innovations (Chen et al., 2022, Yang et al., 2024).

1. Architectural Foundations and Dual-Path Design

Swin-Res-Net architectures consistently adopt a dual-branch or two-path encoding concept, combining a convolutional path (ResNet, Res2Net) with a Swin Transformer path. Each input tensor of size $H \times W \times C$ is processed in parallel by both branches, and their outputs are subject to feature alignment and fusion before being forwarded to the next stage or decoding step.

Convolutional branch (ResNet/Res2Net): Implements either stacked standard residual blocks (each with 3 $\times$ 3 conv, batch normalization, ReLU, and identity/projection shortcuts) as in ConSwin (Chen et al., 2022), or the Res2Net module for multi-scale channel-wise partitioning and aggregated receptive fields (Yang et al., 2024). For Res2Net, the input is split into $s$ subsets, and successive subsets are processed via 3 $\times$ 3 convolutions recursively:

$Y_i = \begin{cases} U_i, & i = 1, \\ \mathrm{Conv}_{3\times3}(U_i), & i = 2, \\ \mathrm{Conv}_{3\times3}(U_i + Y_{i-1}), & 2 < i \le s. \end{cases}$

Outputs are concatenated channel-wise and projected back to $C$ channels via 1 $\times$ 1 conv.
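The hierarchical split can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `conv3x3_same` and `res2net_split` are hypothetical, a single fixed 3 $\times$ 3 kernel is shared across all channels (real Res2Net blocks learn channel-mixing filters), and the final 1 $\times$ 1 projection is omitted.

```python
import numpy as np

def conv3x3_same(x, kernel):
    """3x3 'same' cross-correlation on a (H, W, C) array with zero padding.

    Simplification: the same kernel is applied to every channel independently.
    """
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros(x.shape, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w, :]
    return out

def res2net_split(x, scales, kernel):
    """Process `scales` channel groups hierarchically, Res2Net style."""
    groups = np.split(x, scales, axis=-1)          # U_1 ... U_s (C must divide evenly)
    outputs = [groups[0]]                          # Y_1 = U_1 (passed through)
    for i in range(1, scales):
        # Y_2 sees only U_2; later groups also receive the previous output.
        inp = groups[i] if i == 1 else groups[i] + outputs[-1]
        outputs.append(conv3x3_same(inp, kernel))
    return np.concatenate(outputs, axis=-1)        # 1x1 projection omitted here

# With an identity kernel the accumulation across groups is easy to follow.
identity = np.zeros((3, 3))
identity[1, 1] = 1.0
y = res2net_split(np.ones((4, 4, 8)), scales=4, kernel=identity)
print(y.shape)  # (4, 4, 8)
```

With the identity kernel, the four groups carry the values 1, 1, 2, and 3, which shows how later groups accumulate the outputs of earlier ones and therefore see progressively larger receptive fields once real filters are used.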

Swin Transformer branch: Utilizes patch partitioning (e.g., 4 $\times$ 4 or 2 $\times$ 2) followed by linear embedding to tokens. The core is composed of Swin blocks—alternating stacked blocks with Window-based Multi-Head Self-Attention (W-MSA) and Shifted Window MSA (SW-MSA). SW-MSA cyclically shifts the window partition to enable cross-window interactions:

$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^T}{\sqrt{d}} + B\right)V,$

where $d$ is the per-head dimension and $B$ is the learnable positional bias.
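To make the window layout concrete, the following is a small NumPy sketch of regular and shifted window partitioning. The helper names and the 8 $\times$ 8 toy feature map are assumptions for illustration, and the attention mask that Swin applies to tokens wrapped around by the cyclic shift is left out.

```python
import numpy as np

def window_partition(x, m):
    """Split a (H, W, C) feature map into non-overlapping m x m windows.

    Returns an array of shape (num_windows, m * m, C). H and W must be
    multiples of m.
    """
    h, w, c = x.shape
    x = x.reshape(h // m, m, w // m, m, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, m * m, c)

def shifted_window_partition(x, m):
    """Cyclically shift the map by m // 2 in both axes, then partition."""
    shift = m // 2
    shifted = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(shifted, m)

# Toy map whose values encode the original position: value = 8 * row + col.
feat = np.arange(8 * 8).reshape(8, 8, 1)
regular = window_partition(feat, m=4)
shifted = shifted_window_partition(feat, m=4)

# Which regular windows do the tokens of the first shifted window come from?
vals = shifted[0, :, 0]
source_windows = sorted(set(((vals // 8) // 4 * 2 + (vals % 8) // 4).tolist()))
print(source_windows)  # [0, 1, 2, 3]
```

The first shifted window draws tokens from all four regular windows, which is how alternating W-MSA and SW-MSA blocks exchange information across window boundaries.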

Feature Fusion: Outputs from the branches are aligned in scale by a compression function $\phi$ (e.g., tanh in ConSwin) and summed element-wise:

$F_{\mathrm{fused}} = \phi(F_{\mathrm{conv}}) + \phi(F_{\mathrm{swin}})$

In advanced designs (Yang et al., 2024), Fu-blocks employ CBR layers (3 $\times$ 3 conv, BN, ReLU), concatenation, and channel-reduction projections, sometimes with high-order mixing (HorBlock).
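A minimal sketch of the scale-aligned sum follows. Applying tanh to both branches before the addition is an assumption made here for illustration; the function name `fuse_tanh_sum` and the toy magnitudes are hypothetical.

```python
import numpy as np

def fuse_tanh_sum(conv_feat, swin_feat):
    """Compress both branch outputs into (-1, 1) with tanh, then add them.

    Assumption: tanh is applied to each branch; the source only states that
    outputs are aligned in scale and summed.
    """
    return np.tanh(conv_feat) + np.tanh(swin_feat)

# Branches with very different magnitudes (illustrative values).
conv_feat = 50.0 * np.ones((4, 4, 8))
swin_feat = 0.1 * np.ones((4, 4, 8))

naive = conv_feat + swin_feat                 # dominated by the convolutional branch
fused = fuse_tanh_sum(conv_feat, swin_feat)   # every entry is bounded by 2 in magnitude
```

Without compression the sum is dominated by whichever branch has the larger activations; after compression each branch contributes at most 1 in magnitude per element.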

2. Encoder-Decoder Topologies and Skip Connections

Swin-Res-Net is implemented within hourglass-style encoder-decoder, U-Net-like, or autoencoding networks with ordered staged processing.

Encoder: Four stages with downscaling via patch merging or pooling; channels typically increase per stage (e.g., $[32, 64, 128, 256]$ in ConSwin, $[C, 2C, 4C, 8C]$ in the retinal model). Each stage repeats the dual-path fusion paradigm; block counts vary per design.
Decoder: Three or more upsampling stages, each using a (typically 2$\times$) transpose convolution to recover spatial resolution and halve channel depth. At each stage, decoder features are combined with encoder features through skip connections.
Skip connections: Implemented using feature enhancement (FeatConn, via channel pooling and a 3 $\times$ 3 conv in ConSwin), direct concatenation, or redundancy-eliminating modules. Shape-augmented connections in ConSwin inject Sobel gradient-derived cues from the encoder bottleneck to promote boundary preservation.
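The stage bookkeeping described above can be traced with a short shape-only sketch. It assumes a 256 $\times$ 256 input, a first stage that keeps the input resolution, 2$\times$ downscaling between stages, and channel doubling starting from 32; the function names are hypothetical and no actual layers are built.

```python
def encoder_shapes(height, width, base_channels, stages=4):
    """List (H, W, C) per encoder stage: halve H and W, double C each stage."""
    shapes = []
    h, w, c = height, width, base_channels
    for _ in range(stages):
        shapes.append((h, w, c))
        h, w, c = h // 2, w // 2, c * 2
    return shapes

def decoder_shapes(enc):
    """Walk back up from the deepest stage: double H and W, halve C each step."""
    h, w, c = enc[-1]
    shapes = []
    for skip in reversed(enc[:-1]):
        h, w, c = h * 2, w * 2, c // 2
        # The upsampled feature must match the skip feature it is merged with.
        assert (h, w, c) == skip
        shapes.append((h, w, c))
    return shapes

enc = encoder_shapes(256, 256, 32)
dec = decoder_shapes(enc)
print(enc)  # [(256, 256, 32), (128, 128, 64), (64, 64, 128), (32, 32, 256)]
print(dec)  # [(64, 64, 128), (128, 128, 64), (256, 256, 32)]
```

Four encoder stages give three decoder steps, and each decoder output lines up with the encoder feature of the same resolution, which is what makes the skip connections shape-compatible.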

Stage	Encoder Block	Decoder Operation	Skip/Fusion Method
1	Conv3x3, ConSwin/Res2Net+Swin	(n/a: input)	—
2–4	Patch-merge, ConSwin/FuBlock	Transpose Conv	FeatConn/RIE-processed concat

All architectural specifics are directly grounded in (Chen et al., 2022) and (Yang et al., 2024).

3. Specialized Modules and Information Processing

Innovative modules are deployed to maximize intra-stage and cross-stage information flow, reduce information loss, and suppress redundancy:

Shape-Augmented Connection (shapConn): In ConSwin, a 1 $\times$ 1 conv + sigmoid predicts a coarse segmentation at the bottleneck, followed by Sobel filtering to extract edge cues. These are projected and added to the bottleneck representations.
Redundant Information Elimination (RIE): Introduced in (Yang et al., 2024), it computes the stagewise absolute difference between the processed encoder features $E_i$ at stage $i$ and the upsampled features $\mathrm{Up}(F_{i+1})$ from the next deeper stage:

$\mathrm{RIE}_i = \left| E_i - \mathrm{Up}(F_{i+1}) \right|$

This operation attenuates non-salient or redundant signals along the skip path.

Fu-Blocks: Interactive fusion modules that aggregate Swin output, Res2Net output, and (for deeper stages) prior Fu output, using a sequence of CBR and projection operations.
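Two of the modules above reduce to simple array operations, sketched here in NumPy under stated assumptions. For the shape cue, the Sobel gradient magnitude of a coarse probability map is computed with zero padding. For RIE, nearest-neighbour 2$\times$ upsampling is assumed and the deeper features are taken to already have the same channel count as the encoder features. All function names are hypothetical.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(img, kernel):
    """3x3 cross-correlation with zero padding on a 2-D (H, W) array."""
    h, w = img.shape
    padded = np.pad(img, 1)
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def sobel_edges(prob_map):
    """Gradient magnitude of a coarse probability map, used as an edge cue."""
    gx = filter3x3(prob_map, SOBEL_X)
    gy = filter3x3(prob_map, SOBEL_Y)
    return np.sqrt(gx ** 2 + gy ** 2)

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (H, W, C) array (an assumption)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def rie(encoder_feat, deeper_feat):
    """Absolute difference between encoder features and upsampled deeper features."""
    return np.abs(encoder_feat - upsample2x(deeper_feat))

# Shape cue: a vertical step edge between columns 2 and 3.
prob = np.zeros((6, 6))
prob[:, 3:] = 1.0
edges = sobel_edges(prob)        # large next to the step, zero in flat interior regions

# RIE: identical content cancels; only the differing entry survives.
enc_feat = np.ones((4, 4, 2))
enc_feat[0, 0, 0] = 3.0
residual = rie(enc_feat, np.ones((2, 2, 2)))
```

In the RIE example everything the deeper stage already explains is zeroed out, leaving only the entry where the encoder feature differs. Note that zero padding also produces responses along the image border in `sobel_edges`.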

4. Formulation of Attention and Fusion Mechanisms

Central to Swin-Res-Net's success are mathematically precise fusion and attention operations:

Attention Formulation: Within each $M \times M$ window or shifted window, queries $Q$, keys $K$, and values $V$ are obtained via learned linear projections. The attention computation is:

$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^T}{\sqrt{d}} + B\right)V$

The shifted-window mechanism lets information propagate between neighboring windows in successive blocks, so the effective context extends beyond a single window.

Feedforward Layer: After self-attention, the two-layer MLP is formulated as:

$\mathrm{MLP}(X) = \mathrm{GELU}(X W_1 + b_1)\, W_2 + b_2$
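The attention and feedforward steps can be sketched for a single window and a single head in NumPy. This is an illustration rather than the papers' code: the weights are random, the positional bias is set to zero, layer normalization and residual connections are omitted, and the GELU activation and 4$\times$ hidden expansion follow the standard Swin Transformer design, which is assumed here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(tokens, w_q, w_k, w_v, bias):
    """Single-head attention over the tokens of one window.

    tokens: (N, C); w_q, w_k, w_v: (C, d); bias: (N, N) positional bias.
    Returns the attended tokens (N, d) and the attention weights (N, N).
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d) + bias)
    return weights @ v, weights

def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def mlp(x, w1, b1, w2, b2):
    """Two-layer feedforward network applied to each token independently."""
    return gelu(x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
n, c, d = 16, 8, 8                      # 4x4 window, 8 channels (illustrative sizes)
tokens = rng.normal(size=(n, c))
w_q, w_k, w_v = (rng.normal(size=(c, d)) for _ in range(3))
attended, weights = window_attention(tokens, w_q, w_k, w_v, bias=np.zeros((n, n)))
out = mlp(attended, rng.normal(size=(d, 4 * d)), np.zeros(4 * d),
          rng.normal(size=(4 * d, d)), np.zeros(d))
```

Each row of `weights` sums to one, so every output token is a convex combination of the value vectors inside its window; the MLP then transforms each token without mixing positions.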

Feature Fusion Across Paths: Compression functions (e.g., tanh) or channel-aligned concatenation followed by 1 $\times$ 1 conv are employed to address magnitude and statistical distribution disparities between convolutional and Transformer paths.
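The concatenation route can be sketched as follows. A 1 $\times$ 1 convolution is a per-pixel linear map over channels, so it is written here as a matrix product; the function names and the hand-set averaging weights are illustrative assumptions, whereas a trained model would learn the projection.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution on a (H, W, C_in) array: a per-pixel linear map to C_out."""
    return x @ weight                                            # weight: (C_in, C_out)

def fuse_concat(conv_feat, swin_feat, weight):
    """Concatenate both branches along channels, then project back to C channels."""
    stacked = np.concatenate([conv_feat, swin_feat], axis=-1)    # (H, W, 2C)
    return conv1x1(stacked, weight)                              # (H, W, C)

c = 4
conv_feat = np.ones((8, 8, c))
swin_feat = 3.0 * np.ones((8, 8, c))
# Averaging weights: output channel j = 0.5 * conv[j] + 0.5 * swin[j].
weight = 0.5 * np.concatenate([np.eye(c), np.eye(c)], axis=0)
fused = fuse_concat(conv_feat, swin_feat, weight)
print(fused.shape)  # (8, 8, 4)
```

Unlike a plain sum, the learned projection can weight each branch per channel, which is one way to absorb differences in magnitude between the two paths.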

5. Layer-Wise Configuration and Hyperparameters

Explicit configuration parameters, as reported in the principal sources, are summarized below:

Component	(Chen et al., 2022) (ConSwin)	(Yang et al., 2024) (Retina)
Stages (Encoder)	4	4
Channel dims	[32, 64, 128, 256]	[C, 2C, 4C, 8C]
Swin window size	7	(not specified; typical 7 or 8)
Attention heads	[4, 8, 16, 32]	(set per dim, not specified)
Patch partition	Initial 3 $\times$ 3 conv, 2 $\times$ 2 merge	4 $\times$ 4 partition
Dual blocks/stage	[1, 2, 2, 2] ConSwin	[2, 2, 6, 2] Swin, [4, 6, 9, 2] Res2Net
Optimizer	Adam, lr = [value missing]	Adam, lr = [value missing]
Loss	Weighted cross-entropy + smooth-L1 + BCE (with learnable uncertainties)	Binary cross-entropy

Total parameter count for ConSwin is approximately 47 million (Chen et al., 2022).
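For readers who want the ConSwin column in machine-readable form, the following is a hedged sketch of a configuration object. The class and field names are hypothetical, only values listed in the table are filled in, and the learning rate is left unset because no value is available here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class ConSwinConfig:
    """Values from the ConSwin column of the table; names are illustrative."""
    channel_dims: Tuple[int, ...] = (32, 64, 128, 256)
    num_heads: Tuple[int, ...] = (4, 8, 16, 32)
    blocks_per_stage: Tuple[int, ...] = (1, 2, 2, 2)
    window_size: int = 7
    optimizer: str = "adam"
    learning_rate: Optional[float] = None   # not available here; must be supplied

    def head_dims(self):
        """Per-head dimension at each stage (channels divided by heads)."""
        return tuple(c // h for c, h in zip(self.channel_dims, self.num_heads))

    def tokens_per_window(self):
        """Number of tokens attended over inside one window."""
        return self.window_size ** 2

cfg = ConSwinConfig()
print(cfg.head_dims())          # (8, 8, 8, 8)
print(cfg.tokens_per_window())  # 49
```

The listed channel and head counts keep the per-head dimension constant at 8 across all four stages, so deeper stages add heads rather than widening each head.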

6. Empirical Performance and Applications

Swin-Res-Net has been demonstrated to exceed prior baselines in VHR road extraction and retinal vessel segmentation. On Massachusetts and CHN6-CUG for roads, ConSwin achieves superior overall accuracy, IoU, and F1-score. In retinal vessel segmentation, Swin-Res-Net attains AUCs of 0.9956/0.9931/0.9946 and F1-scores up to 0.8665 on CHASE-DB1, DRIVE, and STARE, with gains of 3–10 points in IoU and F1 over previous state-of-the-art models (U-Net++, CS-Net, RV-GAN, FR-U-Net) (Yang et al., 2024).
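For reference, IoU and F1 for a binary mask follow directly from the pixel-level confusion counts. The sketch below uses a toy 2 $\times$ 3 mask, not data from either paper, and assumes at least one positive pixel in the prediction or the target so that the denominators are non-zero.

```python
import numpy as np

def iou_and_f1(pred, target):
    """IoU and F1 of two binary masks from true/false positive and negative counts."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = np.logical_and(pred, target).sum()     # predicted and actually foreground
    fp = np.logical_and(pred, ~target).sum()    # predicted foreground, actually background
    fn = np.logical_and(~pred, target).sum()    # missed foreground
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return float(iou), float(f1)

pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
iou, f1 = iou_and_f1(pred, target)   # tp = 2, fp = 1, fn = 1
print(iou)  # 0.5
```

Here F1 is 4/6, higher than the IoU of 2/4. In general F1 = 2·IoU / (1 + IoU), so the two metrics rank models identically but differ in scale, which matters when comparing "points" of improvement across them.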

The observed improvements are attributed to:

Swin’s global context modeling through windowed self-attention
Enhanced multi-scale aggregation from Res2Net blocks
Effective dual-path and interactive fusion strategies
Redundancy suppression modules enhancing effective information throughput

7. Significance and Future Directions

Swin-Res-Net architectures represent an established direction in hybrid computer vision encoders for dense prediction. By bridging localized convolutions and hierarchical self-attention via systematic dual-path engineering, Swin-Res-Net realizes both fine-grained detail and global structural reasoning, yielding state-of-the-art results in domains where segmentation of foreground objects is limited by occlusion, class similarity, or subtle topology.

A plausible future direction is to generalize multi-path fusion strategies further, adopt adaptive or learned redundancy suppression between encoder and decoder, and extend the Swin-Res-Net paradigm to broader semantic segmentation and medical imaging domains.

Principal references: (Chen et al., 2022, Yang et al., 2024)