Offset-Adjusted Mask2Former

Updated 29 December 2025

The paper introduces algebraic offset adjustment strategies within deformable attention and reports substantial segmentation gains on small anatomical structures (+13.6 mean Dice over nnU-Net on HaN-Seg).
It integrates a fourth-stage CNN feature map as a coarse spatial prior to guide attention towards compact organs and reduce irrelevant background influence.
An auxiliary FCN segmentation head with Dice loss is employed to reinforce foreground learning, mitigate background distractions, and accelerate model convergence.

Offset-Adjusted Mask2Former is a transformer-based segmentation framework designed to enhance accuracy for mid-sized and small organ segmentation in medical images. Building upon Mask2Former with deformable attention modules, this approach introduces offset adjustment strategies, leverages the fourth CNN feature map for a coarse organ location prior, and adds a fully convolutional network (FCN) auxiliary head with Dice loss. These architectural innovations specifically address the unreliable sampling patterns and convergence challenges encountered when segmenting small, compact anatomical structures using generic transformer architectures (Zhang et al., 6 Jun 2025).

1. Baseline Framework and Deformable Attention

The foundation of Offset-Adjusted Mask2Former is Mask2Former, which feeds multi-scale CNN backbone features into a transformer decoder for universal segmentation. Instead of prohibitively expensive dense attention over all $H \times W$ pixels, Mask2Former uses the deformable attention mechanism introduced in Deformable DETR, in which each query $q \in \mathbb{R}^C$ attends to only $K$ sampling points per head and feature level, across $H$ attention heads and $L$ feature levels. (In the formula below, $H$ counts heads rather than image height.)

The multi-scale deformable attention output for a query $q$ at spatial reference $p_q$ is:

$y_q = \mathrm{MSDeformAttn}(q, \{F_l\}) = \sum_{h=1}^{H} \sum_{l=1}^{L} \sum_{k=1}^{K} \alpha_{h,l,k}(q) \, W_v F_l(p_q + \Delta p_{h,l,k}(q))$

Here:

$F_l \in \mathbb{R}^{C \times H_l \times W_l}$ is the $l$-th feature map from the CNN backbone.
$\Delta p_{h,l,k}(q)$ is the learned offset.
$\alpha_{h,l,k}(q)$ is the attention weight.
$W_v$ projects the sampled feature into the decoder embedding space.

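This approach reduces self- and cross-attention cost from quadratic in the number of spatial positions (every query compared with all $H \times W$ pixels) to linear (every query compared with a fixed number of sampled points), making the framework tractable for large-scale and 2D/3D hybrid medical image inputs.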
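To make the sampling formula concrete, here is a minimal plain-Python sketch of deformable attention for one query, restricted to a single head and a single feature level. It is an illustration only: the function names are hypothetical, $W_v$ is taken to be the identity, coordinates are in pixels, and samples outside the map are treated as zero. A practical implementation would use batched tensor operations instead.

```python
import math

def bilinear_sample(feat, x, y):
    """Bilinearly sample feat[y][x] (a list of C channels) at continuous (x, y).

    Positions outside the map contribute zero (an assumption of this sketch).
    """
    h, w, c = len(feat), len(feat[0]), len(feat[0][0])
    x0, y0 = math.floor(x), math.floor(y)
    out = [0.0] * c
    for yi, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yi < h and 0 <= xi < w:
                for ch in range(c):
                    out[ch] += wy * wx * feat[yi][xi][ch]
    return out

def deform_attn_single(feat, p_q, offsets, logits):
    """One head, one level: sum_k alpha_k * F(p_q + delta_p_k), with W_v = identity."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    weights = [e / z for e in exps]  # attention weights alpha_k, normalised over K points
    y_q = [0.0] * len(feat[0][0])
    for (dx, dy), a in zip(offsets, weights):
        sample = bilinear_sample(feat, p_q[0] + dx, p_q[1] + dy)
        for ch in range(len(y_q)):
            y_q[ch] += a * sample[ch]
    return y_q

# Toy 4x4 single-channel map whose value is x + 10*y.
feat = [[[float(x + 10 * y)] for x in range(4)] for y in range(4)]
print(deform_attn_single(feat, (1.0, 1.0), [(0.5, 0.0), (0.0, 0.5)], [0.0, 0.0]))
```

The query touches only $K = 2$ locations here, regardless of how large the feature map is, which is where the linear cost comes from.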

2. Offset Adjustment Strategies for Compact Organ Segmentation

Left unconstrained, offset sampling in Mask2Former often causes queries for small organs to attend to irrelevant background. Offset-Adjusted Mask2Former introduces three per-point algebraic strategies that constrain the learned raw offsets $\Delta p_{h,l,k}(q)$ before they are used:

Threshold Clipping: For each offset vector $\Delta p_{h,l,k}(q)$, a piecewise rule with a fixed threshold and a fixed divisor limits the offset magnitude.

Softmax Retraction: Each offset is rescaled by a softmax-derived weight, so that larger offset magnitudes are down-weighted.

Scaled Softmax (best in practice): The softmax weight from above is combined with an additional scale factor, whose best value was found empirically.

The closed-form expressions and constants of the three strategies are not reproduced here. In all cases, the adjusted offset replaces the raw offset $\Delta p_{h,l,k}(q)$ in deformable attention. The third strategy ("Sigmoid*2") gave the best convergence and segmentation quality for compact anatomical targets.
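The closed forms of the three strategies are not legible in this text, so the following is only one possible reading of the descriptions, not the paper's formulas. Everything here is an assumption made for illustration: the function names, the rule "divide by `divisor` when the norm exceeds `tau`", the use of a softmax over negative offset norms, and all default values.

```python
import math

def threshold_clip(offset, tau=1.0, divisor=2.0):
    """ASSUMED rule: shrink an offset by `divisor` when its norm exceeds `tau`."""
    dx, dy = offset
    if math.hypot(dx, dy) > tau:
        return (dx / divisor, dy / divisor)
    return (dx, dy)

def softmax_retraction(offsets, scale=1.0):
    """ASSUMED rule: weight each of the K offsets by softmax(-norm), times `scale`.

    Larger offsets receive smaller weights, so they are pulled in more strongly.
    scale=1.0 mimics plain retraction; scale>1 mimics the scaled variant.
    """
    norms = [math.hypot(dx, dy) for dx, dy in offsets]
    m = min(norms)
    exps = [math.exp(-(n - m)) for n in norms]  # shift by the min for stability
    z = sum(exps)
    return [(scale * e / z * dx, scale * e / z * dy)
            for e, (dx, dy) in zip(exps, offsets)]

raw = [(1.0, 0.0), (3.0, 0.0)]
print([threshold_clip(o) for o in raw])
print(softmax_retraction(raw))
print(softmax_retraction(raw, scale=2.0))
```

Under these assumptions the far offset is shrunk by a larger factor than the near one, which matches the stated goal of keeping sampling points close to the query for compact organs.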

3. Fourth-Stage Feature Map as Coarse Location Prior

Conventional Mask2Former uses only the first three CNN stages (feature levels $l = 1, 2, 3$) as encoder memory, discarding the fourth, deepest feature map. Offset-Adjusted Mask2Former incorporates the fourth-stage feature $F_4$ to generate a coarse spatial prior that distinguishes organ from background:

$F_4$ is processed by two convolutional layers (with ReLU activations) to produce a coarse prior map.
The prior map is flattened and concatenated with the memory tokens from levels $l = 1, 2, 3$.
In the decoder, MSDeformAttn is extended so that, after the standard output $y_q$, a secondary output is computed by attending to the prior map using the level-4 offsets and weights. The defining expression and its default hyperparameter are not reproduced here.

This steers query attention toward likely organ regions, which is especially beneficial for compact structures.
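Flattening a feature map and concatenating it with existing memory tokens amounts to appending one more level to the token sequence. The sketch below shows the bookkeeping this implies, following the level-start-index convention used by multi-scale deformable attention implementations. The helper name and the example shapes are hypothetical and do not come from the paper.

```python
def flatten_levels(level_shapes):
    """Return (start_indices, total_tokens) for concatenated flattened levels.

    Each (H_l, W_l) map becomes H_l * W_l tokens; levels are appended in order.
    """
    starts, total = [], 0
    for h, w in level_shapes:
        starts.append(total)
        total += h * w
    return starts, total

# Hypothetical shapes: three memory levels plus the coarse prior map as a fourth.
memory_levels = [(64, 64), (32, 32), (16, 16)]
prior_level = (8, 8)
starts, total = flatten_levels(memory_levels + [prior_level])
print(starts, total)
```

The last start index marks where the prior tokens begin, so a decoder can address them as a separate level when computing the secondary output.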

4. Auxiliary FCN Head and Dice Loss Integration

To further mitigate background distraction and accelerate training, Offset-Adjusted Mask2Former adds a lightweight auxiliary FCN segmentation head on top of an intermediate feature map:

Architecture: Two convolutional layers, followed by bilinear upsampling to match the input dimensions.
Output: Coarse segmentation masks, one per organ class plus background.
Loss: Class-wise Dice loss, evaluated separately for each class. Kernel sizes, channel widths, and the exact Dice expression are not reproduced here.

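The final loss sums the standard Mask2Former objective (per-query class and mask losses) and a weighted auxiliary Dice loss:

$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{Mask2Former}} + \lambda_{\mathrm{aux}} \, \mathcal{L}_{\mathrm{aux}}$

where $\lambda_{\mathrm{aux}}$ is the weight of the auxiliary term.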

This auxiliary pathway both constrains the main transformer and directly reinforces learning from likely foreground.
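As a rough illustration of the auxiliary objective, the sketch below implements a common soft Dice loss and the weighted sum with the main loss. The specific Dice form, the smoothing constant `eps`, the averaging over classes, and the `aux_weight` default are all assumptions of this sketch, since the paper's exact expression and constants are not given in this text. The function names are hypothetical.

```python
def dice_loss(pred, target, eps=1e-5):
    """Soft Dice loss for one class (ASSUMED form).

    pred: flattened foreground probabilities; target: flattened 0/1 labels.
    """
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def classwise_dice_loss(preds, targets, eps=1e-5):
    """Average the per-class Dice losses (averaging is an assumption)."""
    losses = [dice_loss(p, t, eps) for p, t in zip(preds, targets)]
    return sum(losses) / len(losses)

def total_loss(mask2former_loss, aux_dice, aux_weight=1.0):
    """Main objective plus weighted auxiliary term; aux_weight is a placeholder."""
    return mask2former_loss + aux_weight * aux_dice

# Two classes over four pixels: class 0 is predicted perfectly, class 1 is missed.
preds = [[1.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
targets = [[1, 0, 1, 0], [0, 1, 0, 1]]
aux = classwise_dice_loss(preds, targets)
print(aux, total_loss(2.0, aux, aux_weight=0.4))
```

Because Dice is normalised by region size, a missed small organ costs as much as a missed large one, which is why it suits class-imbalanced foregrounds.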

5. Training Protocols, Efficiency, and Implementation Details

Key implementation aspects include:

Datasets: HaN-Seg (33 CT for validation, 42 CT+MR for testing); SegRap2023 (100 train, 7 val, 10 test CT).
Preprocessing (“three-channel trick”, SegRap2023): Stack the original slice, an upsampled copy, and a downsampled copy to form the three input channels.
Backbone: ResNet-50, from which four feature stages are extracted.
Batch Size/GPU: 2 per GPU, 8 $\times$ RTX 4090.
Optimization: Adam with weight decay.
Schedule: 40k-iteration warmup, 120k total iterations.
Resource Efficiency: Deformable attention is cheaper than dense attention in both time and memory, enabling affordable 2D/3D hybridization on constrained hardware.
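The efficiency argument can be checked with back-of-the-envelope arithmetic. The sketch below counts query-key score computations only and ignores projections and memory traffic. The token count and the head, level, and point settings are hypothetical example values, not the paper's configuration.

```python
def dense_interactions(num_tokens, heads):
    """Dense attention: every query scores every token, in every head."""
    return num_tokens * num_tokens * heads

def deformable_interactions(num_tokens, heads, levels, points):
    """Deformable attention: every query samples levels * points locations per head."""
    return num_tokens * heads * levels * points

n = 64 * 64  # hypothetical 64x64 feature map
dense = dense_interactions(n, heads=8)
deform = deformable_interactions(n, heads=8, levels=4, points=4)
print(dense, deform, dense // deform)
```

The ratio equals the token count divided by `levels * points`, so the advantage grows with input resolution.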

6. Quantitative Performance and Ablation Studies

The following summarizes the evaluation on benchmark datasets:

Dataset/Metric	Baseline (nnU-Net)	Naïve Mask2Former	Offset-Adjusted Mask2Former
HaN-Seg (35 CT, mDice)	58.69	—	72.26
HaN-Seg (42 CT+MR, mDice)	—	—	81.60
HaN-Seg (42 CT+MR, mIoU)	—	—	70.44
SegRap2023 (CT only, mDice)	84.65	84.18	87.77

On HaN-Seg, a gain of +13.6 Dice over nnU-Net and +0.35 Dice over prior SOTA (SegReg) was observed.
On SegRap2023, Offset-Adjusted Mask2Former outperformed previous top results (mean Dice 87.77 vs 84.65 for nnU-Net).
Ablations revealed the largest improvements arise from the combination of offset adjustment ("Sigmoid*2") and background location head(s). Qualitatively, the largest target-specific gains appeared for Cochlea and Optic Nerve, with consistent improvements for Mandible and Spinal Cord.
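The reported gains follow directly from the table. The short check below recomputes them from the tabulated values; the dictionary keys and the helper name are only labels chosen for this sketch.

```python
results = {
    "HaN-Seg CT mDice": {"nnU-Net": 58.69, "Offset-Adjusted": 72.26},
    "SegRap2023 CT mDice": {
        "nnU-Net": 84.65,
        "Naive Mask2Former": 84.18,
        "Offset-Adjusted": 87.77,
    },
}

def gain(dataset, baseline):
    """Absolute mean-Dice gain of the offset-adjusted model over a baseline."""
    row = results[dataset]
    return round(row["Offset-Adjusted"] - row[baseline], 2)

print(gain("HaN-Seg CT mDice", "nnU-Net"))
print(gain("SegRap2023 CT mDice", "nnU-Net"))
print(gain("SegRap2023 CT mDice", "Naive Mask2Former"))
```

The HaN-Seg difference rounds to the +13.6 Dice quoted in the text.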

7. Full Inference Workflow

The inference workflow relies on two key subroutines: bilinear sampling from feature maps and spatial projection of query reference points.
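The paper's own pseudocode is not available in this text. As a stand-in for the final step only, the sketch below shows the generic Mask2Former-style merge of per-query class scores and per-query masks into a label map, written in plain Python. It is an assumption that the offset-adjusted model keeps this step unchanged, and the function names are hypothetical.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    z = sum(exps)
    return [e / z for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def semantic_inference(class_logits, mask_logits):
    """Merge Q queries into an H x W label map.

    class_logits: Q x (C + 1), last entry is the "no object" class.
    mask_logits:  Q x H x W.
    """
    probs = [softmax(row)[:-1] for row in class_logits]  # drop "no object"
    q, c = len(mask_logits), len(probs[0])
    h, w = len(mask_logits[0]), len(mask_logits[0][0])
    labels = []
    for y in range(h):
        row = []
        for x in range(w):
            # Per-class score: sum over queries of class prob * mask prob.
            scores = [sum(probs[i][k] * sigmoid(mask_logits[i][y][x])
                          for i in range(q)) for k in range(c)]
            row.append(max(range(c), key=scores.__getitem__))
        labels.append(row)
    return labels

# Two queries, two organ classes, a 1x2 image.
cls = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
masks = [[[4.0, -4.0]], [[-4.0, 4.0]]]
print(semantic_inference(cls, masks))
```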

8. Context and Implications

Offset-Adjusted Mask2Former demonstrates state-of-the-art performance on two prominent datasets (HaN-Seg and SegRap2023), especially for mid-sized and small structures where standard Transformer-based methods typically underperform. By algebraically adjusting the deformable offsets, integrating deeper semantic features, and guiding convergence with auxiliary Dice losses, it addresses the chief limitations of previous architectures that relied solely on unconstrained attention or local CNN features. Performance gains are concentrated in anatomically challenging regions and compact organs, suggesting that offset-constrained attention effectively integrates fine-scale foreground context while avoiding background confusion (Zhang et al., 6 Jun 2025).

A plausible implication is that algebraic regularization of attention sampling points can serve as a generic principle for transformer-based segmentation models in domains where compact foregrounds predominate and class imbalance is severe.

References

Zhang et al. (6 Jun 2025). Query Nearby: Offset-Adjusted Mask2Former enhances small-organ segmentation.