TwoHead-SwinFPN: Dual-Task Document Manipulation Detector

Updated 26 January 2026
  • The paper introduces TwoHead-SwinFPN, a unified architecture that leverages a Swin Transformer, FPN, and CBAM for joint detection and pixel-level segmentation of document manipulations.
  • It employs uncertainty-weighted multi-task learning to balance binary classification and segmentation, achieving robust cross-domain generalization across languages and devices.
  • Key experiments on the FantasyIDiap dataset reveal competitive performance with 84.31% accuracy and 90.78% AUC, effectively addressing face swapping and text inpainting attacks.

TwoHead-SwinFPN is a unified deep learning architecture designed for the simultaneous detection and localization of synthetic manipulations in identity documents, with a specific focus on face swapping and text inpainting attacks. The model leverages a Swin Transformer backbone, a Feature Pyramid Network (FPN), and a UNet-style decoder enhanced by a Convolutional Block Attention Module (CBAM). A dual-head structure enables joint binary classification and pixel-level segmentation, optimizing both tasks through uncertainty-weighted multi-task learning. Empirical evaluation on the FantasyIDiap dataset across 10 languages and 3 acquisition devices demonstrates robust cross-domain generalization and competitive performance metrics (Naseeb et al., 19 Jan 2026).

1. Architectural Foundation

TwoHead-SwinFPN is architected for joint classification and segmentation, integrating several advanced components:

Swin Transformer Backbone

The Swin-Large variant constitutes the backbone, processing inputs $\mathbf{I}\in\mathbb{R}^{H\times W\times 3}$ and generating a hierarchy of feature maps: $\mathbf{F}=\{\mathbf{f}_0,\mathbf{f}_1,\mathbf{f}_2,\mathbf{f}_3\} = \mathrm{SwinBackbone}(\mathbf{I}),$ where the resolutions range from $\mathbf{f}_0\in\mathbb{R}^{\frac{H}{4}\times\frac{W}{4}\times192}$ up to $\mathbf{f}_3\in\mathbb{R}^{\frac{H}{32}\times\frac{W}{32}\times1536}$.

The shifted-window self-attention mechanism realizes both intra- and inter-window dependencies, strengthening hierarchical visual representations.

Feature Pyramid Network (FPN)

Multi-scale fusion is achieved by applying an FPN to $\mathbf{F}$: $\mathbf{P} = \{\mathbf{p}_0, \mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3\} = \mathrm{FPN}(\mathbf{F}),$ fixing each pyramid level to 256 output channels and facilitating simultaneous recognition of coarse and fine manipulations (e.g., face swaps and text inpainting).
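As a concrete illustration, the lateral-plus-top-down fusion can be sketched in PyTorch. This is a minimal FPN, not the paper's exact implementation: only the 192- and 1536-channel endpoints are given in the text, so the intermediate Swin-Large widths (384, 768) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleFPN(nn.Module):
    """Minimal FPN: 1x1 lateral convs to 256 channels, top-down
    upsample-and-add, then a 3x3 smoothing conv per level."""

    def __init__(self, in_channels=(192, 384, 768, 1536), out_channels=256):
        super().__init__()
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )
        self.smooths = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels
        )

    def forward(self, feats):
        # feats: [f0 (H/4), f1 (H/8), f2 (H/16), f3 (H/32)]
        laterals = [lat(f) for lat, f in zip(self.laterals, feats)]
        # Top-down pathway: start from the coarsest level, upsample and add.
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest"
            )
        return [smooth(p) for smooth, p in zip(self.smooths, laterals)]


fpn = SimpleFPN()
feats = [
    torch.randn(1, 192, 128, 128),   # H/4 of a 512x512 input
    torch.randn(1, 384, 64, 64),     # H/8
    torch.randn(1, 768, 32, 32),     # H/16
    torch.randn(1, 1536, 16, 16),    # H/32
]
pyramid = fpn(feats)
print([tuple(p.shape) for p in pyramid])
```

Every pyramid level keeps its spatial resolution but is projected to 256 channels, which is what allows the decoder to treat coarse and fine evidence uniformly.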

CBAM-Enhanced UNet Decoder

Decoder blocks feature the sequence: $3\times3$ convolution, BatchNorm, ReLU, channel and spatial attention (CBAM), and up-sampling. CBAM refines intermediate activations via both channel and spatial attention: $\mathbf{x}_{\text{out}} = \mathbf{M}_s \odot (\mathbf{M}_c \odot \mathbf{x}),$ helping the network emphasize regions critical to manipulation detection and localization.
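The attention product $\mathbf{M}_s \odot (\mathbf{M}_c \odot \mathbf{x})$ can be sketched as follows. This is a minimal CBAM in PyTorch; the reduction ratio and the $7\times7$ spatial kernel are standard CBAM defaults, not values stated in the paper.

```python
import torch
import torch.nn as nn


class CBAM(nn.Module):
    """Channel attention followed by spatial attention:
    x_out = M_s * (M_c * x)."""

    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Channel attention: shared MLP over global avg- and max-pooled stats.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # Spatial attention: 7x7 conv over channel-wise avg and max maps.
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        avg = x.mean(dim=(2, 3), keepdim=True)
        mx = x.amax(dim=(2, 3), keepdim=True)
        m_c = torch.sigmoid(self.mlp(avg) + self.mlp(mx))   # (B, C, 1, 1)
        x = x * m_c                                          # channel gating
        avg_sp = x.mean(dim=1, keepdim=True)
        max_sp = x.amax(dim=1, keepdim=True)
        m_s = torch.sigmoid(self.spatial(torch.cat([avg_sp, max_sp], dim=1)))
        return x * m_s                                       # spatial gating


x = torch.randn(2, 256, 64, 64)
y = CBAM(256)(x)
print(tuple(y.shape))
```

The module is shape-preserving, so it slots between any decoder convolution and its up-sampling step.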

2. Dual-Head Structure: Detection & Localization

The model bifurcates at the output into two specialized heads:

Detection Head

Operates on f3\mathbf{f}_3 and comprises the following:

Layer             Operation                Output Shape
Conv 1×1          1536 → 1 channel         (1, H/32, W/32)
Dropout           p = 0.5                  (1, H/32, W/32)
Global Avg Pool   (1, H/32, W/32) → (1)    (1)
Sigmoid           p ∈ (0, 1)               (1)

The scalar output is interpreted as the probability that the input is manipulated.
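The table above maps directly onto a few lines of PyTorch; the following is a sketch with illustrative layer names, not the authors' code.

```python
import torch
import torch.nn as nn


class DetectionHead(nn.Module):
    """Conv 1x1 (1536 -> 1), Dropout(0.5), global average pool, sigmoid."""

    def __init__(self, in_channels=1536, p=0.5):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.drop = nn.Dropout2d(p)

    def forward(self, f3):                  # f3: (B, 1536, H/32, W/32)
        x = self.drop(self.conv(f3))        # (B, 1, H/32, W/32)
        x = x.mean(dim=(2, 3))              # global average pool -> (B, 1)
        return torch.sigmoid(x).squeeze(1)  # manipulation probability


head = DetectionHead().eval()               # eval(): dropout is a no-op
prob = head(torch.randn(4, 1536, 16, 16))   # 512x512 input -> 16x16 at H/32
print(prob.shape)
```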

Segmentation Head

Processes fused pyramid features P\mathbf{P} through UNet-like up-sampling blocks:

  • Outputs a $1\times512\times512$ soft mask via a $1\times1$ convolution and sigmoid activation, indicating regions of synthetic modification.

3. Multi-Task Learning and Optimization

Both heads are optimized under a unified training objective:

Focal Loss for Classification

$\mathcal{L}_{\mathrm{det}} = -\alpha (1-p_t)^\gamma \log(p_t),$

where $p_t$ is the probability assigned to the true class.
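A minimal numeric sketch of the focal loss; the $\alpha$ and $\gamma$ defaults below are the common choices from the focal-loss literature, not values stated here.

```python
import math


def focal_loss(p_t, alpha=0.25, gamma=2.0):
    """Focal loss for one example; p_t is the probability the model
    assigns to the true class."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# The (1 - p_t)^gamma factor strongly down-weights easy examples:
easy = focal_loss(0.95)   # confident, correct prediction
hard = focal_loss(0.20)   # badly misclassified example
print(easy, hard)
```

With $\gamma = 0$ the expression reduces to $\alpha$-weighted cross-entropy, which is a quick sanity check on any implementation.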

Segmentation Loss

Segmented outputs receive a compound loss: $\mathcal{L}_{\mathrm{seg}} = w_{\mathrm{main}}\mathcal{L}_{\mathrm{dice}} + w_{\mathrm{aux}}\mathcal{L}_{\mathrm{aux}} + w_{\mathrm{bound}}\mathcal{L}_{\mathrm{boundary}},$ with the primary Dice loss defined as

$\mathcal{L}_{\mathrm{dice}} = 1 - \frac{2\sum_i p_i g_i + \epsilon}{\sum_i p_i + \sum_i g_i + \epsilon}.$
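The Dice loss can be checked numerically with a small pure-Python sketch; the smoothing constant $\epsilon = 1$ here is an assumption.

```python
def dice_loss(pred, target, eps=1.0):
    """Soft Dice loss over flattened pixel lists."""
    inter = sum(p * g for p, g in zip(pred, target))
    return 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)


# A perfect mask gives zero loss; a fully wrong one is heavily penalized.
perfect = dice_loss([1.0, 0.0, 1.0], [1.0, 0.0, 1.0])   # 0.0
poor = dice_loss([0.0, 1.0, 0.0], [1.0, 0.0, 1.0])      # 0.75
print(perfect, poor)
```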

Uncertainty-Weighted Loss Aggregation

Global optimization uses learnable task uncertainties $\sigma_{\mathrm{det}}$ and $\sigma_{\mathrm{seg}}$: $\mathcal{L}_{\mathrm{total}} = \frac{1}{2\sigma_{\mathrm{det}}^2} \mathcal{L}_{\mathrm{det}} + \log\sigma_{\mathrm{det}} + \frac{1}{2\sigma_{\mathrm{seg}}^2} \mathcal{L}_{\mathrm{seg}} + \log\sigma_{\mathrm{seg}}.$ These terms self-adjust during gradient descent, automatically balancing detection against segmentation, with each loss contribution down-weighted as its uncertainty rises.
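The self-balancing behavior can be seen numerically in a pure-Python sketch of the two per-task terms: raising a task's $\sigma$ shrinks its precision-weighted loss, while the $\log\sigma$ term penalizes unbounded uncertainty.

```python
import math


def weighted_terms(loss, sigma):
    """One task's contribution: precision-weighted loss plus log regularizer."""
    return loss / (2.0 * sigma ** 2), math.log(sigma)


def total_loss(l_det, l_seg, sigma_det, sigma_seg):
    w_det, r_det = weighted_terms(l_det, sigma_det)
    w_seg, r_seg = weighted_terms(l_seg, sigma_seg)
    return w_det + r_det + w_seg + r_seg


# Doubling sigma cuts the weighted loss from 0.5 to 0.125,
# at the cost of a log(2) regularization penalty:
print(weighted_terms(1.0, 1.0))   # (0.5, 0.0)
print(weighted_terms(1.0, 2.0))   # (0.125, 0.693...)
print(total_loss(1.0, 1.0, 1.0, 1.0))
```

In training, $\sigma_{\mathrm{det}}$ and $\sigma_{\mathrm{seg}}$ are learnable parameters updated by the same gradient descent as the network weights.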

4. Empirical Setup and Data

FantasyIDiap Dataset Characteristics

  • 2,358 total images: 786 bona fide, 1,572 manipulated.
  • 10 languages (stratified, e.g., Turkish 13.0%, Russian 7.6%).
  • 3 acquisition devices: Huawei Mate 30, iPhone 15 Pro, and high-resolution scanner.
  • Manipulation types: digital_1 (face swap), digital_2 (text inpainting).
  • Images normalized and resized to $512\times512$.

Splits and Augmentation

  • 70/15/15 train/validation/test ratio; stratified by class, language, device.
  • Augmentation via Albumentations: photometric (±30% brightness/contrast, HSV/RGB jitter), compression (JPEG 60–100%, Gaussian blur, noise), geometric (horizontal flip, 90° rotation, elastic, perspective), MixUp (β(0.4, 0.4) at 50%).
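A sketch of such a pipeline under the Albumentations API follows. The transform choices and probabilities are illustrative assumptions, and argument names can differ between library versions.

```python
import albumentations as A

# Configuration sketch of the augmentation policy described above.
train_aug = A.Compose([
    # photometric
    A.RandomBrightnessContrast(brightness_limit=0.3, contrast_limit=0.3, p=0.5),
    A.HueSaturationValue(p=0.3),
    A.RGBShift(p=0.3),
    # compression / degradation
    A.ImageCompression(quality_lower=60, quality_upper=100, p=0.3),
    A.GaussianBlur(p=0.2),
    A.GaussNoise(p=0.2),
    # geometric (applied to image and mask jointly)
    A.HorizontalFlip(p=0.5),
    A.RandomRotate90(p=0.5),
    A.ElasticTransform(p=0.2),
    A.Perspective(p=0.2),
])
# MixUp (Beta(0.4, 0.4), applied to 50% of batches) is typically done at
# the batch level in the training loop, outside Albumentations.
```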

5. Implementation and Deployment

Technical Model Profile

  • PyTorch, mixed-precision training, gradient clipping (max norm 1.0).
  • Model size: 180M parameters, ≈180 MB disk.
  • Inference: 198 ms/image (Tesla V100), 2.1 s/image (Xeon CPU).

Service Endpoints via FastAPI

Endpoint               Output
/detect                manipulation probability
/localize              manipulation mask
/detect_and_localize   both outputs

These endpoints are suited for scalable real-world deployment, with sub-second GPU inference performance.

6. Experimental Results and Comparative Analysis

Test-Set Evaluation: FantasyIDiap

Metric Value
Accuracy 84.31 %
AUC 90.78 %
F1-Score 88.61 %
Avg. Precision (AP) 95.13 %
Mean Dice Score 57.24 %
Mean IoU 50.77 %
Dice Std. Dev. 41.09 %
Optimal Seg. Threshold 0.10

Ablation Study: Backbone & Components

Configuration               Acc     F1      Dice    ΔAcc
ResNet-50 Baseline          78.2%   82.1%   48.3%   —
+ Swin Transformer          81.5%   85.2%   52.1%   +3.3%
+ Feature Pyramid Network   82.8%   86.7%   54.6%   +1.3%
+ CBAM Attention            83.9%   87.8%   56.2%   +1.1%
+ Multi-task Learning       84.3%   88.6%   57.2%   +0.4%

Loss Weighting Impact

Approach Acc F1 Dice
Fixed 0.5/0.5 weights 82.1% 86.3% 54.8%
Manual grid-search weights 83.7% 87.9% 56.1%
Uncertainty weighting 84.3% 88.6% 57.2%

Cross-Device and Cross-Language Generalization

Device performance:

Device Accuracy F1-Score
Huawei 85.2 % 89.5 %
iPhone 84.1 % 88.2 %
Scanner 83.8 % 87.9 %

Top-5 language performance:

Language Samples Accuracy F1-Score
Turkish 306 85.1 % 89.2 %
Chinese 288 84.7 % 88.9 %
Portuguese 261 83.9 % 88.1 %
English 252 84.2 % 88.5 %
French 243 83.8 % 87.8 %
Average 270 84.3 % 88.5 %

7. Contributions, Limitations, and Prospects

Key Achievements

  • The architecture jointly leverages Swin Transformer, FPN, and CBAM under a dual-head paradigm for effective manipulation detection and localization.
  • Uncertainty-weighted multi-task loss facilitates adaptive balancing between classification and segmentation supervision.
  • The FantasyIDiap benchmark validates generalization across languages, devices, and manipulation types.

Strengths

  • High accuracy (84.31%), AUC (90.78%), and F1 (88.61%) on binary manipulation detection.
  • High Dice scores (>0.9) on clear, large manipulations.
  • Demonstrated cross-lingual and cross-device robustness.

Limitations

  • Moderate average Dice (57.24%) with substantial variance, indicating challenges in localizing subtle text inpainting attacks.
  • CPU inference speed (2.1 s/image) may not meet real-time constraints.
  • Unverified generalization to alternative manipulation types (e.g., neural style transfer).

Future Directions

  • Frequency-domain feature extraction (DCT, wavelet analysis) for improved artifact detection.
  • Adversarial training to increase resilience against adaptive synthetic attacks.
  • Pruning, quantization, or distillation for edge device deployment.
  • Incorporation of advanced attention mechanisms (e.g., deformable, multi-head cross-attention) to enhance fine-grained localization.
  • Evaluation under cross-dataset and video-based protocols to assess transferability and temporal consistency.

TwoHead-SwinFPN thus synthesizes state-of-the-art vision transformer methods, attention modules, and robust multi-task optimization for practical and research-oriented document integrity analysis (Naseeb et al., 19 Jan 2026).
