TwoHead-SwinFPN: Dual-Task Document Manipulation Detector

Updated 26 January 2026
  • The paper introduces TwoHead-SwinFPN, a unified architecture that leverages a Swin Transformer, FPN, and CBAM for joint detection and pixel-level segmentation of document manipulations.
  • It employs uncertainty-weighted multi-task learning to balance binary classification and segmentation, achieving robust cross-domain generalization across languages and devices.
  • Key experiments on the FantasyIDiap dataset reveal competitive performance with 84.31% accuracy and 90.78% AUC, effectively addressing face swapping and text inpainting attacks.

TwoHead-SwinFPN is a unified deep learning architecture designed for the simultaneous detection and localization of synthetic manipulations in identity documents, with a specific focus on face swapping and text inpainting attacks. The model leverages a Swin Transformer backbone, a Feature Pyramid Network (FPN), and a UNet-style decoder enhanced by a Convolutional Block Attention Module (CBAM). A dual-head structure enables joint binary classification and pixel-level segmentation, optimizing both tasks through uncertainty-weighted multi-task learning. Empirical evaluation on the FantasyIDiap dataset across 10 languages and 3 acquisition devices demonstrates robust cross-domain generalization and competitive performance metrics (Naseeb et al., 19 Jan 2026).

1. Architectural Foundation

TwoHead-SwinFPN is architected for joint classification and segmentation, integrating several advanced components:

Swin Transformer Backbone

The Swin-Large variant constitutes the backbone, processing inputs $\mathbf{I}\in\mathbb{R}^{H\times W\times 3}$ and generating a hierarchy of feature maps: $\mathbf{F}=\{\mathbf{f}_0,\mathbf{f}_1,\mathbf{f}_2,\mathbf{f}_3\} = \mathrm{SwinBackbone}(\mathbf{I}),$ where the resolutions range from $\mathbf{f}_0\in\mathbb{R}^{\frac{H}{4}\times\frac{W}{4}\times192}$ up to $\mathbf{f}_3\in\mathbb{R}^{\frac{H}{32}\times\frac{W}{32}\times1536}$.

The shifted-window self-attention mechanism realizes both intra- and inter-window dependencies, strengthening hierarchical visual representations.

Feature Pyramid Network (FPN)

Multi-scale fusion is achieved by applying an FPN to $\mathbf{F}$: $\mathbf{P} = \{\mathbf{p}_0, \mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3\} = \mathrm{FPN}(\mathbf{F}),$ fixing each pyramid level to 256 output channels and facilitating simultaneous recognition of coarse and fine manipulations (e.g., face swaps and text inpainting).
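As a concrete illustration, the lateral-plus-top-down fusion can be sketched in PyTorch. This is a minimal FPN, not the paper's exact implementation: only the 192- and 1536-channel endpoints are given in the text, so the intermediate Swin-Large widths (384, 768) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleFPN(nn.Module):
    """Minimal FPN: 1x1 lateral convs to 256 channels, top-down
    upsample-and-add, then a 3x3 smoothing conv per level."""

    def __init__(self, in_channels=(192, 384, 768, 1536), out_channels=256):
        super().__init__()
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )
        self.smooths = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels
        )

    def forward(self, feats):
        # feats: [f0 (H/4), f1 (H/8), f2 (H/16), f3 (H/32)]
        laterals = [lat(f) for lat, f in zip(self.laterals, feats)]
        # Top-down pathway: start from the coarsest level, upsample and add.
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest"
            )
        return [smooth(p) for smooth, p in zip(self.smooths, laterals)]


fpn = SimpleFPN()
feats = [
    torch.randn(1, 192, 128, 128),   # H/4 of a 512x512 input
    torch.randn(1, 384, 64, 64),     # H/8
    torch.randn(1, 768, 32, 32),     # H/16
    torch.randn(1, 1536, 16, 16),    # H/32
]
pyramid = fpn(feats)
print([tuple(p.shape) for p in pyramid])
```

Every pyramid level keeps its spatial resolution but is projected to 256 channels, which is what allows the decoder to treat coarse and fine evidence uniformly.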

CBAM-Enhanced UNet Decoder

Decoder blocks feature the sequence: $3\times3$ convolution, BatchNorm, ReLU, channel and spatial attention (CBAM), and up-sampling. CBAM refines intermediate activations via both channel and spatial attention: $\mathbf{x}_{\text{out}} = \mathbf{M}_s \odot (\mathbf{M}_c \odot \mathbf{x}),$ helping the network emphasize regions critical to manipulation detection and localization.
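The attention product $\mathbf{M}_s \odot (\mathbf{M}_c \odot \mathbf{x})$ can be sketched as follows. This is a minimal CBAM in PyTorch; the reduction ratio and the $7\times7$ spatial kernel are standard CBAM defaults, not values stated in the paper.

```python
import torch
import torch.nn as nn


class CBAM(nn.Module):
    """Channel attention followed by spatial attention:
    x_out = M_s * (M_c * x)."""

    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Channel attention: shared MLP over global avg- and max-pooled stats.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # Spatial attention: 7x7 conv over channel-wise avg and max maps.
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        avg = x.mean(dim=(2, 3), keepdim=True)
        mx = x.amax(dim=(2, 3), keepdim=True)
        m_c = torch.sigmoid(self.mlp(avg) + self.mlp(mx))   # (B, C, 1, 1)
        x = x * m_c                                          # channel gating
        avg_sp = x.mean(dim=1, keepdim=True)
        max_sp = x.amax(dim=1, keepdim=True)
        m_s = torch.sigmoid(self.spatial(torch.cat([avg_sp, max_sp], dim=1)))
        return x * m_s                                       # spatial gating


x = torch.randn(2, 256, 64, 64)
y = CBAM(256)(x)
print(tuple(y.shape))
```

The module is shape-preserving, so it slots between any decoder convolution and its up-sampling step.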

2. Dual-Head Structure: Detection & Localization

The model bifurcates at the output into two specialized heads:

Detection Head

Operates on f3\mathbf{f}_3 and comprises the following:

Layer             Operation                Output Shape
Conv 1×1          1536 → 1 channel         (1, H/32, W/32)
Dropout           p = 0.5                  (1, H/32, W/32)
Global Avg Pool   (1, H/32, W/32) → (1)    (1)
Sigmoid           p ∈ (0, 1)               (1)

The scalar output is interpreted as the probability that the input is manipulated.
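The table above maps directly onto a few lines of PyTorch; the following is a sketch with illustrative layer names, not the authors' code.

```python
import torch
import torch.nn as nn


class DetectionHead(nn.Module):
    """Conv 1x1 (1536 -> 1), Dropout(0.5), global average pool, sigmoid."""

    def __init__(self, in_channels=1536, p=0.5):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.drop = nn.Dropout2d(p)

    def forward(self, f3):                  # f3: (B, 1536, H/32, W/32)
        x = self.drop(self.conv(f3))        # (B, 1, H/32, W/32)
        x = x.mean(dim=(2, 3))              # global average pool -> (B, 1)
        return torch.sigmoid(x).squeeze(1)  # manipulation probability


head = DetectionHead().eval()               # eval(): dropout is a no-op
prob = head(torch.randn(4, 1536, 16, 16))   # 512x512 input -> 16x16 at H/32
print(prob.shape)
```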

Segmentation Head

Processes fused pyramid features P\mathbf{P} through UNet-like up-sampling blocks:

  • Outputs a $1\times512\times512$ soft mask via a $1\times1$ convolution and sigmoid activation, indicating regions of synthetic modification.

3. Multi-Task Learning and Optimization

Both heads are optimized under a unified training objective:

Focal Loss for Classification

$\mathcal{L}_{\mathrm{det}} = -\alpha (1-p_t)^\gamma \log(p_t),$

where $p_t$ is the probability assigned to the true class.
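A minimal numeric sketch of the focal loss; the $\alpha$ and $\gamma$ defaults below are the common choices from the focal-loss literature, not values stated here.

```python
import math


def focal_loss(p_t, alpha=0.25, gamma=2.0):
    """Focal loss for one example; p_t is the probability the model
    assigns to the true class."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# The (1 - p_t)^gamma factor strongly down-weights easy examples:
easy = focal_loss(0.95)   # confident, correct prediction
hard = focal_loss(0.20)   # badly misclassified example
print(easy, hard)
```

With $\gamma = 0$ the expression reduces to $\alpha$-weighted cross-entropy, which is a quick sanity check on any implementation.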

Segmentation Loss

Segmented outputs receive a compound loss: $\mathcal{L}_{\mathrm{seg}} = w_{\mathrm{main}}\mathcal{L}_{\mathrm{dice}} + w_{\mathrm{aux}}\mathcal{L}_{\mathrm{aux}} + w_{\mathrm{bound}}\mathcal{L}_{\mathrm{boundary}},$ with the primary Dice loss defined as

$\mathcal{L}_{\mathrm{dice}} = 1 - \frac{2\sum_i p_i g_i + \epsilon}{\sum_i p_i + \sum_i g_i + \epsilon}.$
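The Dice loss can be checked numerically with a small pure-Python sketch; the smoothing constant $\epsilon = 1$ here is an assumption.

```python
def dice_loss(pred, target, eps=1.0):
    """Soft Dice loss over flattened pixel lists."""
    inter = sum(p * g for p, g in zip(pred, target))
    return 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)


# A perfect mask gives zero loss; a fully wrong one is heavily penalized.
perfect = dice_loss([1.0, 0.0, 1.0], [1.0, 0.0, 1.0])   # 0.0
poor = dice_loss([0.0, 1.0, 0.0], [1.0, 0.0, 1.0])      # 0.75
print(perfect, poor)
```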

Uncertainty-Weighted Loss Aggregation

Global optimization uses learnable task uncertainties $\sigma_{\mathrm{det}}$ and $\sigma_{\mathrm{seg}}$: $\mathcal{L}_{\mathrm{total}} = \frac{1}{2\sigma_{\mathrm{det}}^2} \mathcal{L}_{\mathrm{det}} + \log\sigma_{\mathrm{det}} + \frac{1}{2\sigma_{\mathrm{seg}}^2} \mathcal{L}_{\mathrm{seg}} + \log\sigma_{\mathrm{seg}}.$ These terms self-adjust during gradient descent, automatically balancing detection against segmentation, with each loss contribution down-weighted as its uncertainty rises.
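The self-balancing behavior can be seen numerically in a pure-Python sketch of the two per-task terms: raising a task's $\sigma$ shrinks its precision-weighted loss, while the $\log\sigma$ term penalizes unbounded uncertainty.

```python
import math


def weighted_terms(loss, sigma):
    """One task's contribution: precision-weighted loss plus log regularizer."""
    return loss / (2.0 * sigma ** 2), math.log(sigma)


def total_loss(l_det, l_seg, sigma_det, sigma_seg):
    w_det, r_det = weighted_terms(l_det, sigma_det)
    w_seg, r_seg = weighted_terms(l_seg, sigma_seg)
    return w_det + r_det + w_seg + r_seg


# Doubling sigma cuts the weighted loss from 0.5 to 0.125,
# at the cost of a log(2) regularization penalty:
print(weighted_terms(1.0, 1.0))   # (0.5, 0.0)
print(weighted_terms(1.0, 2.0))   # (0.125, 0.693...)
print(total_loss(1.0, 1.0, 1.0, 1.0))
```

In training, $\sigma_{\mathrm{det}}$ and $\sigma_{\mathrm{seg}}$ are learnable parameters updated by the same gradient descent as the network weights.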

4. Empirical Setup and Data

FantasyIDiap Dataset Characteristics

  • 2,358 total images: 786 bona fide, 1,572 manipulated.
  • 10 languages (stratified, e.g., Turkish 13.0%, Russian 7.6%).
  • 3 acquisition devices: Huawei Mate 30, iPhone 15 Pro, and high-resolution scanner.
  • Manipulation types: digital_1 (face swap), digital_2 (text inpainting).
  • Images normalized and resized to $512\times512$.

Splits and Augmentation

  • 70/15/15 train/validation/test ratio; stratified by class, language, device.
  • Augmentation via Albumentations: photometric (±30% brightness/contrast, HSV/RGB jitter), compression (JPEG 60–100%, Gaussian blur, noise), geometric (horizontal flip, 90° rotation, elastic, perspective), MixUp (β(0.4, 0.4) at 50%).
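A sketch of such a pipeline under the Albumentations API follows. The transform choices and probabilities are illustrative assumptions, and argument names can differ between library versions.

```python
import albumentations as A

# Configuration sketch of the augmentation policy described above.
train_aug = A.Compose([
    # photometric
    A.RandomBrightnessContrast(brightness_limit=0.3, contrast_limit=0.3, p=0.5),
    A.HueSaturationValue(p=0.3),
    A.RGBShift(p=0.3),
    # compression / degradation
    A.ImageCompression(quality_lower=60, quality_upper=100, p=0.3),
    A.GaussianBlur(p=0.2),
    A.GaussNoise(p=0.2),
    # geometric (applied to image and mask jointly)
    A.HorizontalFlip(p=0.5),
    A.RandomRotate90(p=0.5),
    A.ElasticTransform(p=0.2),
    A.Perspective(p=0.2),
])
# MixUp (Beta(0.4, 0.4), applied to 50% of batches) is typically done at
# the batch level in the training loop, outside Albumentations.
```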

5. Implementation and Deployment

Technical Model Profile

  • PyTorch, mixed-precision training, gradient clipping (max norm 1.0).
  • Model size: 180M parameters, ≈180 MB disk.
  • Inference: 198 ms/image (Tesla V100), 2.1 s/image (Xeon CPU).

Service Endpoints via FastAPI

Endpoint               Output
/detect                manipulation probability
/localize              manipulation mask
/detect_and_localize   both outputs

These endpoints are suited for scalable real-world deployment, with sub-second GPU inference performance.

6. Experimental Results and Comparative Analysis

Test-Set Evaluation: FantasyIDiap

Metric Value
Accuracy 84.31 %
AUC 90.78 %
F1-Score 88.61 %
Avg. Precision (AP) 95.13 %
Mean Dice Score 57.24 %
Mean IoU 50.77 %
Dice Std. Dev. 41.09 %
Optimal Seg. Threshold 0.10

Ablation Study: Backbone & Components

Configuration               Acc     F1      Dice    ΔAcc
ResNet-50 Baseline          78.2%   82.1%   48.3%   —
+ Swin Transformer          81.5%   85.2%   52.1%   +3.3%
+ Feature Pyramid Network   82.8%   86.7%   54.6%   +1.3%
+ CBAM Attention            83.9%   87.8%   56.2%   +1.1%
+ Multi-task Learning       84.3%   88.6%   57.2%   +0.4%

Loss Weighting Impact

Approach Acc F1 Dice
Fixed 0.5/0.5 weights 82.1% 86.3% 54.8%
Manual grid-search weights 83.7% 87.9% 56.1%
Uncertainty weighting 84.3% 88.6% 57.2%

Cross-Device and Cross-Language Generalization

Device performance:

Device Accuracy F1-Score
Huawei 85.2 % 89.5 %
iPhone 84.1 % 88.2 %
Scanner 83.8 % 87.9 %

Top-5 language performance:

Language Samples Accuracy F1-Score
Turkish 306 85.1 % 89.2 %
Chinese 288 84.7 % 88.9 %
Portuguese 261 83.9 % 88.1 %
English 252 84.2 % 88.5 %
French 243 83.8 % 87.8 %
Average 270 84.3 % 88.5 %

7. Contributions, Limitations, and Prospects

Key Achievements

  • The architecture jointly leverages Swin Transformer, FPN, and CBAM under a dual-head paradigm for effective manipulation detection and localization.
  • Uncertainty-weighted multi-task loss facilitates adaptive balancing between classification and segmentation supervision.
  • The FantasyIDiap benchmark validates generalization across languages, devices, and manipulation types.

Strengths

  • High accuracy (84.31%), AUC (90.78%), and F1 (88.61%) on binary manipulation detection.
  • High Dice scores (>0.9) on clear, large manipulations.
  • Demonstrated cross-lingual and cross-device robustness.

Limitations

  • Moderate average Dice (57.24%) with substantial variance, indicating challenges in localizing subtle text inpainting attacks.
  • CPU inference speed (2.1 s/image) may not meet real-time constraints.
  • Unverified generalization to alternative manipulation types (e.g., neural style transfer).

Future Directions

  • Frequency-domain feature extraction (DCT, wavelet analysis) for improved artifact detection.
  • Adversarial training to increase resilience against adaptive synthetic attacks.
  • Pruning, quantization, or distillation for edge device deployment.
  • Incorporation of advanced attention mechanisms (e.g., deformable, multi-head cross-attention) to enhance fine-grained localization.
  • Evaluation under cross-dataset and video-based protocols to assess transferability and temporal consistency.

TwoHead-SwinFPN thus synthesizes state-of-the-art vision transformer methods, attention modules, and robust multi-task optimization for practical and research-oriented document integrity analysis (Naseeb et al., 19 Jan 2026).
