TwoHead-SwinFPN: Dual-Task Document Manipulation Detector
- The paper introduces TwoHead-SwinFPN, a unified architecture that leverages a Swin Transformer, FPN, and CBAM for joint detection and pixel-level segmentation of document manipulations.
- It employs uncertainty-weighted multi-task learning to balance binary classification and segmentation, achieving robust cross-domain generalization across languages and devices.
- Key experiments on the FantasyIDiap dataset reveal competitive performance with 84.31% accuracy and 90.78% AUC, effectively addressing face swapping and text inpainting attacks.
TwoHead-SwinFPN is a unified deep learning architecture designed for the simultaneous detection and localization of synthetic manipulations in identity documents, with a specific focus on face swapping and text inpainting attacks. The model leverages a Swin Transformer backbone, a Feature Pyramid Network (FPN), and a UNet-style decoder enhanced by a Convolutional Block Attention Module (CBAM). A dual-head structure enables joint binary classification and pixel-level segmentation, optimizing both tasks through uncertainty-weighted multi-task learning. Empirical evaluation on the FantasyIDiap dataset across 10 languages and 3 acquisition devices demonstrates robust cross-domain generalization and competitive performance metrics (Naseeb et al., 19 Jan 2026).
1. Architectural Foundation
TwoHead-SwinFPN is architected for joint classification and segmentation, integrating several advanced components:
Swin Transformer Backbone
The Swin-Large variant constitutes the backbone, processing input images and generating a hierarchy of feature maps whose channel width doubles at each stage, reaching 1536 channels at the deepest level.
The shifted-window self-attention mechanism realizes both intra- and inter-window dependencies, strengthening hierarchical visual representations.
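The stage-wise feature geometry of the backbone can be computed directly. A minimal sketch, assuming the standard Swin-Large configuration (patch size 4, embedding dim 192) and an illustrative 224×224 input; the 1536-channel final stage is consistent with the detection head's 1536-channel input:

```python
# Stage-by-stage feature shapes for a Swin-Large backbone (sketch).
# Assumptions: 224x224 input, patch size 4, embedding dim 192; each
# subsequent stage halves spatial resolution via patch merging and
# doubles the channel count, ending at 1536 channels.

def swin_large_feature_shapes(input_size=224, patch=4, embed_dim=192, stages=4):
    shapes = []
    side = input_size // patch          # patch embedding: 224 -> 56
    dim = embed_dim
    for _ in range(stages):
        shapes.append((side, side, dim))
        side //= 2                      # patch merging between stages
        dim *= 2
    return shapes

print(swin_large_feature_shapes())
# [(56, 56, 192), (28, 28, 384), (14, 14, 768), (7, 7, 1536)]
```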
Feature Pyramid Network (FPN)
Multi-scale fusion is achieved by applying FPN to the backbone's stage outputs, fixing each pyramid level's output at 256 channels and facilitating simultaneous recognition of coarse and fine manipulations (e.g., face swaps and text inpainting).
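A minimal FPN sketch in PyTorch: 1×1 lateral convolutions project each stage to 256 channels, a top-down pathway upsamples and sums, and 3×3 convolutions smooth each level. The input channel counts (192–1536) assume a Swin-Large backbone and are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Minimal FPN: 1x1 lateral convs project each backbone stage to a
    fixed 256 channels; the top-down pathway upsamples the coarser level
    and adds it in; 3x3 convs smooth the merged maps."""
    def __init__(self, in_channels=(192, 384, 768, 1536), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):  # feats ordered fine -> coarse
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):  # start at the coarsest level
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [s(l) for s, l in zip(self.smooth, laterals)]

feats = [torch.randn(1, c, s, s) for c, s in [(192, 56), (384, 28), (768, 14), (1536, 7)]]
outs = SimpleFPN()(feats)
print([tuple(o.shape) for o in outs])  # every level now has 256 channels
```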
CBAM-Enhanced UNet Decoder
Decoder blocks feature the sequence: convolution, BatchNorm, ReLU, channel and spatial attention (CBAM), and up-sampling. CBAM refines intermediate activations via sequential channel and spatial attention, $F' = M_c(F) \otimes F$ and $F'' = M_s(F') \otimes F'$, helping the network emphasize regions critical to manipulation detection and localization.
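The decoder-block sequence described above can be sketched as follows; the reduction ratio (16) and 7×7 spatial-attention kernel are the usual CBAM defaults, assumed here rather than taken from the paper:

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention (shared MLP over average- and max-pooled
    descriptors) followed by spatial attention (7x7 conv over pooled
    channel maps), as in the standard CBAM formulation."""
    def __init__(self, channels, reduction=16, kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))
        self.spatial = nn.Conv2d(2, 1, kernel, padding=kernel // 2, bias=False)

    def forward(self, x):
        # channel attention: F' = Mc(F) * F
        ca = torch.sigmoid(self.mlp(x.mean((2, 3), keepdim=True))
                           + self.mlp(x.amax((2, 3), keepdim=True)))
        x = x * ca
        # spatial attention: F'' = Ms(F') * F'
        sa = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)))
        return x * sa

class DecoderBlock(nn.Module):
    """Conv -> BatchNorm -> ReLU -> CBAM -> 2x upsample, per the text."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True), CBAM(out_ch),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False))

    def forward(self, x):
        return self.body(x)

x = torch.randn(1, 256, 7, 7)
print(tuple(DecoderBlock(256, 128)(x).shape))  # (1, 128, 14, 14)
```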
2. Dual-Head Structure: Detection & Localization
The model bifurcates at the output into two specialized heads:
Detection Head
Operates on the backbone's deepest feature map (1536 channels) and comprises the following:
| Layer | Operation | Output Shape |
|---|---|---|
| 1×1 Conv | 1536 → 1 channel | H × W × 1 |
| Dropout | regularization | H × W × 1 |
| Global Avg Pool | spatial mean | scalar |
| Sigmoid | — | scalar |
The scalar output is interpreted as the probability that the input is manipulated.
Segmentation Head
Processes fused pyramid features through UNet-like up-sampling blocks:
- Outputs a soft mask via 1×1 convolution and sigmoid activation, indicating regions of synthetic modification.
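A sketch of the two heads in PyTorch. The detection head follows the table above (1×1 conv from 1536 channels, dropout, global average pool, sigmoid); the dropout rate, the decoder's 64 output channels, and the 224×224 mask size are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """1x1 conv (1536 -> 1), dropout, global average pool, sigmoid,
    yielding a scalar manipulation probability. p_drop=0.5 is an
    illustrative choice, not taken from the paper."""
    def __init__(self, in_ch=1536, p_drop=0.5):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 1, kernel_size=1)
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):
        x = self.drop(self.conv(x))
        return torch.sigmoid(x.mean((2, 3)))  # global average pool -> (B, 1)

class SegmentationHead(nn.Module):
    """1x1 conv + sigmoid on decoder features -> soft manipulation mask."""
    def __init__(self, in_ch=64):  # hypothetical decoder channel count
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 1, kernel_size=1)

    def forward(self, x):
        return torch.sigmoid(self.conv(x))

prob = DetectionHead()(torch.randn(2, 1536, 7, 7))
mask = SegmentationHead()(torch.randn(2, 64, 224, 224))
print(tuple(prob.shape), tuple(mask.shape))  # (2, 1) (2, 1, 224, 224)
```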
3. Multi-Task Learning and Optimization
Both heads are optimized under a unified training objective:
Focal Loss for Classification
The detection head is trained with focal loss, $\mathcal{L}_{\text{focal}} = -\alpha\,(1 - p_t)^{\gamma}\,\log p_t$, where $p_t$ is the probability assigned to the true class; the modulating factor $(1 - p_t)^{\gamma}$ down-weights easy examples.
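A minimal binary focal loss sketch; the defaults $\alpha = 0.25$, $\gamma = 2$ are the common choices, assumed here since the paper's exact values are not stated, and $\alpha$ is applied uniformly for simplicity:

```python
import torch

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: -alpha * (1 - p_t)^gamma * log(p_t), where p_t
    is the predicted probability of the true class. Simplified: alpha is
    applied uniformly rather than split by class."""
    p_t = torch.where(y.bool(), p, 1 - p)
    return (-alpha * (1 - p_t) ** gamma * torch.log(p_t.clamp_min(1e-8))).mean()

p = torch.tensor([0.9, 0.1, 0.8])   # predicted probabilities
y = torch.tensor([1.0, 0.0, 1.0])   # ground-truth labels
print(float(focal_loss(p, y)) < float(focal_loss(1 - p, y)))  # True: confident-correct < confident-wrong
```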
Segmentation Loss
Segmented outputs receive a compound loss whose primary term is the Dice loss, $\mathcal{L}_{\text{Dice}} = 1 - \frac{2\sum_i p_i g_i + \epsilon}{\sum_i p_i + \sum_i g_i + \epsilon}$, where $p_i$ and $g_i$ are the predicted and ground-truth mask values and $\epsilon$ is a smoothing constant.
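The soft Dice term can be sketched directly on sigmoid probabilities; the smoothing constant here is an illustrative choice:

```python
import torch

def dice_loss(pred, target, eps=1.0):
    """Soft Dice loss: 1 - (2*|P∩G| + eps) / (|P| + |G| + eps), computed
    on sigmoid probabilities; eps smooths the empty-mask edge case."""
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

perfect = torch.ones(1, 1, 8, 8)
print(float(dice_loss(perfect, perfect)))  # 0.0 for a perfect prediction
```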
Uncertainty-Weighted Loss Aggregation
Global optimization uses learnable task uncertainties $\sigma_1$ and $\sigma_2$: $\mathcal{L}_{\text{total}} = \frac{1}{2\sigma_1^2}\mathcal{L}_{\text{cls}} + \frac{1}{2\sigma_2^2}\mathcal{L}_{\text{seg}} + \log\sigma_1 + \log\sigma_2$. These terms self-adjust during gradient descent, automatically balancing detection and segmentation, with a task's loss contribution down-weighted as its uncertainty rises.
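A sketch of the uncertainty weighting in the usual log-variance parameterization (Kendall-style, assumed here for numerical stability rather than taken verbatim from the paper):

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Learnable homoscedastic-uncertainty weighting:
    total = exp(-s1)*L_cls + exp(-s2)*L_seg + s1 + s2, with s_i = log(sigma_i^2)
    learned jointly with the network. As a task's uncertainty grows its
    loss term is down-weighted, while the additive s_i terms penalise
    letting the uncertainty grow without bound."""
    def __init__(self):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(2))  # s1, s2

    def forward(self, loss_cls, loss_seg):
        s1, s2 = self.log_vars
        return torch.exp(-s1) * loss_cls + torch.exp(-s2) * loss_seg + s1 + s2

w = UncertaintyWeighting()
total = w(torch.tensor(0.7), torch.tensor(0.3))
print(float(total))  # 1.0 at initialisation (s1 = s2 = 0)
```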
4. Empirical Setup and Data
FantasyIDiap Dataset Characteristics
- 2,358 total images: 786 bona fide, 1,572 manipulated.
- 10 languages (stratified, e.g., Turkish 13.0%, Russian 7.6%).
- 3 acquisition devices: Huawei Mate 30, iPhone 15 Pro, and high-resolution scanner.
- Manipulation types: digital_1 (face swap), digital_2 (text inpainting).
- Images normalized and resized to the network's fixed input resolution.
Splits and Augmentation
- 70/15/15 train/validation/test ratio; stratified by class, language, device.
- Augmentation via Albumentations: photometric (±30% brightness/contrast, HSV/RGB jitter), compression artifacts (JPEG quality 60–100, Gaussian blur, noise), geometric (horizontal flip, 90° rotation, elastic, perspective), and MixUp (λ ~ Beta(0.4, 0.4), applied with 50% probability).
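Of these augmentations, MixUp is the least standard for forensics pipelines, so a minimal sketch may help: a batch is convexly combined with a shuffled copy of itself, with the mixing coefficient drawn from Beta(0.4, 0.4) as in the recipe above:

```python
import torch

def mixup(images, labels, alpha=0.4):
    """MixUp with lambda ~ Beta(alpha, alpha): convexly combines a batch
    with a shuffled copy of itself, mixing labels with the same weight."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1 - lam) * images[perm]
    targets = lam * labels + (1 - lam) * labels[perm]
    return mixed, targets

imgs = torch.rand(4, 3, 32, 32)
lbls = torch.tensor([0.0, 1.0, 1.0, 0.0])
mixed, targets = mixup(imgs, lbls)
print(tuple(mixed.shape), 0.0 <= float(targets.min()) <= 1.0)  # (4, 3, 32, 32) True
```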
5. Implementation and Deployment
Technical Model Profile
- PyTorch, mixed-precision training, gradient clipping (max norm 1.0).
- Model size: 180M parameters, ≈180 MB disk.
- Inference: 198 ms/image (Tesla V100), 2.1 s/image (Xeon CPU).
Service Endpoints via FastAPI
| Endpoint | Output |
|---|---|
| `/detect` | manipulation probability |
| `/localize` | manipulation mask |
| `/detect_and_localize` | both outputs |
These endpoints are suited for scalable real-world deployment, with sub-second GPU inference performance.
6. Experimental Results and Comparative Analysis
Test-Set Evaluation: FantasyIDiap
| Metric | Value |
|---|---|
| Accuracy | 84.31 % |
| AUC | 90.78 % |
| F1-Score | 88.61 % |
| Avg. Precision (AP) | 95.13 % |
| Mean Dice Score | 57.24 % |
| Mean IoU | 50.77 % |
| Dice Std. Dev. | 41.09 % |
| Optimal Seg. Threshold | 0.10 |
Ablation Study: Backbone & Components
| Configuration | Acc | F1 | Dice | ΔAcc |
|---|---|---|---|---|
| ResNet-50 Baseline | 78.2% | 82.1% | 48.3% | — |
| + Swin Transformer | 81.5% | 85.2% | 52.1% | +3.3% |
| + Feature Pyramid Network | 82.8% | 86.7% | 54.6% | +1.3% |
| + CBAM Attention | 83.9% | 87.8% | 56.2% | +1.1% |
| + Multi-task Learning | 84.3% | 88.6% | 57.2% | +0.4% |
Loss Weighting Impact
| Approach | Acc | F1 | Dice |
|---|---|---|---|
| Fixed 0.5/0.5 weights | 82.1% | 86.3% | 54.8% |
| Manual grid-search weights | 83.7% | 87.9% | 56.1% |
| Uncertainty weighting | 84.3% | 88.6% | 57.2% |
Cross-Device and Cross-Language Generalization
Device performance:
| Device | Accuracy | F1-Score |
|---|---|---|
| Huawei | 85.2 % | 89.5 % |
| iPhone | 84.1 % | 88.2 % |
| Scanner | 83.8 % | 87.9 % |
Top-5 language performance:
| Language | Samples | Accuracy | F1-Score |
|---|---|---|---|
| Turkish | 306 | 85.1 % | 89.2 % |
| Chinese | 288 | 84.7 % | 88.9 % |
| Portuguese | 261 | 83.9 % | 88.1 % |
| English | 252 | 84.2 % | 88.5 % |
| French | 243 | 83.8 % | 87.8 % |
| Average | 270 | 84.3 % | 88.5 % |
7. Contributions, Limitations, and Prospects
Key Achievements
- The architecture jointly leverages Swin Transformer, FPN, and CBAM under a dual-head paradigm for effective manipulation detection and localization.
- Uncertainty-weighted multi-task loss facilitates adaptive balancing between classification and segmentation supervision.
- The FantasyIDiap benchmark validates generalization across languages, devices, and manipulation types.
Strengths
- High accuracy (84.31%), AUC (90.78%), and F1 (88.61%) on binary manipulation detection.
- High Dice scores (>0.9) on clear, large manipulations.
- Demonstrated cross-lingual and cross-device robustness.
Limitations
- Moderate average Dice (57.24%) with substantial variance, indicating challenges in localizing subtle text inpainting attacks.
- CPU inference speed (2.1 s/image) may not meet real-time constraints.
- Unverified generalization to alternative manipulation types (e.g., neural style transfer).
Future Directions
- Frequency-domain feature extraction (DCT, wavelet analysis) for improved artifact detection.
- Adversarial training to increase resilience against adaptive synthetic attacks.
- Pruning, quantization, or distillation for edge device deployment.
- Incorporation of advanced attention mechanisms (e.g., deformable, multi-head cross-attention) to enhance fine-grained localization.
- Evaluation under cross-dataset and video-based protocols to assess transferability and temporal consistency.
TwoHead-SwinFPN thus synthesizes state-of-the-art vision transformer methods, attention modules, and robust multi-task optimization for practical and research-oriented document integrity analysis (Naseeb et al., 19 Jan 2026).