- The paper introduces TGIF2, a dataset with 270K+ manipulated images, to expose forensic vulnerabilities in text-guided inpainting techniques.
- It details a robust pipeline integrating semantic and random masks with both spliced and fully regenerated variants across multiple inpainting models.
- Forensic evaluations reveal high detection for spliced forgeries but significant performance drops on fully regenerated images and under super-resolution attacks.
TGIF2: Advancements in Text-Guided Inpainting Forgery Datasets and Forensic Benchmarking
Introduction
Text-guided inpainting, enabled by advanced generative AI diffusion models, has become a dominant paradigm for high-fidelity, semantically controlled image editing. These capabilities, previously requiring technical proficiency, are now accessible to non-experts through prompt-driven interfaces in models such as Stable Diffusion (SD2/SDXL), Adobe Firefly, and FLUX.1. However, such democratization of manipulation intensifies threats to multimedia forensics. Image Forgery Localization (IFL) and Synthetic Image Detection (SID) pipelines, which once withstood simple splicing and classic inpainting, are now failing silently against state-of-the-art regenerative workflows. The "TGIF2: Extended Text-Guided Inpainting Forgery Dataset & Benchmark" (2603.28613) introduces an extended dataset and experimental protocol to systematically expose and analyze these vulnerabilities.

Figure 1: Illustration of authentic image inpainting using different regeneration and splicing strategies with mask-based editing; zoom-ins (e) and (f) highlight subtle, non-semantic variations in regenerated pixels.
Dataset Design and Technical Extensions
TGIF2 extends the original TGIF corpus with three critical innovations: inclusion of the FLUX.1 family of transformer-based and flow-matching inpainting models, the introduction of random, non-semantic masks, and the systematic generation of both spliced (SP) and fully regenerated (FR) variants for each manipulation. The dataset is sourced from MS-COCO and covers 19 subsets with over 270,000 manipulated images, comprising high-resolution content (up to 1024×1024 px) and spanning both open-source (SD2, SDXL, FLUX.1) and commercial (Photoshop/Firefly) generators.
The dataset generation pipeline includes:
- Mask selection via semantic segmentation, bounding box, or random rectangular region,
- Prompt construction from object category and image caption,
- Parameter randomization (inference steps, guidance scale, random seeds),
- Generation of both SP and FR images per inpainting model and mask,
- Quantitative annotation with NIMA, GIQA, BLIP ITM, and preservation fidelity metrics (PSNR, SSIM, LPIPS).
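Two of the steps above, random-mask selection and parameter randomization, can be sketched in a few lines. This is a minimal illustration only: the mask-size bounds and parameter ranges below are assumptions for the example, not values taken from the paper.

```python
import numpy as np

def random_rect_mask(h, w, min_frac=0.1, max_frac=0.4, rng=None):
    """Binary mask with one random rectangle (1 = region to inpaint).

    min_frac/max_frac bound the rectangle's side lengths as fractions of
    the image dimensions; these bounds are illustrative, not from TGIF2.
    """
    rng = rng or np.random.default_rng()
    mh = int(rng.uniform(min_frac, max_frac) * h)
    mw = int(rng.uniform(min_frac, max_frac) * w)
    top = int(rng.integers(0, h - mh + 1))
    left = int(rng.integers(0, w - mw + 1))
    mask = np.zeros((h, w), dtype=np.uint8)
    mask[top:top + mh, left:left + mw] = 1
    return mask

def randomize_params(rng=None):
    """Randomize diffusion inference parameters (ranges are illustrative)."""
    rng = rng or np.random.default_rng()
    return {
        "num_inference_steps": int(rng.integers(20, 51)),   # 20..50 inclusive
        "guidance_scale": float(rng.uniform(3.0, 9.0)),
        "seed": int(rng.integers(0, 2**32)),
    }

mask = random_rect_mask(512, 512, rng=np.random.default_rng(0))
params = randomize_params(np.random.default_rng(0))
```

In the actual pipeline, `mask` and `params` would be fed to each inpainting model (SD2, SDXL, FLUX.1, Firefly) to produce an SP and an FR output per image.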

Figure 2: Authentic image, mask selection, and output comparison for all six inpainting models (SD2, SDXL, Adobe Firefly, and FLUX.1 schnell, dev, and Fill dev), demonstrating diverse inpainting characteristics.

Figure 3: Visual comparison between bounding box, segmentation, and random masks, highlighting the impact of semantic and non-semantic editing strategies.
TGIF2 surpasses contemporaneous datasets (SAGI, COinCO, GRE, BtB, GIM, BR-Gen, OpenSDI) by uniquely combining: (i) support for transformer-based inpainting, (ii) explicit SP/FR differentiation, and (iii) integration of non-object-based (random) masks.
Forensic Evaluation and Model Robustness
Image Forgery Localization (IFL) Findings
Systematic benchmarking on the expanded dataset reveals two key phenomena. First, IFL methods (e.g., CAT-Net, MMFusion, TruFor) are effective on spliced manipulations, especially when manipulations are spatially aligned with objects. However, they fail almost completely (F1 < 0.5) on FR images, regardless of the generative model. Second, on random-mask subsets, some top performers exhibit a substantial performance drop, evidencing a reliance on object/semantic biases rather than truly generalizable forensic evidence.
Fine-tuning IFL models on FR data (across SD2, SDXL, FLUX.1) enables partial recovery of localization capability (F1 up to ~0.75), but training exclusively on semantic masks induces strong localization bias toward objects and poor detection on random masks. Incorporating both semantic and non-semantic FR data in fine-tuning achieves moderate robustness, but cross-model generalization outside the training family remains poor.
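The pixel-level F1 used throughout these IFL benchmarks is a simple overlap statistic between the predicted and ground-truth binary forgery masks. A minimal sketch (the masks here are toy data for illustration):

```python
import numpy as np

def localization_f1(pred, gt, eps=1e-8):
    """Pixel-level F1 between binary forgery masks (1 = manipulated pixel)."""
    pred = np.asarray(pred).astype(bool)
    gt = np.asarray(gt).astype(bool)
    tp = np.logical_and(pred, gt).sum()          # correctly flagged pixels
    precision = tp / (pred.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return 2 * precision * recall / (precision + recall + eps)

gt = np.zeros((8, 8), dtype=np.uint8); gt[2:6, 2:6] = 1
pred = np.zeros((8, 8), dtype=np.uint8); pred[2:6, 2:4] = 1  # half the region
print(round(localization_f1(pred, gt), 3))  # → 0.667
```

An F1 below 0.5 on FR images, as reported above, means the localizer misses most manipulated pixels or flags mostly authentic ones.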
(Figure 4)
Figure 4: Mask selection and model heatmaps: fine-tuned TruFor localizes semantic forgeries correctly but misses non-object-aligned forgeries, illustrating semantic bias.
SID Method Evaluation
Evaluation of 17 state-of-the-art SID approaches demonstrates excellent detection of FR images for prior-generation diffusion models (e.g., B-Free, PatchCraft, RINE-ITW, SPAI: AUC > 0.9 on SD2/SDXL). However, FLUX.1-based forgeries erode detection rates (AUC drops to 0.7–0.85 for strong models), indicating reduced or shifted generation artifacts. No meaningful performance difference is found between semantic and random-mask variants; the SID task remains strictly binary with no localization ability.
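The AUC reported for SID has a direct probabilistic reading: the chance that a detector scores a fake image above a real one. A minimal rank-based sketch, using synthetic detector scores purely for illustration:

```python
import numpy as np

def auc(scores_real, scores_fake):
    """Rank-based AUC: P(score_fake > score_real), ties counted as 0.5."""
    real = np.asarray(scores_real, dtype=float)[:, None]
    fake = np.asarray(scores_fake, dtype=float)[None, :]
    wins = (fake > real).sum() + 0.5 * (fake == real).sum()
    return wins / (real.size * fake.size)

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 200)   # authentic images: lower detector scores
fake = rng.normal(1.5, 1.0, 200)   # fully regenerated images: higher scores
print(auc(real, fake))
```

An AUC of 0.5 means the detector is at chance; the drop from >0.9 on SD2/SDXL to 0.7–0.85 on FLUX.1 quantifies how much closer the two score distributions have moved.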
Super-Resolution Attacks
Application of generative super-resolution (Real-ESRGAN) reveals that such post-processing can erase or obscure forensic evidence: IFL F1 scores collapse (drops of 0.5–0.8), and SID AUC degrades for most top performers. Bicubic interpolation, in contrast, has negligible effect, underscoring the destructive potential of learned restoration.
Quantitative Insights and Metric Correlation Analysis
Across subsets, generative quality scores (NIMA, GIQA, ITM, preservation fidelity) do not consistently correlate with forensic performance at the subset level; per-image analysis weakly associates higher aesthetic quality with increased localization F1, but the effect is inconsistent. Notably, FLUX.1 FR images, although rated higher quality (per GIQA/NIMA), are more detectable in fine-tuned settings than SD2, suggesting that perceptual quality, as measured by current assessment tools, does not directly inform forensic vulnerability.
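The per-image association described above is the kind of relationship a Pearson correlation between quality scores and localization F1 would capture. A minimal sketch on synthetic data (the score distributions and the weak-positive effect size are assumptions for illustration, not the paper's measurements):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between per-image quality scores and F1 values."""
    return float(np.corrcoef(x, y)[0, 1])

# Synthetic illustration of a weak positive association.
rng = np.random.default_rng(1)
quality = rng.uniform(3.0, 8.0, 500)                      # NIMA-like scores
f1 = np.clip(0.05 * quality + rng.normal(0, 0.3, 500), 0, 1)
print(pearson_r(quality, f1))
```

A small positive r with high per-image variance matches the paper's finding: the trend exists but is too inconsistent to predict forensic vulnerability from perceptual quality alone.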
Implications and Research Trajectory
Practical and Theoretical Implications:
TGIF2's results demonstrate that both semantic bias and model overfitting are endemic to current IFL/SID strategies. The inability of fine-tuned IFL models to generalize to unseen generative models and non-semantic localization tasks underscores the need for bias-minimized training and evaluation protocols. The degradation of SID and IFL performance under post-hoc enhancement (SR) suggests a fundamental inadequacy of the forensic cues currently exploited. As generative models (e.g., FLUX.1, forthcoming FLUX Kontext/FLUX.2, GPT-4o Image Generator) further abstract or erase generative footprints, forensic pipelines must hybridize semantic and low-level signal analysis.
TGIF2's random-mask protocol provides a critical diagnostic test for semantic overfitting, while the inclusion of new-generation models ensures benchmarks do not become stale as diffusion technologies rapidly iterate.
Shortcomings and Future Directions:
Current forensic evaluation is limited by metric selection (standard quality metrics like NIMA/GIQA are not region-aware), generator diversity, and by not yet incorporating fully mask-free generative editors. Future iterations should address intra-image quality contrast, develop region-aware or transformer-based bias correction methods, and expand to contextually masked/auto-masked generative paradigms.
Conclusion
TGIF2 (2603.28613) is a critical resource for advancing text-guided inpainting forgery detection and localization. By exposing and quantifying generalization failures, semantic biases, and the vulnerability of forensics to post-processing, it presents practitioners and researchers with both a formidable benchmark and an open challenge: developing robust, generalizable, and bias-resistant forensic methods that can keep pace with rapidly advancing generative AI.