- The paper introduces TGIF2, a dataset with 270K+ manipulated images, to expose forensic vulnerabilities in text-guided inpainting techniques.
- It details a robust pipeline integrating semantic and random masks with both spliced and fully regenerated variants across multiple inpainting models.
- Forensic evaluations reveal high detection for spliced forgeries but significant performance drops on fully regenerated images and under super-resolution attacks.
TGIF2: Advancements in Text-Guided Inpainting Forgery Datasets and Forensic Benchmarking
Introduction
Text-guided inpainting, enabled by advanced generative AI diffusion models, has become a dominant paradigm for high-fidelity, semantically controlled image editing. These capabilities, previously requiring technical proficiency, are now accessible to non-experts through prompt-driven interfaces in models such as Stable Diffusion (SD2/SDXL), Adobe Firefly, and FLUX.1. However, such democratization of manipulation intensifies threats to multimedia forensics. Image Forgery Localization (IFL) and Synthetic Image Detection (SID) pipelines, which once withstood simple splicing and classic inpainting, are now failing silently against state-of-the-art regenerative workflows. The "TGIF2: Extended Text-Guided Inpainting Forgery Dataset & Benchmark" (2603.28613) introduces an extended dataset and experimental protocol to systematically expose and analyze these vulnerabilities.

Figure 1: Illustration of authentic image inpainting using different regeneration and splicing strategies with mask-based editing; zoom-ins (e) and (f) highlight subtle, non-semantic variations in regenerated pixels.
Dataset Design and Technical Extensions
TGIF2 extends the original TGIF corpus with three critical innovations: inclusion of the FLUX.1 family of transformer-based and flow-matching inpainting models, the introduction of random, non-semantic masks, and the systematic generation of both spliced (SP) and fully regenerated (FR) variants for each manipulation. The dataset is sourced from MS-COCO and covers 19 subsets with over 270,000 manipulated images, comprising high-resolution content (up to 1024×1024 px) and spanning both open-source (SD2, SDXL, FLUX.1) and commercial (Photoshop/Firefly) generators.
The dataset generation pipeline includes:
- Mask selection via semantic segmentation, bounding box, or random rectangular region,
- Prompt construction from object category and image caption,
- Parameter randomization (inference steps, guidance scale, random seeds),
- Generation of both SP and FR images per inpainting model and mask,
- Quantitative annotation with NIMA, GIQA, BLIP ITM, and preservation fidelity metrics (PSNR, SSIM, LPIPS).
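Two of the steps above, random-mask selection and parameter randomization, can be sketched in a few lines. This is a minimal illustration only: the mask-size bounds and parameter ranges below are assumptions for the example, not values taken from the paper.

```python
import numpy as np

def random_rect_mask(h, w, min_frac=0.1, max_frac=0.4, rng=None):
    """Binary mask with one random rectangle (1 = region to inpaint).

    min_frac/max_frac bound the rectangle's side lengths as fractions of
    the image dimensions; these bounds are illustrative, not from TGIF2.
    """
    rng = rng or np.random.default_rng()
    mh = int(rng.uniform(min_frac, max_frac) * h)
    mw = int(rng.uniform(min_frac, max_frac) * w)
    top = int(rng.integers(0, h - mh + 1))
    left = int(rng.integers(0, w - mw + 1))
    mask = np.zeros((h, w), dtype=np.uint8)
    mask[top:top + mh, left:left + mw] = 1
    return mask

def randomize_params(rng=None):
    """Randomize diffusion inference parameters (ranges are illustrative)."""
    rng = rng or np.random.default_rng()
    return {
        "num_inference_steps": int(rng.integers(20, 51)),   # 20..50 inclusive
        "guidance_scale": float(rng.uniform(3.0, 9.0)),
        "seed": int(rng.integers(0, 2**32)),
    }

mask = random_rect_mask(512, 512, rng=np.random.default_rng(0))
params = randomize_params(np.random.default_rng(0))
```

In the actual pipeline, `mask` and `params` would be fed to each inpainting model (SD2, SDXL, FLUX.1, Firefly) to produce an SP and an FR output per image.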

Figure 2: Authentic image, mask selection, and output comparison for all six inpainting models (SD2, SDXL, Adobe Firefly, and FLUX.1 schnell, dev, and Fill dev), demonstrating diverse inpainting characteristics.

Figure 3: Visual comparison between bounding box, segmentation, and random masks, highlighting the impact of semantic and non-semantic editing strategies.
TGIF2 surpasses contemporaneous datasets (SAGI, COinCO, GRE, BtB, GIM, BR-Gen, OpenSDI) by uniquely combining: (i) support for transformer-based inpainting, (ii) explicit SP/FR differentiation, and (iii) integration of non-object-based (random) masks.
Forensic Evaluation and Model Robustness
Image Forgery Localization (IFL) Findings
Systematic benchmarking on the expanded dataset reveals two key phenomena. First, IFL methods (e.g., CAT-Net, MMFusion, TruFor) are effective on spliced manipulations, especially when manipulations are spatially aligned with objects. However, they fail almost completely (F1 < 0.5) on FR images, regardless of the generative model. Second, on random-mask subsets, some top performers exhibit a substantial performance drop, evidencing a reliance on object/semantic biases rather than truly generalizable forensic evidence.
Fine-tuning IFL models on FR data (across SD2, SDXL, FLUX.1) enables partial recovery of localization capability (F1 up to ~0.75), but training exclusively on semantic masks induces strong localization bias toward objects and poor detection on random masks. Incorporating both semantic and non-semantic FR data in fine-tuning achieves moderate robustness, but cross-model generalization outside the training family remains poor.
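The pixel-level F1 used throughout these IFL benchmarks is a simple overlap statistic between the predicted and ground-truth binary forgery masks. A minimal sketch (the masks here are toy data for illustration):

```python
import numpy as np

def localization_f1(pred, gt, eps=1e-8):
    """Pixel-level F1 between binary forgery masks (1 = manipulated pixel)."""
    pred = np.asarray(pred).astype(bool)
    gt = np.asarray(gt).astype(bool)
    tp = np.logical_and(pred, gt).sum()          # correctly flagged pixels
    precision = tp / (pred.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return 2 * precision * recall / (precision + recall + eps)

gt = np.zeros((8, 8), dtype=np.uint8); gt[2:6, 2:6] = 1
pred = np.zeros((8, 8), dtype=np.uint8); pred[2:6, 2:4] = 1  # half the region
print(round(localization_f1(pred, gt), 3))  # → 0.667
```

An F1 below 0.5 on FR images, as reported above, means the localizer misses most manipulated pixels or flags mostly authentic ones.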
(Figure 4)
Figure 4: Mask selection and model heatmaps: fine-tuned TruFor localizes semantic forgeries correctly but misses non-object-aligned forgeries, illustrating semantic bias.
SID Method Evaluation
Evaluation of 17 state-of-the-art SID approaches demonstrates excellent detection of FR images for prior-generation diffusion models (e.g., B-Free, PatchCraft, RINE-ITW, SPAI: AUC > 0.9 on SD2/SDXL). However, FLUX.1-based forgeries erode detection rates (AUC drops to 0.7–0.85 for strong models), indicating reduced or shifted generation artifacts. No meaningful performance difference is found between semantic and random-mask variants; the SID task remains strictly binary with no localization ability.
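The AUC reported for SID has a direct probabilistic reading: the chance that a detector scores a fake image above a real one. A minimal rank-based sketch, using synthetic detector scores purely for illustration:

```python
import numpy as np

def auc(scores_real, scores_fake):
    """Rank-based AUC: P(score_fake > score_real), ties counted as 0.5."""
    real = np.asarray(scores_real, dtype=float)[:, None]
    fake = np.asarray(scores_fake, dtype=float)[None, :]
    wins = (fake > real).sum() + 0.5 * (fake == real).sum()
    return wins / (real.size * fake.size)

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 200)   # authentic images: lower detector scores
fake = rng.normal(1.5, 1.0, 200)   # fully regenerated images: higher scores
print(auc(real, fake))
```

An AUC of 0.5 means the detector is at chance; the drop from >0.9 on SD2/SDXL to 0.7–0.85 on FLUX.1 quantifies how much closer the two score distributions have moved.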
Super-Resolution Attacks
Application of generative super-resolution (Real-ESRGAN) reveals that such post-processing can erase or obscure forensic evidence: IFL F1 scores collapse (drops of 0.5–0.8), and SID AUC degrades for most top performers. Bicubic interpolation, in contrast, has negligible effect, underscoring the destructive potential of learned restoration.
Quantitative Insights and Metric Correlation Analysis
Across subsets, generative quality scores (NIMA, GIQA, ITM, preservation fidelity) do not consistently correlate with forensic performance at the subset level; per-image analysis weakly associates higher aesthetic quality with increased localization F1, but the effect is inconsistent. Notably, FLUX.1 FR images, although rated higher quality (per GIQA/NIMA), are more detectable in fine-tuned settings than SD2, suggesting that perceptual quality, as measured by current assessment tools, does not directly inform forensic vulnerability.
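The per-image association described above is the kind of relationship a Pearson correlation between quality scores and localization F1 would capture. A minimal sketch on synthetic data (the score distributions and the weak-positive effect size are assumptions for illustration, not the paper's measurements):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between per-image quality scores and F1 values."""
    return float(np.corrcoef(x, y)[0, 1])

# Synthetic illustration of a weak positive association.
rng = np.random.default_rng(1)
quality = rng.uniform(3.0, 8.0, 500)                      # NIMA-like scores
f1 = np.clip(0.05 * quality + rng.normal(0, 0.3, 500), 0, 1)
print(pearson_r(quality, f1))
```

A small positive r with high per-image variance matches the paper's finding: the trend exists but is too inconsistent to predict forensic vulnerability from perceptual quality alone.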
Implications and Research Trajectory
Practical and Theoretical Implications:
TGIF2's results demonstrate that both semantic bias and model overfitting are endemic to current IFL/SID strategies. The inability of fine-tuned IFL models to generalize to unseen generative models and non-semantic localization tasks underscores the need for bias-minimized training and evaluation protocols. The degradation of SID and IFL performance under post-hoc enhancement (SR) suggests a fundamental inadequacy of the forensic cues currently exploited. As generative models (e.g., FLUX.1, forthcoming FLUX Kontext/FLUX.2, GPT-4o Image Generator) further abstract or erase generative footprints, forensic pipelines must hybridize semantic and low-level signal analysis.
TGIF2's random-mask protocol provides a critical diagnostic test for semantic overfitting, while the inclusion of new-generation models ensures benchmarks do not become stale as diffusion technologies rapidly iterate.
Shortcomings and Future Directions:
Current forensic evaluation is limited by metric selection (standard quality metrics like NIMA/GIQA are not region-aware), generator diversity, and by not yet incorporating fully mask-free generative editors. Future iterations should address intra-image quality contrast, develop region-aware or transformer-based bias correction methods, and expand to contextually masked/auto-masked generative paradigms.
Conclusion
TGIF2 (2603.28613) is a critical resource for advancing text-guided inpainting forgery detection and localization. By exposing and quantifying generalization failures, semantic biases, and the vulnerability of forensics to post-processing, it presents practitioners and researchers with both a formidable benchmark and an open challenge: developing robust, generalizable, and bias-resistant forensic methods that can keep pace with rapidly advancing generative AI.