FantasyID Initiative: AI Video & Forgery Benchmark
- FantasyID Initiative is a dual-focused research area covering identity-preserving text-to-video generation and benchmarking forgery detection on digital ID documents.
- The framework employs a tuning-free diffusion approach that fuses 2D and 3D face features to ensure superior facial identity retention and dynamic video synthesis.
- Rigorous evaluations reveal that current forgery detectors struggle with advanced face-swap attacks, underscoring critical vulnerabilities in KYC systems.
The FantasyID initiative encompasses two distinct but thematically linked efforts in generative AI and digital security: (1) the FantasyID framework for identity-preserving text-to-video generation leveraging diffusion transformers and explicit face knowledge, and (2) the FantasyID dataset, designed as a public resource for benchmarking forgery detection on digitally manipulated ID documents. These contributions directly address emerging technological challenges: maintaining dynamic facial fidelity in generative video while preserving identity, and benchmarking the robustness of document forgery detectors in the era of advanced generative attacks. Both lines of research present rigorous methodologies, technical innovation, and public benchmarks that have catalyzed further work in video generation, KYC (Know Your Customer) fraud detection, and the analysis of generative manipulation traces (Zhang et al., 19 Feb 2025, Korshunov et al., 28 Jul 2025).
1. FantasyID for Identity-Preserving Text-to-Video Generation
FantasyID introduces a tuning-free framework for identity-preserving text-to-video generation (IPT2V) that enhances large pre-trained video diffusion transformers (DiT) with explicit, multi-modal facial identity conditioning. The core workflow requires no fine-tuning of the base video generative model, achieving superior identity preservation and facial dynamics through architectural and data-driven innovations (Zhang et al., 19 Feb 2025).
Key architectural steps include:
- Extraction of 2D facial features from reference images using a convolutional "Face Abstractor."
- Reconstruction of a 3D sparse facial mesh via the DECA pipeline, generating FLAME point clouds.
- Fusion of 2D and 3D face features in a small transformer ("Fusion Transformer"), yielding a compact identity descriptor.
- Adaptive, layer-aware injection of the fused identity signal into each DiT layer through a lightweight attention-and-residual network, allowing layer-wise control of identity versus dynamic-motion cues.
During sampling, the denoiser is conditioned jointly on the text prompt and the fused identity descriptor, ensuring that prompt-following video synthesis retains temporal consistency and an explicit identity across frames.
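The data flow above can be sketched in a few lines. The token counts, channel width, and the single mixing step standing in for the Fusion Transformer are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes, for illustration only.
n_2d, n_3d, d = 16, 64, 32                 # 2D face tokens, 3D mesh tokens, channels

face_2d = rng.standard_normal((n_2d, d))   # "Face Abstractor" output (assumed shape)
face_3d = rng.standard_normal((n_3d, d))   # downsampled FLAME point tokens (assumed shape)

# Fusion Transformer stand-in: concatenate both token sets, apply one learned
# token-wise mixing, and mean-pool into a compact identity descriptor.
tokens = np.concatenate([face_2d, face_3d], axis=0)   # (n_2d + n_3d, d)
W_mix = rng.standard_normal((d, d)) / np.sqrt(d)
fused = np.tanh(tokens @ W_mix)                       # token-wise mixing
identity_descriptor = fused.mean(axis=0)              # compact (d,) descriptor

print(identity_descriptor.shape)  # (32,)
```

The descriptor produced here is what would then be injected, layer by layer, into the DiT denoiser alongside the text conditioning.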
2. 3D Facial Geometry Prior and 2D/3D Feature Fusion
The incorporation of a 3D facial geometry prior addresses key shortcomings of prior text-to-video approaches, notably poor preservation of facial structure and degenerate "copy-paste" artifacts repeated across video frames.
Concretely:
- The reference image is processed by DECA to yield a sparse FLAME mesh.
- SpiralNet++ downsampling, together with depth-based positional encodings, produces a compact set of 3D geometry tokens.
- During denoising, these 3D tokens are injected into the conditional denoising network.
2D features are augmented via random selection from a multi-view pool of reference frames, maximizing pose diversity via head-pose distances between candidate frames. This forces the Face Abstractor to learn a distribution over facial appearances, improving temporal facial dynamics and mitigating overfitting to a single view.
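Pose-diverse view selection can be sketched with a greedy farthest-point strategy. The function name and the Euclidean distance over (yaw, pitch, roll) angles are stand-ins for the paper's head-pose distance, not its exact formulation:

```python
import numpy as np

def select_diverse_views(poses, k):
    """Greedily pick k reference frames whose head poses (yaw, pitch, roll,
    in degrees) are maximally spread out, using pairwise Euclidean pose
    distance as a stand-in for the paper's head-pose distance."""
    poses = np.asarray(poses, dtype=float)
    chosen = [0]  # start from the first frame (arbitrary anchor)
    while len(chosen) < k:
        d = np.linalg.norm(poses[:, None, :] - poses[chosen][None, :, :], axis=-1)
        nearest = d.min(axis=1)      # distance to the closest already-chosen pose
        nearest[chosen] = -np.inf    # never re-pick a chosen frame
        chosen.append(int(nearest.argmax()))
    return chosen

# Frontal, left/right profiles, a nod, and a near-duplicate of the frontal view.
poses = [(0, 0, 0), (30, 0, 0), (-30, 0, 0), (0, 20, 0), (1, 1, 0)]
print(select_diverse_views(poses, 3))  # [0, 1, 2]
```

Note that the near-duplicate frame (index 4) is never selected: maximizing the minimum pose distance is exactly what discourages overfitting to a single view.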
3. Layer-Aware Adaptive Injection in Diffusion Transformers
Rather than naïve cross-attention, FantasyID uses a layer-aware, adaptive mechanism for integrating identity signals:
- 2D and 3D tokens are concatenated, passed through a Fusion Transformer, and projected to a compact face descriptor.
- For each DiT layer, queries are constructed from the layer input, while identity-conditioned keys and values are computed from the face descriptor.
- Layer-wise attention fusion and residual injection are performed via a learned conv–norm block, supporting fine-grained balancing of identity preservation and motion modeling throughout the video diffusion process.
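The per-layer injection step can be sketched as cross-attention with a gated residual. The scalar `gate` stands in for the learned conv–norm block, and all names and shapes are illustrative assumptions:

```python
import numpy as np

def inject_identity(layer_input, descriptor, Wq, Wk, Wv, gate):
    """One layer's identity injection: queries come from the DiT layer input,
    keys/values from the fused face-descriptor tokens, followed by a gated
    residual add. The scalar per-layer gate is an illustrative stand-in for
    the learned conv-norm block."""
    Q = layer_input @ Wq                     # (T, d) queries from layer input
    K = descriptor @ Wk                      # (S, d) identity-conditioned keys
    V = descriptor @ Wv                      # (S, d) identity-conditioned values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True) # row-wise softmax
    return layer_input + gate * (attn @ V)   # gated residual injection

rng = np.random.default_rng(1)
d, T, S = 16, 8, 4                           # channels, video tokens, identity tokens
x = rng.standard_normal((T, d))
idt = rng.standard_normal((S, d))
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]
out = inject_identity(x, idt, *Ws, gate=0.1)
print(out.shape)  # (8, 16)
```

Because the gate is per-layer, early layers can pass identity through weakly while later layers lean on it more heavily, which is the "layer-aware" balancing described above.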
4. Training, Evaluation Protocols, and Quantitative Validation
FantasyID is trained end-to-end with the standard latent-space diffusion reconstruction loss; no explicit identity-matching loss is imposed, as identity cues are enforced by the conditioning. Classifier-free guidance is applied to the text prompt, with a null-prompt dropout rate of 0.1 during training.
Training details:
- AdamW optimizer with a cosine-with-restarts learning-rate schedule
- Batch size 16, 90k steps, 16 A100 GPUs
- 50 denoising steps at inference
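The classifier-free guidance combine used at sampling time is standard; a minimal sketch (the function name and guidance weight `w` are generic, not paper-specific):

```python
import numpy as np

def cfg_step(eps_uncond, eps_cond, w):
    """Classifier-free guidance: the denoiser is run with and without the
    text prompt, and the two noise predictions are blended with guidance
    weight w (w = 1 recovers the purely conditional prediction)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# During training, the prompt is dropped with probability 0.1 so the model
# also learns the unconditional branch used above.
rng = np.random.default_rng(2)
e_u, e_c = rng.standard_normal(4), rng.standard_normal(4)
assert np.allclose(cfg_step(e_u, e_c, 1.0), e_c)   # w = 1: pure conditional
assert np.allclose(cfg_step(e_u, e_c, 0.0), e_u)   # w = 0: unconditional
```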
Quantitative results (on 50 held-out reference faces, 1K prompts):
| Method | FID↓ | RS↑ | IFS↑ | FM↑ |
|---|---|---|---|---|
| ID-Animator | 138.3 | 0.35 | 0.98 | 0.18 |
| ConsisID | 149.7 | 0.47 | 0.93 | 0.54 |
| FantasyID | 142.5 | 0.57 | 0.95 | 0.61 |
User study (N=32) ratings on overall quality (OQ), face similarity (F-Sim), facial structure (F-Str), and dynamics (FD):
| Method | OQ | F-Sim | F-Str | FD |
|---|---|---|---|---|
| ID-Animator | 4.38 | 6.20 | 5.82 | 3.28 |
| ConsisID | 7.85 | 7.79 | 6.44 | 7.12 |
| FantasyID | 8.39 | 8.68 | 8.10 | 7.94 |
Qualitative analysis shows that the explicit 3D prior provides fine control of facial width/jaw shape, multi-view training produces more expressive faces, and layer-aware injection counteracts the instability from over-structuring or over-detail in individual DiT layers (Zhang et al., 19 Feb 2025).
5. The FantasyID Dataset for ID-Document Forgery Detection
The FantasyID dataset is established as a rigorous benchmark for detection of digitally manipulated ID documents in KYC and related identity authentication scenarios (Korshunov et al., 28 Jul 2025). Its design reflects the realities of generative-document attacks, providing a legally unencumbered, multi-lingual, and multi-device corpus.
Key statistics:
- Card templates: 13 “themes” with diverse design elements and languages (Arabic, Chinese, English, French, Hindi, Persian, Portuguese, Russian, Turkish, Ukrainian).
- Faces: Sourced from AMFD, Face London Research Dataset, HQ-WMCA, and public-domain Flickr images.
- Total unique digital cards: 362.
- Bona fide captures: 1,086 (362 cards × 3 captures via iPhone 15 Pro, Huawei Mate 30, Kyocera TASKalfa 2554ci scanner).
- Forgery samples: 1,572 train-val manipulations (face/text generated by InSwapper, FaceDancer, DiffSTE, TextDiffuser2); 1,085 test manipulations including face and text attacks.
- Image format and resolution: Full-res JPEGs (smartphones ≈12MP, scanner ≈600 dpi), no artificial compression beyond device defaults.
Pipeline steps include random card field generation, high-fidelity printing (Evolis Primacy 2), controlled device capture, and simulated digital forgeries using state-of-the-art inpainting and face-swapping tools.
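The bona fide capture count follows directly from the dataset design; a trivial manifest sketch (card and device identifiers are hypothetical labels, only the counts come from the source):

```python
from itertools import product

# Hypothetical manifest: each of the 362 unique digital cards is captured
# once per device (device list per the dataset description).
cards = [f"card_{i:03d}" for i in range(362)]
devices = ["iphone15pro", "huawei_mate30", "kyocera_taskalfa_2554ci"]

bona_fide = [(c, d) for c, d in product(cards, devices)]
print(len(bona_fide))  # 1086 = 362 cards x 3 captures
```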
6. Benchmark Evaluation and Advancements in Forgery Detection
Modern forgery detectors were evaluated on FantasyID using their released pretrained weights, without additional fine-tuning:
- TruFor (multi-branch Transformer + Noiseprint++)
- MMFusion (TruFor extension with SRM/Bayar inputs)
- UniFD (linear classifier on CLIP ViT-L/14 features)
- FatFormer (CLIP ViT-L/14 with forgery-aware adapters)
At an operational threshold set for a 10% false positive rate (FPR) on validation data, the observed false negative rates (FNR) on test forgeries are high:
| Model | AUC | FPR (%) | FNR (%) | HTER (%) |
|---|---|---|---|---|
| TruFor | 93.5 | 4.9 | 62.0 | 33.4 |
| MMFusion | 94.4 | 4.0 | 47.7 | 25.8 |
| UniFD | 52.0 | 8.3 | 92.7 | 50.5 |
| FatFormer | 53.5 | 6.5 | 92.3 | 49.4 |
Face-swap manipulations (Attack-2) are especially challenging, producing overlapping score distributions for real and fake samples, with false negative rates above 47% even for the strongest method (MMFusion). This points to a critical vulnerability in current detector architectures when facing advanced face-injection attacks. The dataset, released under CC BY 4.0, supports both non-commercial and commercial research use.
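The operating-point protocol (threshold at 10% validation FPR, then FNR and HTER on test) can be sketched as follows, on synthetic scores. The score polarity (higher = more likely fake) is an assumption:

```python
import numpy as np

def operating_point(bona_scores, forgery_scores, target_fpr=0.10):
    """Pick the decision threshold yielding the target false positive rate on
    bona fide (validation) scores, then report FPR, FNR, and
    HTER = (FPR + FNR) / 2 at that threshold. Assumes higher score = more
    likely forged."""
    thr = np.quantile(bona_scores, 1.0 - target_fpr)  # top 10% of bona fide flagged
    fpr = float(np.mean(bona_scores >= thr))          # bona fide wrongly rejected
    fnr = float(np.mean(forgery_scores < thr))        # forgeries wrongly accepted
    return thr, fpr, fnr, (fpr + fnr) / 2

rng = np.random.default_rng(3)
bona = rng.normal(0.2, 0.1, 1000)   # synthetic bona fide scores
fake = rng.normal(0.5, 0.2, 1000)   # synthetic forgery scores
thr, fpr, fnr, hter = operating_point(bona, fake)
```

Overlapping score distributions, as reported for the face-swap attack, push the FNR up at any fixed low-FPR threshold, which is exactly what the table above shows.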
7. Applications, Limitations, and Future Directions
FantasyID (framework and dataset) has direct applications in several domains:
- Advanced, prompt-driven video generation with explicit identity control and dynamic facial synthesis.
- Empirical benchmarking and development of document forgery detectors for KYC and related security-critical pipelines.
- Multilingual and multi-device assessment of generative manipulation detection in realistic, print–scan–capture workflows.
- Study of algorithmic weaknesses in face and text injection detection.
Limitations noted include the use of non-governmental ("fantasy") card templates without physical anti-counterfeiting features, exclusion of physical tampering attacks, and relatively clean bona fide image acquisition conditions. No explicit identity loss is imposed in the generation objective, and current state-of-the-art forgery detectors underperform on face-centric manipulations.
A plausible implication is that further research should focus on simultaneous multi-identity conditioning, explicit identity-matching objectives, generalization to full-body video or scene-level identity preservation, and robust KYC-grade detection under adversarial generative attacks (Zhang et al., 19 Feb 2025, Korshunov et al., 28 Jul 2025).