Extending the training-free embedding-dynamics paradigm beyond audio
Extend the training-free embedding-dynamics paradigm used for TRACE to other modalities by constructing and evaluating analogous detectors for deepfake face detection with vision transformers, machine-generated text detection with language models, and cross-modal consistency verification in multimodal foundation models.
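As a rough illustration of what an analogous detector might compute, the sketch below scores a sequence of frozen-encoder embeddings without any training. It assumes, purely for illustration, that "embedding dynamics" can be summarized by the cosine distance between consecutive token embeddings (audio frames, image patches, or text tokens) and that per-layer scores are fused by simple averaging. TRACE's actual statistic is not specified here, and the names `dynamics_map` and `fused_score` and the array layout are hypothetical.

```python
import numpy as np

def dynamics_map(hidden, eps=1e-8):
    """Per-step anomaly map for one layer (illustrative assumption, not TRACE's definition).

    hidden: array of shape (num_tokens, dim) taken from a frozen encoder layer.
    Returns an array of shape (num_tokens - 1,) holding the cosine distance
    between each pair of consecutive token embeddings.
    """
    a, b = hidden[:-1], hidden[1:]
    dot = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return 1.0 - dot / norms

def fused_score(layer_states, layers):
    """Average the mean dynamics over a set of layers (hypothetical fusion rule).

    layer_states: array of shape (num_layers, num_tokens, dim).
    layers: iterable of layer indices to fuse, e.g. range(15, 22).
    """
    per_layer = [dynamics_map(layer_states[l]).mean() for l in layers]
    return float(np.mean(per_layer))

# Usage with random stand-in embeddings; a real pipeline would take these
# from a frozen vision transformer, language model, or multimodal encoder.
rng = np.random.default_rng(0)
states = rng.normal(size=(24, 50, 16))  # (layers, tokens, dim)
score = fused_score(states, range(15, 22))
```

Because the functions only see an array of hidden states, the same scoring code could in principle be reused across modalities. Which layers to fuse and how to threshold the score would need to be determined empirically for each encoder.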
References
Several directions remain open: frame-level anomaly maps could enable segment-level localization, directly addressing the short-spoof-segment weakness on HAD and ADD 2023; multi-layer fusion across layers 15–21 may improve robustness beyond the single optimal layer; and the same paradigm could extend beyond audio to deepfake face detection via vision transformers, machine-generated text detection via LLMs, or cross-modal consistency verification in multimodal foundation models.