Extending the training-free embedding-dynamics paradigm beyond audio

Extend the training-free embedding-dynamics paradigm used for TRACE to other modalities by constructing and evaluating analogous detectors in three settings: deepfake face detection with vision transformers, machine-generated text detection with language models, and cross-modal consistency verification in multimodal foundation models.

Background

TRACE demonstrates that first-order dynamics of frozen speech foundation model embeddings can detect partial audio deepfakes without training or labeled data, suggesting a modality-agnostic forensic signal in foundation model geometry.
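The core idea generalizes to any modality that yields a sequence of frozen-encoder embeddings. A minimal sketch of first-order trajectory scoring is below; the function name and the robust z-score rule are illustrative assumptions, not the paper's exact method, and the toy data stands in for real foundation-model embeddings.

```python
import numpy as np

def trajectory_anomaly_scores(embeddings: np.ndarray) -> np.ndarray:
    """Score each frame transition by the magnitude of the first-order
    embedding difference (hypothetical scoring rule, not TRACE's exact one)."""
    # First-order dynamics: per-frame displacement vectors, shape (T-1, D).
    deltas = np.diff(embeddings, axis=0)
    # Step magnitudes along the trajectory.
    speeds = np.linalg.norm(deltas, axis=1)
    # Robust z-score via median/MAD: large jumps suggest splice points,
    # with no training or labels required.
    med = np.median(speeds)
    mad = np.median(np.abs(speeds - med)) + 1e-8
    return (speeds - med) / mad

# Toy usage: a smooth random-walk trajectory with one injected discontinuity.
rng = np.random.default_rng(0)
emb = np.cumsum(rng.normal(0, 0.01, size=(100, 32)), axis=0)
emb[60:] += 5.0                      # simulated splice at frame 60
scores = trajectory_anomaly_scores(emb)
print(int(np.argmax(scores)))        # → 59 (the transition into frame 60)
```

The same scoring would apply unchanged to ViT patch-embedding sequences, LLM token embeddings, or aligned multimodal embedding streams, which is what makes the signal a candidate for a modality-agnostic forensic cue.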

The authors explicitly state that extending this paradigm beyond audio—to images/videos, text, and multimodal consistency—remains an open direction for future research.

References

Several directions remain open: frame-level anomaly maps could enable segment-level localization, directly addressing the short-spoof-segment weakness on HAD and ADD 2023; multi-layer fusion across layers 15--21 may improve robustness beyond the single optimal layer; and the same paradigm could extend beyond audio to deepfake face detection via vision transformers, machine-generated text detection via LLMs, or cross-modal consistency verification in multimodal foundation models.
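The frame-level anomaly maps mentioned above could feed a simple segment localizer. The sketch below is a hypothetical post-processing step (name, threshold, and grouping rule are assumptions) that groups above-threshold frame scores into contiguous spans:

```python
import numpy as np

def anomalous_segments(scores: np.ndarray, thresh: float = 3.0):
    """Group above-threshold frame-level anomaly scores into contiguous
    (start, end) index spans -- a hypothetical localization heuristic."""
    mask = scores > thresh
    segments, start = [], None
    for i, flagged in enumerate(mask):
        if flagged and start is None:
            start = i                      # open a new segment
        elif not flagged and start is not None:
            segments.append((start, i))    # close the segment at i
            start = None
    if start is not None:                  # segment runs to the end
        segments.append((start, len(mask)))
    return segments

print(anomalous_segments(np.array([0.1, 4.0, 5.0, 0.2, 0.1, 3.5])))
# → [(1, 3), (5, 6)]
```

Multi-layer fusion could then be as simple as averaging such score tracks across several transformer layers before segmentation, though the best fusion rule is an open question.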

TRACE: Training-Free Partial Audio Deepfake Detection via Embedding Trajectory Analysis of Speech Foundation Models  (2604.01083 - Khan et al., 1 Apr 2026) in Supplementary Material, Section: Extended Discussion