VIDSTAMP: A Temporally-Aware Watermark for Ownership and Integrity in Video Diffusion Models

Published 2 May 2025 in cs.CV, cs.CR, and cs.LG | (2505.01406v1)

Abstract: The rapid rise of video diffusion models has enabled the generation of highly realistic and temporally coherent videos, raising critical concerns about content authenticity, provenance, and misuse. Existing watermarking approaches, whether passive, post-hoc, or adapted from image-based techniques, often struggle to withstand video-specific manipulations such as frame insertion, dropping, or reordering, and typically degrade visual quality. In this work, we introduce VIDSTAMP, a watermarking framework that embeds per-frame or per-segment messages directly into the latent space of temporally-aware video diffusion models. By fine-tuning the model's decoder through a two-stage pipeline, first on static image datasets to promote spatial message separation, and then on synthesized video sequences to restore temporal consistency, VIDSTAMP learns to embed high-capacity, flexible watermarks with minimal perceptual impact. Leveraging architectural components such as 3D convolutions and temporal attention, our method imposes no additional inference cost and offers better perceptual quality than prior methods, while maintaining comparable robustness against common distortions and tampering. VIDSTAMP embeds 768 bits per video (48 bits per frame) with a bit accuracy of 95.0%, achieves a log P-value of -166.65 (lower is better), and maintains a video quality score of 0.836, comparable to unwatermarked outputs (0.838) and surpassing prior methods in capacity-quality tradeoffs. Code: \url{https://github.com/SPIN-UMass/VidStamp}

Summary

  • The paper introduces VidStamp, a framework that fine-tunes a video diffusion model's decoder to embed watermarks with temporal consistency.
  • The paper demonstrates high watermark capacity by embedding 768 bits per video with 95% extraction accuracy and effective tamper localization.
  • The paper shows that VidStamp preserves video quality comparable to unwatermarked outputs while resisting common distortions.

Here is a detailed summary of the paper "VIDSTAMP: A Temporally-Aware Watermark for Ownership and Integrity in Video Diffusion Models" (2505.01406).

The paper addresses the critical need for robust watermarking techniques for videos generated by increasingly realistic diffusion models. Existing watermarking methods often struggle with video-specific challenges like temporal manipulations (frame insertion, dropping, reordering) and can degrade visual quality. Passive methods, which detect statistical artifacts, are becoming ineffective against modern models. Post-hoc watermarking, applied after generation, is often fragile and easily removed. While image-based watermarking integrated into diffusion models shows promise, directly applying it frame-by-frame to video ignores temporal dependencies.

VidStamp is proposed as a novel framework to embed per-frame or per-segment messages directly into the latent space of temporally-aware video diffusion models during the generation process. The core idea is to fine-tune the model's decoder, leveraging its inherent components like 3D convolutions and temporal attention, to carry watermark information across frames.

The implementation of VidStamp involves a two-stage fine-tuning pipeline for the video diffusion model's decoder:

  1. Stage 1 (Image Fine-tuning): The decoder is initially fine-tuned on a static image dataset (like COCO). Images are treated as independent frames in a pseudo-video batch. This stage encourages the decoder to learn how to embed distinct messages into individual frames, promoting spatial separation of the watermark signal.
  2. Stage 2 (Video Fine-tuning): The decoder is then fine-tuned using synthetic videos generated by the same diffusion model. This stage adapts the decoder to the temporal dynamics of video, ensuring temporal consistency of the watermark while preserving the ability for frame-level embedding and high generation fidelity.

During training, a fixed set of messages (bit-strings) is assigned to frames or segments. The decoder processes the latent representations of the frames and outputs watermarked frames. A pre-trained message extractor (adapted from the HiDDeN architecture) recovers the message from each output frame. The training objective is a weighted sum of:

  • Message Loss: Binary cross-entropy ($\mathcal{L}_{\text{msg}}$) between the extracted and ground-truth bits, promoting accurate message recovery.
  • Perceptual Loss: Watson-VGG loss ($\mathcal{L}_{\text{perc}}$) computed frame by frame, minimizing visual differences between original and watermarked frames to ensure imperceptibility.

Only the decoder weights are updated during training; the VAE encoder and the message extractor are kept frozen.
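The weighted objective can be sketched as follows. This is a minimal illustration, not the paper's implementation: the loss weights `lambda_msg` and `lambda_perc` are placeholder values, a plain-Python BCE stands in for the real message extractor's output, and the Watson-VGG perceptual term is passed in as a precomputed scalar.

```python
import math

def bce(pred_probs, target_bits):
    """Binary cross-entropy between extractor sigmoid outputs and ground-truth bits."""
    eps = 1e-12
    return -sum(
        t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
        for p, t in zip(pred_probs, target_bits)
    ) / len(target_bits)

def total_loss(pred_probs, target_bits, perceptual_term,
               lambda_msg=1.0, lambda_perc=1.0):
    """Weighted sum of message loss and perceptual loss, mirroring the
    two-term objective. lambda_* are placeholder weights, not the paper's values."""
    return lambda_msg * bce(pred_probs, target_bits) + lambda_perc * perceptual_term
```

In the actual pipeline, gradients from this combined loss flow only into the decoder, since the encoder and extractor are frozen.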

A key benefit of VidStamp is that it introduces no additional inference cost during video generation, as the watermarking is integrated into the existing decoding process. It offers high capacity, embedding 768 bits per video (48 bits per frame across 16 frames) in the main experiments, significantly higher than comparison baselines.

VidStamp supports flexible capacity control through either per-frame embedding (each frame gets a unique message) or segment-wise embedding. In segment-wise embedding, the video is divided into fixed-length segments (K frames), and the same message is embedded across all frames within a segment. This reduces the total unique message count and can be beneficial for longer videos or specific application requirements.
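The frame-to-message assignment for segment-wise embedding reduces to integer division of the frame index by the segment length. A small sketch (the function name is illustrative, not from the paper):

```python
def assign_segment_messages(num_frames, segment_len, messages):
    """Map each frame index to its segment's message.

    With segment length K, frames [0..K-1] share messages[0],
    frames [K..2K-1] share messages[1], and so on. Per-frame embedding
    is the special case segment_len == 1.
    """
    assignment = []
    for frame in range(num_frames):
        seg = frame // segment_len
        assignment.append(messages[seg])
    return assignment
```

For a 16-frame video with K = 4, only four unique messages are needed, and each is repeated across its segment, which is what yields the redundancy noted in the ablations.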

A practical application enabled by VidStamp is temporal tamper localization. By embedding unique (or segment-repeated) messages at the frame level, the paper proposes an algorithm to detect and locate manipulations like frame insertion, deletion, or swapping. The algorithm compares the decoded message of each frame in a potentially tampered video against the set of original template messages using Hamming similarity. Frames are flagged as inserted if no match surpasses a threshold, or assigned to the best-matching original index otherwise. This allows identification of exactly which frames were affected.
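The matching step described above can be sketched directly: compare each decoded frame message against the template set by Hamming similarity, flag frames with no match above the threshold as inserted, and otherwise assign the best-matching original index. The function names are illustrative; the 0.8 threshold follows the paper's evaluation setting.

```python
def hamming_similarity(a, b):
    """Fraction of matching bits between two equal-length bit strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def localize_tampering(decoded, templates, threshold=0.8):
    """For each decoded per-frame message, return the best-matching original
    frame index, or None if no template clears the similarity threshold
    (flagging the frame as inserted)."""
    results = []
    for msg in decoded:
        sims = [hamming_similarity(msg, t) for t in templates]
        best = max(range(len(templates)), key=lambda i: sims[i])
        results.append(best if sims[best] >= threshold else None)
    return results
```

Dropped frames then show up as original indices missing from the output, and swaps as indices appearing out of order.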

The authors evaluate VidStamp using Stable Video Diffusion (SVD) as the base model. They compare it against post-hoc methods, RivaGAN (Zhang et al., 2019) and VideoSeal (Fernandez et al., 2024), and the integrated VideoShield (Hu et al., 2025). Evaluation metrics include:

  • Watermark Performance: Bit Accuracy and Log P-Value (a statistical measure considering both accuracy and capacity, lower is better).
  • Video Quality: VBench metrics covering Subject Consistency, Background Consistency, Motion Smoothness, Aesthetic Quality, and Imaging Quality.
  • Robustness: Bit accuracy/Log P-value under 11 common video distortions (resize, JPEG, crop, rotation, brightness, contrast, saturation, sharpness, Gaussian noise, MPEG4).
  • Tamper Localization: Accuracy in identifying swapped, inserted, or dropped frames, and combinations thereof.
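The log P-value metric above jointly reflects accuracy and capacity. One common formulation (an assumption here; the paper may compute its statistic differently) takes the base-10 log of the binomial tail probability that a random, unwatermarked extractor would match at least k of n embedded bits by chance (each bit matching with probability 1/2):

```python
import math

def log10_binomial_tail(n, k):
    """log10 of P(X >= k) for X ~ Binomial(n, 0.5): the chance that random
    extraction matches at least k of n watermark bits."""
    def log_comb(n, r):
        # log of the binomial coefficient via log-gamma, stable for large n
        return math.lgamma(n + 1) - math.lgamma(r + 1) - math.lgamma(n - r + 1)
    # log-sum-exp over the upper tail, then divide out 2^n
    terms = [log_comb(n, r) for r in range(k, n + 1)]
    m = max(terms)
    log_tail = m + math.log(sum(math.exp(t - m) for t in terms)) - n * math.log(2)
    return log_tail / math.log(10)
```

With n = 768 bits at 95% accuracy (k = 730 matched bits), this tail is roughly 10^-167, in line with the reported -166.65, which is why a high-capacity watermark can achieve far stronger statistical confidence than low-capacity baselines at similar bit accuracy.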

Key experimental results demonstrate:

  • Main Results: VidStamp achieves an average video quality score (0.836) nearly identical to the unwatermarked SVD output (0.838) and comparable to or better than baselines. It embeds 768 bits per video with 95.0% accuracy. Crucially, it achieves a significantly better log P-value (-166.65) than baselines (VideoShield: -149.0, VideoSeal: -26.9, RivaGAN: -9.6), indicating higher statistical confidence in watermark detection despite higher capacity. Visual analysis confirms changes are imperceptible and often localized along edges.
  • Robustness: While performance varies by distortion type, VidStamp's log P-value is better than baselines in 5 out of 11 tested distortions, showcasing strong overall robustness even at high capacity.
  • Tamper Localization: Using a similarity threshold of 0.8, VidStamp achieves over 95% accuracy in localizing swapped, inserted, and dropped frames, and their combinations, demonstrating its effectiveness for integrity verification.
  • Ablation Studies: Varying segment size (K) shows a slight bit-accuracy increase with larger K (due to redundancy), stable video quality, and consistent robustness across different K values. Testing against more aggressive tampering (multiple swaps/drops/insertions) shows localization accuracy remains high, especially with noise insertions, which are easily detected.

Limitations acknowledged by the authors include the requirement for direct access to and modification of the diffusion model's decoder (precluding black-box API use), the computational cost of the two-stage fine-tuning process, and the need for further testing against sophisticated targeted adversarial removal attacks.

In conclusion, VidStamp presents a practical and effective method for integrating high-capacity, temporally-aware watermarks into video diffusion models during generation. By fine-tuning the decoder, it achieves robust embedding with minimal quality degradation, no inference overhead, and supports precise frame-level tamper localization, offering a valuable tool for provenance tracking and integrity verification of AI-generated video.
