Towards Dataset Copyright Evasion Attack against Personalized Text-to-Image Diffusion Models

Published 5 May 2025 in cs.CV, cs.AI, and cs.CR (arXiv:2505.02824v1)

Abstract: Text-to-image (T2I) diffusion models have rapidly advanced, enabling high-quality image generation conditioned on textual prompts. However, the growing trend of fine-tuning pre-trained models for personalization raises serious concerns about unauthorized dataset usage. To combat this, dataset ownership verification (DOV) has emerged as a solution, embedding watermarks into the fine-tuning datasets using backdoor techniques. These watermarks remain inactive under benign samples but produce owner-specified outputs when triggered. Despite the promise of DOV for T2I diffusion models, its robustness against copyright evasion attacks (CEA) remains unexplored. In this paper, we explore how attackers can bypass these mechanisms through CEA, allowing models to circumvent watermarks even when trained on watermarked datasets. We propose the first copyright evasion attack (i.e., CEAT2I) specifically designed to undermine DOV in T2I diffusion models. Concretely, our CEAT2I comprises three stages: watermarked sample detection, trigger identification, and efficient watermark mitigation. A key insight driving our approach is that T2I models exhibit faster convergence on watermarked samples during the fine-tuning, evident through intermediate feature deviation. Leveraging this, CEAT2I can reliably detect the watermarked samples. Then, we iteratively ablate tokens from the prompts of detected watermarked samples and monitor shifts in intermediate features to pinpoint the exact trigger tokens. Finally, we adopt a closed-form concept erasure method to remove the injected watermark. Extensive experiments show that our CEAT2I effectively evades DOV mechanisms while preserving model performance.

Summary

  • The paper introduces CEAT2I, a method that leverages early-stage feature deviations to robustly detect watermarked samples in personalized T2I diffusion models.
  • It employs a three-stage approach—detection using L2 norm differences, token ablation for trigger identification, and closed-form weight adjustments for watermark mitigation.
  • Extensive experiments on Stable Diffusion v1.4 demonstrate that CEAT2I significantly reduces watermark success rates while preserving high-quality image generation.

Text-to-image (T2I) diffusion models, like Stable Diffusion, have become powerful tools for content generation. The ability to fine-tune these models on specific datasets for personalization (e.g., mimicking an artist's style or generating brand-specific imagery) raises concerns about unauthorized data usage and copyright infringement. Dataset Ownership Verification (DOV) methods have emerged as a potential solution, often employing backdoor-based watermarking techniques. These methods embed hidden triggers in the training data that, when presented to a fine-tuned model, cause it to produce specific watermarked outputs (such as a logo or a signature), serving as proof that the model was trained on the protected dataset.

However, the effectiveness of these DOV mechanisms against copyright evasion attacks (CEAs) has been largely unexplored. This research investigates how an attacker, who has access to a watermarked dataset but doesn't know which samples are watermarked or the exact trigger, can fine-tune a model to achieve their desired output quality while simultaneously neutralizing the embedded watermark, thereby evading detection.

The paper analyzes existing backdoor removal techniques for T2I models, such as Textual Perturbation Defense (TPD) [24] and T2IShield [25], and finds them insufficient as robust CEAs for DOV. TPD relies on random textual perturbations, which inconsistently affect the hidden trigger. T2IShield attempts to detect watermarked samples by analyzing discrepancies in cross-attention maps; while effective for global watermarks, its performance degrades significantly when the watermark is a small local patch, as the attention differences become negligible.

To address these limitations, the paper proposes a novel Copyright Evasion Attack for T2I diffusion models, named CEAT2I. This method operates in three stages:

  1. Watermarked Sample Detection: CEAT2I observes that during the early stages of fine-tuning on a watermarked dataset, T2I models adapt more rapidly to watermarked samples due to strong trigger-target correlations. This accelerated learning manifests as larger shifts in the intermediate feature representations of watermarked samples compared to benign ones. The method quantifies this difference using the L2 distance between the features of the original pre-trained model and the partially fine-tuned model at an early epoch T_e. By aggregating these feature deviations across multiple layers and applying a thresholding mechanism, CEAT2I can effectively distinguish watermarked samples from benign ones.
    • Implementation Note: This stage requires access to the original pre-trained model and the model after a small number of fine-tuning epochs (T_e). Feature deviations can be computed layer by layer within the denoising U-Net using the encoded latent representations of the image-text pairs.
    • Example: For each sample (x, y), compute the squared L2 deviation ‖features_θ^(i)(z_t, t, c) − features_θ_w^(i)(z_t, t, c)‖² for layers i = 1, …, N, where θ denotes the pre-trained weights and θ_w the partially fine-tuned weights. Normalize these deviations and count how many layers exceed a threshold α1. If this count exceeds α2, the sample is flagged as watermarked.
  2. Trigger Identification: Once watermarked samples are identified, the next step is to pinpoint the trigger tokens within their text prompts. CEAT2I employs a word-level ablation strategy. For each detected watermarked prompt, it iteratively removes individual tokens and observes the impact on the model's intermediate features using the fully fine-tuned model (T_total). Tokens whose removal causes an outlier shift in the feature representation (measured again by L2 distance from the full-prompt features) are identified as potential trigger candidates. A statistical thresholding approach (e.g., flagging tokens with deviation scores greater than μ + σ of all token-wise scores) is used. The token(s) most frequently identified as outliers across the watermarked samples are determined to be the actual trigger.
    • Implementation Note: This involves processing each detected watermarked prompt L times (where L is the number of tokens), each time with a different token removed. The feature deviations are computed at a specific layer (e.g., the second-to-last convolutional layer). Frequency analysis on the identified candidate tokens across all watermarked samples is then performed to find the most likely trigger.
    • Example: For a watermarked prompt {y_w^1, …, y_w^L}, compute feature deviations for the ablated prompts {y_w^2, …, y_w^L}, {y_w^1, y_w^3, …, y_w^L}, and so on. A token y_w^i is a candidate trigger if its removal causes a large feature shift relative to removing other tokens.
  3. Efficient Watermark Mitigation: With the trigger tokens identified, CEAT2I leverages a closed-form concept erasure technique [31] to break the association between the trigger and the watermarked output in the fine-tuned model. This method directly modifies the cross-attention weights in the diffusion model's U-Net. The objective is to learn new weights that map the watermarked prompt (with trigger) to the desired output corresponding to the benign part of the prompt (without trigger), while preserving the model's normal behavior on benign prompts. This is formulated as a linear regression problem on the cross-attention weights, which has a closed-form solution, avoiding expensive re-training.
    • Implementation Note: This involves constructing target text embeddings for watermarked samples (using the benign part of the prompt) and benign text embeddings for benign samples. The cross-attention weight matrix in the fine-tuned model is then updated using the provided closed-form solution based on these embeddings and the original model's weights.
    • Architecture: This process modifies the cross-attention layers within the U-Net component of the diffusion model.
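The detection stage (stage 1) can be illustrated with a minimal NumPy sketch on synthetic features. The function name, the z-score normalization, and the thresholds `alpha1`/`alpha2` below are illustrative stand-ins for the paper's α1/α2 thresholding, not its exact procedure; in practice the feature lists would come from U-Net layers of the pre-trained and early fine-tuned models:

```python
import numpy as np

def detect_watermarked(feat_pre, feat_early, alpha1=1.5, alpha2=0.5):
    """Flag samples whose per-layer feature deviation is anomalously large.

    feat_pre, feat_early: lists of arrays, one per layer, each shaped
    (num_samples, feature_dim) -- features from the pre-trained model and
    from the model after a few fine-tuning epochs (T_e).
    """
    num_layers = len(feat_pre)
    num_samples = feat_pre[0].shape[0]
    votes = np.zeros(num_samples)
    for l in range(num_layers):
        # squared L2 deviation per sample at layer l
        dev = np.sum((feat_pre[l] - feat_early[l]) ** 2, axis=1)
        # normalize deviations within the layer (illustrative z-score)
        z = (dev - dev.mean()) / (dev.std() + 1e-8)
        votes += (z > alpha1)
    # flag samples that exceeded the per-layer threshold in enough layers
    return votes > alpha2 * num_layers
```

On synthetic data where a few samples drift strongly after early fine-tuning (mimicking the faster convergence on watermarked samples), those samples are the ones flagged.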
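The token-ablation search of stage 2 can be sketched similarly. Here `feature_fn` is a hypothetical stand-in for extracting the fine-tuned model's intermediate features for a prompt, and the μ + σ outlier rule follows the statistical thresholding described above:

```python
import numpy as np
from collections import Counter

def identify_trigger(prompts, feature_fn, k=1.0):
    """Word-level ablation: find the token whose removal most shifts features.

    prompts: list of token lists (detected watermarked prompts).
    feature_fn(tokens) -> 1-D feature vector (stand-in for intermediate
    U-Net features of the fully fine-tuned model).
    """
    counts = Counter()
    for tokens in prompts:
        full = feature_fn(tokens)
        # squared L2 shift caused by ablating each token in turn
        shifts = np.array([
            np.sum((full - feature_fn(tokens[:i] + tokens[i + 1:])) ** 2)
            for i in range(len(tokens))
        ])
        # outlier rule: deviation greater than mean + k * std
        thresh = shifts.mean() + k * shifts.std()
        for i, s in enumerate(shifts):
            if s > thresh:
                counts[tokens[i]] += 1
    # most frequent outlier token across prompts is taken as the trigger
    return counts.most_common(1)[0][0]
```

With a toy `feature_fn` in which one token (e.g., a rare identifier like "sks") dominates the features, that token is recovered as the trigger across prompts.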
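Stage 3's closed-form weight update can be sketched as a generic ridge regression over a cross-attention projection matrix, in the spirit of closed-form concept-erasure methods [31]; the function name, `lam`, and the embedding arguments are illustrative, not the paper's exact formulation:

```python
import numpy as np

def erase_trigger(W_old, src_embeds, tgt_embeds, preserve_embeds, lam=1e-3):
    """Closed-form edit of a cross-attention projection matrix W (out x in).

    Remaps each trigger-bearing embedding c_src to the output the old
    weights produce for its benign counterpart c_tgt, while keeping the
    mapping on preserve_embeds (benign prompts) close to W_old.
    Minimizes sum ||W c - v||^2 + lam * ||W - W_old||_F^2, which has the
    closed-form solution W = B A^{-1} below.
    """
    d = W_old.shape[1]
    A = lam * np.eye(d)
    B = lam * W_old
    for c_src, c_tgt in zip(src_embeds, tgt_embeds):
        v_target = W_old @ c_tgt          # desired output: benign mapping
        A += np.outer(c_src, c_src)
        B += np.outer(v_target, c_src)
    for c in preserve_embeds:             # keep benign behavior intact
        A += np.outer(c, c)
        B += np.outer(W_old @ c, c)
    return B @ np.linalg.inv(A)
```

Because the update is a linear solve rather than gradient-based re-training, it is cheap: the edited matrix maps the trigger embedding (approximately) to the benign output while leaving preserved embeddings nearly unchanged.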

Extensive experiments on Stable Diffusion v1.4 fine-tuned on the Pokemon, Ossaili, and Pranked03 datasets, against four DOV methods (BadT2I-L, BadT2I-G, Rickrolling, VD), demonstrate CEAT2I's effectiveness. Compared to baseline attacks, CEAT2I consistently achieves significantly lower Watermark Success Rates (WSR), indicating successful watermark neutralization (average WSR below 3% across datasets at a 20% watermarking rate), while maintaining high CLIP similarity scores, demonstrating minimal impact on benign generation quality (CLIP scores similar to or slightly lower than those of the fine-tuned model without attack). Notably, CEAT2I outperforms T2IShield, particularly against local patch watermarks, highlighting its robustness to different watermark types. Ablation studies confirm the effectiveness of the proposed detection and identification mechanisms and show resilience to variations in watermarking rate and trigger position. The method also shows resistance to a simple adaptive defense strategy.

While highly effective, CEAT2I introduces computational overhead during the detection and identification phases due to feature extraction and token ablation. Its current design is specific to T2I diffusion models and may require adaptation for other generative AI modalities. The authors explicitly state that CEAT2I is for research purposes only, intended to stress-test existing DOV methods and encourage the development of more robust data protection techniques for generative models.

The practical implication of this work is twofold: it provides a concrete method for potential attackers to circumvent existing backdoor-based DOV in T2I models, and more importantly, it serves as a valuable red-teaming tool for researchers to evaluate and improve the security of future DOV systems.
