Watermark Channel Gradient Leaking Queries
- The paper demonstrates that gradient leakage from the watermark decoder enables attackers to train removal networks that effectively erase watermarks.
- It introduces Decoder Gradient Shields (DGS-O, DGS-I, DGS-L) that reorient and perturb gradients, ensuring watermark extraction remains verifiable.
- Experimental results on image denoising and text-to-image tasks confirm that DGS methods maintain high-fidelity outputs with minimal computational overhead.
Watermark channel gradient leaking queries refer to a class of attacks against box-free model watermarking schemes in which adversaries exploit gradient information from the watermark decoder to train a watermark remover network. Box-free watermarking architectures use an encoder and decoder pair to embed a high-entropy watermark image into a model’s output and to extract it, respectively, for purposes of intellectual property protection. While prior robustness improvements in watermarking have targeted encoder resilience, the decoder component exposes a vulnerability when its gradients can be queried via its API, allowing removal attacks to adaptively erase embedded watermarks without compromising output fidelity (An et al., 17 Jan 2026, An et al., 28 Feb 2025).
1. Box-Free Model Watermarking Framework
Box-free watermarking schemes are characterized by their application to generative models with high-entropy outputs, where watermark embedding is performed through an encoder $E$ and verification via a decoder $D$. Let $x$ denote the input image, $G$ a protected image-to-image model, and $w$ a binary watermark image. The image pipeline is formalized as

$$\hat{y} = E(G(x)), \qquad \hat{w} = D(\hat{y}).$$

Watermarked outputs $\hat{y}$ are returned through a black-box API, while the decoder $D$ is exposed for ownership verification. The decoder receives queries to extract $\hat{w}$ from candidate images, and its gradients become available upon query.
The owner’s training objective minimizes a combination of embedding loss and fidelity loss:

$$\min_{E,\,D} \; \mathcal{L} = \mathcal{L}_{\mathrm{emb}} + \lambda\,\mathcal{L}_{\mathrm{fid}}, \qquad \mathcal{L}_{\mathrm{emb}} = \big\| D(E(G(x))) - w \big\|_2^2, \qquad \mathcal{L}_{\mathrm{fid}} = \big\| E(G(x)) - G(x) \big\|_2^2,$$

where $\mathcal{L}_{\mathrm{emb}}$ enforces correct watermark recovery and $\mathcal{L}_{\mathrm{fid}}$ preserves visual integrity. The decoder’s joint training with the encoder results in gradient information that, if exposed, leaks critical watermark-removal directions.
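The pipeline and objective above can be sketched numerically. The following is a minimal NumPy sketch with toy linear stand-ins for $G$, $E$, and $D$; all names, shapes, and weights here are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 16                                          # flattened image dimension (toy)
Gx = rng.normal(size=d)                         # model output G(x)
w = rng.integers(0, 2, size=d).astype(float)    # binary watermark image w

W_enc = 0.01 * rng.normal(size=(d, d))          # toy encoder residual weights
W_dec = rng.normal(size=(d, d))                 # toy decoder weights

def encode(y):
    return y + W_enc @ y                        # watermarked output E(G(x))

def decode(y_hat):
    return W_dec @ y_hat                        # extracted watermark D(y_hat)

y_hat = encode(Gx)
lam = 0.1                                       # fidelity trade-off weight lambda
L_emb = np.sum((decode(y_hat) - w) ** 2)        # watermark-recovery term
L_fid = np.sum((y_hat - Gx) ** 2)               # fidelity term
L_total = L_emb + lam * L_fid
```

In an actual scheme both networks are deep and trained jointly on this objective; the sketch only shows how the two loss terms interact.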
2. Gradient-Based Watermark Removal Attacks
The central vulnerability in box-free watermarking arises when the decoder’s API leaks gradients in response to input queries, enabling attackers to perform gradient-based optimization for watermark erasure. Given queries $\{\hat{y}_i\}$, an adversary trains a remover network $R_\theta$ to minimize

$$\mathcal{L}_R = \big\| D(R_\theta(\hat{y})) - w_0 \big\|_2^2 + \beta \big\| R_\theta(\hat{y}) - \hat{y} \big\|_2^2,$$

where $w_0$ is the null watermark.
Critically, the attacker obtains the gradient of the extraction term via the chain rule:

$$\nabla_\theta \mathcal{L}_R = \frac{\partial \mathcal{L}_R}{\partial \hat{w}} \cdot \frac{\partial \hat{w}}{\partial z} \cdot \frac{\partial z}{\partial \theta},$$

with $z = R_\theta(\hat{y})$ and $\hat{w} = D(z)$; the middle factor is the leaked decoder Jacobian. This feedback is sufficient for $R_\theta$ to be optimized such that watermark evidence is suppressed while maintaining output fidelity. Baseline removal attacks demonstrate rapid convergence of the removal loss to near-zero, rendering the watermark unverifiable.
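To see why leaked decoder gradients suffice, consider a toy removal attack: a linear "decoder" whose Jacobian is available and the simplest possible "remover" (an additive edit $r$). Plain gradient descent then drives the extracted watermark toward the null watermark $w_0 = 0$. All components are hypothetical stand-ins, not the attack network from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
W_dec = np.eye(d) + 0.1 * rng.normal(size=(d, d))  # toy, well-conditioned decoder
w = rng.integers(0, 2, size=d).astype(float)       # embedded binary watermark
y_hat = np.linalg.solve(W_dec, w)                  # watermarked output: D(y_hat) = w

def decode(z):
    return W_dec @ z

def removal_loss(r):
    z = y_hat + r
    # Null watermark w0 = 0; beta keeps the edit small (fidelity term).
    return np.sum(decode(z) ** 2) + beta * np.sum(r ** 2)

beta, lr = 0.05, 0.1
r = np.zeros(d)                                    # "remover": an additive edit
loss0 = removal_loss(r)
for _ in range(300):
    z = y_hat + r
    grad = 2 * W_dec.T @ decode(z) + 2 * beta * r  # uses the leaked Jacobian W_dec
    r -= lr * grad
loss_final = removal_loss(r)
```

The removal loss drops far below its starting value, i.e., the watermark evidence $\|D(z)\|$ is largely erased while the edit $r$ stays small, mirroring the attack's behavior against an unshielded decoder.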
3. Family of Decoder Gradient Shields (DGSs)
Decoder Gradient Shields (DGSs) are defense mechanisms implemented at various points in the decoder pipeline to provably obstruct gradient-based watermark removal without affecting legitimate verification or output quality (An et al., 17 Jan 2026, An et al., 28 Feb 2025). The DGS variants include:
- DGS-O (Output Shield): A closed-form linear transform is applied to the decoder output. Given $\hat{w} = D(z)$, the transformation is

$$\tilde{w} = w - A(\hat{w} - w),$$

where $A$ is a positive-definite matrix with small eigenvalues and the transformation is applied when the normalized cross-correlation $\mathrm{NC}(\hat{w}, w)$ exceeds a detection threshold $\tau$. The backpropagated gradient becomes

$$\frac{\partial \tilde{w}}{\partial z} = -A\,\frac{\partial \hat{w}}{\partial z}.$$

The shield provably reorients (sign-flips) and rescales gradients, reducing the effective step size and causing the attacker's optimization to fail.
- DGS-I (Input Shield): An adversarial perturbation $\delta$ is injected at the decoder input, orthogonal to the true gradient:

$$\tilde{z} = z + \delta, \qquad \langle \delta, \nabla_z \mathcal{L}_R(z) \rangle = 0, \qquad \|\delta\|_2 \le \epsilon,$$

and

$$\mathcal{L}_R(\tilde{z}) = \mathcal{L}_R(z) + \langle \nabla_z \mathcal{L}_R(z), \delta \rangle + O(\|\delta\|^2) = \mathcal{L}_R(z) + O(\|\delta\|^2).$$

The first-order Taylor term in the loss expansion vanishes, yielding only higher-order gradient interference and randomized directions across queries.
- DGS-L (Layer Shield): A perturbation $\delta_l$ is injected within an intermediate decoder layer with activation $h_l$:

$$\tilde{h}_l = h_l + \delta_l, \qquad \langle \delta_l, \nabla_{h_l} \mathcal{L}_R \rangle = 0,$$

with $\delta_l$ orthogonal to the removal-loss gradient. DGS-L offers computational efficiency and conceals the perturbation locus from attackers.
4. Theoretical Guarantees of Defense
The efficacy of DGS mechanisms is formally guaranteed:
- Non-Convergence under DGS-O: For any gradient-based removal network optimized on the shielded gradient $\tilde{g}_t = -A g_t$, where $g_t$ is the true gradient at iteration $t$, for all iterations $t$:

$$\langle \tilde{g}_t, g_t \rangle = -\,g_t^\top A\, g_t \le -c\,\|g_t\|^2 < 0$$

for some constant $c > 0$ (the smallest eigenvalue of $A$). The descent direction is reversed and the gradient magnitude suppressed, precluding convergence to watermark-free solutions.
- Orthogonality-Induced Interference (DGS-I/DGS-L): With perturbations orthogonal to the gradient, the first-order term in the loss is zero and only residual noise remains, rendering gradient-based optimization high-variance and ineffective.
- Fidelity Preservation: When the eigenvalues of $A$ are chosen very small, $\tilde{w} = w - A(\hat{w} - w) \approx w$, so legitimate extraction of the watermark remains unaffected up to negligible distortion.
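The orthogonality argument behind DGS-I and DGS-L is easy to verify numerically: projecting a random perturbation onto the orthogonal complement of the gradient makes the first-order Taylor term vanish exactly. A sketch with an arbitrary stand-in gradient:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 16
g = rng.normal(size=d)        # removal-loss gradient at the injection point
noise = rng.normal(size=d)

# Remove the component of the noise along g, then rescale to a budget epsilon.
delta = noise - (noise @ g) / (g @ g) * g
delta *= 0.1 / np.linalg.norm(delta)   # epsilon = 0.1 (illustrative)

first_order = g @ delta                # <grad, delta>: zero up to float error
```

Because the first-order term is exactly zero, only $O(\|\delta\|^2)$ interference reaches the attacker, and resampling `noise` per query randomizes that residue.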
5. Experimental Evaluation and Practical Impact
Evaluation on denoising (image deraining, PASCAL VOC) and text-to-image generation (Stable Diffusion) tasks demonstrates the robust defense properties and minimal overhead of DGS variants. Metrics reported include PSNR (Peak Signal-to-Noise Ratio), MS-SSIM (Multi-Scale Structural Similarity), and defense success rate (SR; the fraction of images from which the watermark $w$ is still extractable post-attack). Results show:
| Task | Method | PSNR↑ | MS-SSIM↑ | SR↑ |
|---|---|---|---|---|
| Deraining | No DGS | 37.95 | 0.9968 | 0% |
| Deraining | DGS-O | 37.57 | 0.9959 | 100% |
| Deraining | DGS-I | 37.52 | 0.9958 | 100% |
| Deraining | DGS-L | 37.54 | 0.9959 | 100% |
| Text-to-Image | No DGS | 29.61 | 0.9806 | 0% |
| Text-to-Image | DGS-O | 29.25 | 0.9794 | 100% |
| Text-to-Image | DGS-I | 29.22 | 0.9792 | 100% |
| Text-to-Image | DGS-L | 29.24 | 0.9793 | 100% |
Empirical observations include:
- Watermark removal attacks without DGS rapidly converge to zero loss, eliminating watermark evidence.
- DGS variants ensure the attack loss stalls far from zero, maintaining the watermark extraction rate at 100% under all tested conditions.
- Visual fidelity loss is negligible (PSNR drop <0.3 dB, MS-SSIM loss <0.001).
- Computational overhead per query is minimal: DGS-O (0.08 ms), DGS-I (0.07 s), and DGS-L (0.047 s) on RTX 5000 Ada GPU.
6. Extensions, Open Directions, and Significance
The closed-form gradient-reorientation and minimal perturbation principles underlying DGS methods are broadly applicable to any box-free watermarking or generative model pipeline susceptible to gradient-leaking attacks. Practical implementation involves:
- Identifying the trusted reference output $w$ (the registered watermark).
- Selecting an appropriate positive-definite matrix $A$ to ensure both non-descent and fidelity preservation.
- Integrating the shield transform $\tilde{w} = w - A(\hat{w} - w)$ into the decoder's query path.
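Putting these steps together, a deployment-side sketch might wrap each decoder query as follows; the NC threshold, matrix choice, and engagement condition are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 16
w = rng.integers(0, 2, size=d).astype(float)   # trusted registered watermark
A = 0.01 * np.eye(d)                           # chosen positive-definite shield matrix
tau = 0.9                                      # NC detection threshold (assumed)

def nc(a, b):
    """Normalized cross-correlation between two flattened watermarks."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def shielded_query(w_hat):
    """Return the decoder output, routed through the DGS-O transform whenever
    the extracted watermark matches the registered one."""
    if nc(w_hat, w) >= tau:
        return w - A @ (w_hat - w)   # shield engaged: gradients reoriented
    return w_hat                     # non-matching extractions pass through

w_hat = w + 0.05 * rng.normal(size=d)          # legitimate extraction, NC near 1
out = shielded_query(w_hat)                    # value close to w_hat; gradients shielded
```

Legitimate verifiers see an output negligibly different from the raw extraction, while any gradient an attacker backpropagates through `shielded_query` picks up the $-A$ factor.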
Extensions could include adaptive selection of the shield matrix $A$, multi-stage shields to prevent reverse-engineering, and randomized smoothing to counter gradient-inversion and sign-flip attacks.
This suggests that future research could focus on compounded shielding techniques and the adaptation of DGS methodology to other API-based intellectual property protection frameworks, particularly where gradient leakage remains a security concern (An et al., 17 Jan 2026, An et al., 28 Feb 2025).
7. Context and Implications
Watermark channel gradient leaking queries leverage a precise and currently exploitable vulnerability in watermark decoder APIs, enabling adversaries to train high-fidelity removal networks even in the presence of robust encoders. Decoder Gradient Shields represent the first theoretically provable and practically validated defense, averting effective gradient exploitation while preserving utility for legitimate users and proprietors. The DGS paradigm reframes adversarial gradient manipulation from an attack surface to a robust defensive tool suitable for entrenched and emerging applications in model intellectual property protection.