Watermark Channel Gradient Leaking Queries
- The paper demonstrates that gradient leakage from the watermark decoder enables attackers to train removal networks that effectively erase watermarks.
- It introduces Decoder Gradient Shields (DGS-O, DGS-I, DGS-L) that reorient and perturb gradients, ensuring watermark extraction remains verifiable.
- Experimental results on image denoising and text-to-image tasks confirm that DGS methods maintain high-fidelity outputs with minimal computational overhead.
Watermark channel gradient leaking queries refer to a class of attacks against box-free model watermarking schemes in which adversaries exploit gradient information from the watermark decoder to train a watermark remover network. Box-free watermarking architectures use an encoder and decoder pair to embed a high-entropy watermark image into a model’s output and to extract it, respectively, for purposes of intellectual property protection. While prior robustness improvements in watermarking have targeted encoder resilience, the decoder component exposes a vulnerability when its gradients can be queried via its API, allowing removal attacks to adaptively erase embedded watermarks without compromising output fidelity (An et al., 17 Jan 2026, An et al., 28 Feb 2025).
1. Box-Free Model Watermarking Framework
Box-free watermarking schemes are characterized by their application to generative models with high-entropy outputs, where watermark embedding is performed through an encoder $E$ and verification via a decoder $D$. Let $x$ denote the input image, $G$ a protected image-to-image model, and $w$ a binary watermark image. The image pipeline is formalized as

$$\hat{y} = E(G(x)), \qquad \hat{w} = D(\hat{y}).$$

Watermarked outputs $\hat{y}$ are returned through a black-box API, while the decoder $D$ is exposed for ownership verification. The decoder receives queries to extract $\hat{w}$ from candidate images, and its gradients become available upon query.
The owner’s training objective minimizes a combination of embedding loss and fidelity loss:

$$\min_{E,\,D} \; \mathcal{L} = \mathcal{L}_{\mathrm{emb}} + \lambda\,\mathcal{L}_{\mathrm{fid}}, \qquad \mathcal{L}_{\mathrm{emb}} = \big\| D(E(G(x))) - w \big\|_2^2, \qquad \mathcal{L}_{\mathrm{fid}} = \big\| E(G(x)) - G(x) \big\|_2^2,$$

where $\mathcal{L}_{\mathrm{emb}}$ enforces correct watermark recovery and $\mathcal{L}_{\mathrm{fid}}$ preserves visual integrity. The decoder’s joint training with the encoder results in gradient information that, if exposed, leaks critical watermark-removal directions.
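The pipeline and objective above can be sketched numerically. The following is a minimal NumPy sketch with toy linear stand-ins for $G$, $E$, and $D$; all names, shapes, and weights here are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 16                                          # flattened image dimension (toy)
Gx = rng.normal(size=d)                         # model output G(x)
w = rng.integers(0, 2, size=d).astype(float)    # binary watermark image w

W_enc = 0.01 * rng.normal(size=(d, d))          # toy encoder residual weights
W_dec = rng.normal(size=(d, d))                 # toy decoder weights

def encode(y):
    return y + W_enc @ y                        # watermarked output E(G(x))

def decode(y_hat):
    return W_dec @ y_hat                        # extracted watermark D(y_hat)

y_hat = encode(Gx)
lam = 0.1                                       # fidelity trade-off weight lambda
L_emb = np.sum((decode(y_hat) - w) ** 2)        # watermark-recovery term
L_fid = np.sum((y_hat - Gx) ** 2)               # fidelity term
L_total = L_emb + lam * L_fid
```

In an actual scheme both networks are deep and trained jointly on this objective; the sketch only shows how the two loss terms interact.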
2. Gradient-Based Watermark Removal Attacks
The central vulnerability in box-free watermarking arises when the decoder’s API leaks gradients in response to input queries, enabling attackers to perform gradient-based optimization for watermark erasure. Given queries $\{\hat{y}_i\}$, an adversary trains a remover network $R_\theta$ to minimize

$$\mathcal{L}_R = \big\| D(R_\theta(\hat{y})) - w_0 \big\|_2^2 + \beta \big\| R_\theta(\hat{y}) - \hat{y} \big\|_2^2,$$

where $w_0$ is the null watermark.
Critically, the attacker obtains the gradient of the extraction term via the chain rule:

$$\nabla_\theta \mathcal{L}_R = \frac{\partial \mathcal{L}_R}{\partial \hat{w}} \cdot \frac{\partial \hat{w}}{\partial z} \cdot \frac{\partial z}{\partial \theta},$$

with $z = R_\theta(\hat{y})$ and $\hat{w} = D(z)$; the middle factor is the leaked decoder Jacobian. This feedback is sufficient for $R_\theta$ to be optimized such that watermark evidence is suppressed while maintaining output fidelity. Baseline removal attacks demonstrate rapid convergence of the removal loss to near-zero, rendering the watermark unverifiable.
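To see why leaked decoder gradients suffice, consider a toy removal attack: a linear "decoder" whose Jacobian is available and the simplest possible "remover" (an additive edit $r$). Plain gradient descent then drives the extracted watermark toward the null watermark $w_0 = 0$. All components are hypothetical stand-ins, not the attack network from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
W_dec = np.eye(d) + 0.1 * rng.normal(size=(d, d))  # toy, well-conditioned decoder
w = rng.integers(0, 2, size=d).astype(float)       # embedded binary watermark
y_hat = np.linalg.solve(W_dec, w)                  # watermarked output: D(y_hat) = w

def decode(z):
    return W_dec @ z

def removal_loss(r):
    z = y_hat + r
    # Null watermark w0 = 0; beta keeps the edit small (fidelity term).
    return np.sum(decode(z) ** 2) + beta * np.sum(r ** 2)

beta, lr = 0.05, 0.1
r = np.zeros(d)                                    # "remover": an additive edit
loss0 = removal_loss(r)
for _ in range(300):
    z = y_hat + r
    grad = 2 * W_dec.T @ decode(z) + 2 * beta * r  # uses the leaked Jacobian W_dec
    r -= lr * grad
loss_final = removal_loss(r)
```

The removal loss drops far below its starting value, i.e., the watermark evidence $\|D(z)\|$ is largely erased while the edit $r$ stays small, mirroring the attack's behavior against an unshielded decoder.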
3. Family of Decoder Gradient Shields (DGSs)
Decoder Gradient Shields (DGSs) are defense mechanisms implemented at various points in the decoder pipeline to provably obstruct gradient-based watermark removal without affecting legitimate verification or output quality (An et al., 17 Jan 2026, An et al., 28 Feb 2025). The DGS variants include:
- DGS-O (Output Shield): A closed-form linear transform is applied to the decoder output. Given $\hat{w} = D(z)$, the transformation is

$$\tilde{w} = w - A(\hat{w} - w),$$

where $A$ is a positive-definite matrix with small eigenvalues and the transformation is applied when the normalized cross-correlation $\mathrm{NC}(\hat{w}, w)$ exceeds a detection threshold $\tau$. The backpropagated gradient becomes

$$\frac{\partial \tilde{w}}{\partial z} = -A\,\frac{\partial \hat{w}}{\partial z}.$$

The shield provably reorients (sign-flips) and rescales gradients, reducing the effective step size and causing the attacker's optimization to fail.
- DGS-I (Input Shield): An adversarial perturbation $\delta$ is injected at the decoder input, orthogonal to the true gradient:

$$\tilde{z} = z + \delta, \qquad \langle \delta, \nabla_z \mathcal{L}_R(z) \rangle = 0, \qquad \|\delta\|_2 \le \epsilon,$$

and

$$\mathcal{L}_R(\tilde{z}) = \mathcal{L}_R(z) + \langle \nabla_z \mathcal{L}_R(z), \delta \rangle + O(\|\delta\|^2) = \mathcal{L}_R(z) + O(\|\delta\|^2).$$

The first-order Taylor term in the loss expansion vanishes, yielding only higher-order gradient interference and randomized directions across queries.
- DGS-L (Layer Shield): A perturbation $\delta_l$ is injected within an intermediate decoder layer with activation $h_l$:

$$\tilde{h}_l = h_l + \delta_l, \qquad \langle \delta_l, \nabla_{h_l} \mathcal{L}_R \rangle = 0,$$

with $\delta_l$ orthogonal to the removal-loss gradient. DGS-L offers computational efficiency and conceals the perturbation locus from attackers.
4. Theoretical Guarantees of Defense
The efficacy of DGS mechanisms is formally guaranteed:
- Non-Convergence under DGS-O: For any gradient-based removal network optimized on the shielded gradient $\tilde{g}_t = -A g_t$, where $g_t$ is the true gradient at iteration $t$, for all iterations $t$:

$$\langle \tilde{g}_t, g_t \rangle = -\,g_t^\top A\, g_t \le -c\,\|g_t\|^2 < 0$$

for some constant $c > 0$ (the smallest eigenvalue of $A$). The descent direction is reversed and the gradient magnitude suppressed, precluding convergence to watermark-free solutions.
- Orthogonality-Induced Interference (DGS-I/DGS-L): With perturbations orthogonal to the gradient, the first-order term in the loss is zero and only residual noise remains, rendering gradient-based optimization high-variance and ineffective.
- Fidelity Preservation: When the eigenvalues of $A$ are chosen very small, $\tilde{w} = w - A(\hat{w} - w) \approx w$, so legitimate extraction of the watermark remains unaffected up to negligible distortion.
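The orthogonality argument behind DGS-I and DGS-L is easy to verify numerically: projecting a random perturbation onto the orthogonal complement of the gradient makes the first-order Taylor term vanish exactly. A sketch with an arbitrary stand-in gradient:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 16
g = rng.normal(size=d)        # removal-loss gradient at the injection point
noise = rng.normal(size=d)

# Remove the component of the noise along g, then rescale to a budget epsilon.
delta = noise - (noise @ g) / (g @ g) * g
delta *= 0.1 / np.linalg.norm(delta)   # epsilon = 0.1 (illustrative)

first_order = g @ delta                # <grad, delta>: zero up to float error
```

Because the first-order term is exactly zero, only $O(\|\delta\|^2)$ interference reaches the attacker, and resampling `noise` per query randomizes that residue.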
5. Experimental Evaluation and Practical Impact
Evaluation on denoising (image deraining, PASCAL VOC) and text-to-image generation (Stable Diffusion) tasks demonstrates the robust defense properties and minimal overhead of DGS variants. Metrics reported include PSNR (Peak Signal-to-Noise Ratio), MS-SSIM (Multi-Scale Structural Similarity), and defense success rate (SR; the fraction of images from which the watermark $w$ is still extractable post-attack). Results show:
| Task | Method | PSNR↑ | MS-SSIM↑ | SR↑ |
|---|---|---|---|---|
| Deraining | No DGS | 37.95 | 0.9968 | 0% |
| Deraining | DGS-O | 37.57 | 0.9959 | 100% |
| Deraining | DGS-I | 37.52 | 0.9958 | 100% |
| Deraining | DGS-L | 37.54 | 0.9959 | 100% |
| Text-to-Image | No DGS | 29.61 | 0.9806 | 0% |
| Text-to-Image | DGS-O | 29.25 | 0.9794 | 100% |
| Text-to-Image | DGS-I | 29.22 | 0.9792 | 100% |
| Text-to-Image | DGS-L | 29.24 | 0.9793 | 100% |
Empirical observations include:
- Watermark removal attacks without DGS rapidly converge to zero loss, eliminating watermark evidence.
- DGS variants ensure the attack loss stalls far from zero, maintaining the watermark extraction rate at 100% under all tested conditions.
- Visual fidelity loss is negligible (PSNR drop <0.3 dB, MS-SSIM loss <0.001).
- Computational overhead per query is minimal: DGS-O (0.08 ms), DGS-I (0.07 s), and DGS-L (0.047 s) on RTX 5000 Ada GPU.
6. Extensions, Open Directions, and Significance
The closed-form gradient-reorientation and minimal perturbation principles underlying DGS methods are broadly applicable to any box-free watermarking or generative model pipeline susceptible to gradient-leaking attacks. Practical implementation involves:
- Identifying the trusted reference output $w$ (the registered watermark).
- Selecting an appropriate positive-definite matrix $A$ to ensure both non-descent and fidelity preservation.
- Integrating the shield transform $\tilde{w} = w - A(\hat{w} - w)$ into the decoder's query path.
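Putting these steps together, a deployment-side sketch might wrap each decoder query as follows; the NC threshold, matrix choice, and engagement condition are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 16
w = rng.integers(0, 2, size=d).astype(float)   # trusted registered watermark
A = 0.01 * np.eye(d)                           # chosen positive-definite shield matrix
tau = 0.9                                      # NC detection threshold (assumed)

def nc(a, b):
    """Normalized cross-correlation between two flattened watermarks."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def shielded_query(w_hat):
    """Return the decoder output, routed through the DGS-O transform whenever
    the extracted watermark matches the registered one."""
    if nc(w_hat, w) >= tau:
        return w - A @ (w_hat - w)   # shield engaged: gradients reoriented
    return w_hat                     # non-matching extractions pass through

w_hat = w + 0.05 * rng.normal(size=d)          # legitimate extraction, NC near 1
out = shielded_query(w_hat)                    # value close to w_hat; gradients shielded
```

Legitimate verifiers see an output negligibly different from the raw extraction, while any gradient an attacker backpropagates through `shielded_query` picks up the $-A$ factor.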
Extensions could include adaptive selection of the shield matrix $A$, multi-stage shields to prevent reverse-engineering, and randomized smoothing to counter gradient-inversion and sign-flip attacks.
This suggests that future research could focus on compounded shielding techniques and the adaptation of DGS methodology to other API-based intellectual property protection frameworks, particularly where gradient leakage remains a security concern (An et al., 17 Jan 2026, An et al., 28 Feb 2025).
7. Context and Implications
Watermark channel gradient leaking queries leverage a precise and currently exploitable vulnerability in watermark decoder APIs, enabling adversaries to train high-fidelity removal networks even in the presence of robust encoders. Decoder Gradient Shields represent the first theoretically provable and practically validated defense, averting effective gradient exploitation while preserving utility for legitimate users and proprietors. The DGS paradigm reframes adversarial gradient manipulation from an attack surface to a robust defensive tool suitable for entrenched and emerging applications in model intellectual property protection.