Watermark Channel Gradient Leaking Queries

Updated 24 January 2026
  • The paper demonstrates that gradient leakage from the watermark decoder enables attackers to train removal networks that effectively erase watermarks.
  • It introduces Decoder Gradient Shields (DGS-O, DGS-I, DGS-L) that reorient and perturb gradients, ensuring watermark extraction remains verifiable.
  • Experimental results on image denoising and text-to-image tasks confirm that DGS methods maintain high-fidelity outputs with minimal computational overhead.

Watermark channel gradient leaking queries refer to a class of attacks against box-free model watermarking schemes in which adversaries exploit gradient information from the watermark decoder to train a watermark remover network. Box-free watermarking architectures use an encoder and decoder pair to embed a high-entropy watermark image $W$ into a model's output and extract it, respectively, for purposes of intellectual property protection. While prior robustness improvements in watermarking have targeted encoder resilience, the decoder exposes a vulnerability when its gradients can be obtained through API queries, allowing removal attacks to adaptively erase embedded watermarks without compromising output fidelity (An et al., 17 Jan 2026, An et al., 28 Feb 2025).

1. Box-Free Model Watermarking Framework

Box-free watermarking schemes are characterized by their application to generative models with high-entropy outputs, where watermark embedding is performed by an encoder $\mathbb{E}$ and verification by a decoder $\mathbb{D}$. Let $X_0$ denote the input image, $\mathbb{M}$ a protected image-to-image model, and $W$ a binary watermark. The image pipeline is formalized as:

$$Y = \mathbb{E}(\mathbb{M}(X_0), W)$$

Watermarked outputs $Y$ are returned through a black-box API, while the decoder $\mathbb{D}$ is exposed for ownership verification. The decoder receives queries to extract $W$ from images, and its gradients become available upon query.

The owner's training objective minimizes a combination of embedding loss and fidelity loss:

$$\mathcal{L}_{\mathrm{Victim}} = \alpha_1\,\mathcal{L}_{\mathrm{Embed}} + \alpha_2\,\mathcal{L}_{\mathrm{Fidelity}}$$

where $\mathcal{L}_{\mathrm{Embed}}$ enforces correct watermark recovery and $\mathcal{L}_{\mathrm{Fidelity}}$ preserves visual integrity. Because the decoder is trained jointly with the encoder, its gradient information, if exposed, leaks directions that are directly useful for watermark removal.
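
As a concrete illustration, the PyTorch sketch below wires up this pipeline and objective. The architectures, shapes, the way $\mathbb{E}$ consumes $W$ (channel concatenation), and the weights $\alpha_1 = \alpha_2 = 1$ are placeholder assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins for the components M (protected image-to-image
# model), E (watermark encoder), and D (watermark decoder). Simple conv
# nets suffice to illustrate the training objective.
class ConvNet(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_ch, 3, padding=1), nn.Sigmoid(),
        )
    def forward(self, x):
        return self.body(x)

M = ConvNet(3, 3)  # protected model (treated as given here)
E = ConvNet(6, 3)  # encoder: embeds W into M's output
D = ConvNet(3, 1)  # decoder: extracts W from an image

X0 = torch.rand(4, 3, 64, 64)                   # input images
W = (torch.rand(4, 1, 64, 64) > 0.5).float()    # binary watermark
alpha1, alpha2 = 1.0, 1.0                       # assumed loss weights

# Y = E(M(X0), W): here E consumes M's output concatenated with the
# (channel-broadcast) watermark -- one plausible embedding interface.
Y = E(torch.cat([M(X0), W.expand(-1, 3, -1, -1)], dim=1))

embed_loss = F.mse_loss(D(Y), W)      # L_Embed: correct recovery of W
fidelity_loss = F.mse_loss(Y, M(X0))  # L_Fidelity: visual integrity
loss = alpha1 * embed_loss + alpha2 * fidelity_loss
loss.backward()  # E and D train jointly; D's gradients are the leak risk
```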

2. Gradient-Based Watermark Removal Attacks

The central vulnerability in box-free watermarking arises when the decoder's API leaks gradients in response to input queries, enabling attackers to perform gradient-based optimization for watermark erasure. Given query pairs $(X_0, Y)$, an adversary trains a remover network $\mathbb{R}$ to minimize:

$$\mathcal{L}_{\mathrm{Attack}} = \beta_1\,\|\mathbb{D}[\mathbb{R}(Y)] - W_0\|_2^2 + \beta_2\,\|\mathbb{R}(Y) - Y\|_2^2$$

where $W_0$ is the null watermark.

Critically, the attacker obtains the gradient:

$$g = \nabla_{\mathbb{R}(Y)}\,\|\mathbb{D}(\mathbb{R}(Y)) - W_0\|_2^2 = 2\,(Z - W_0)^\top \frac{\partial Z}{\partial \mathbb{R}(Y)}$$

with $Z = \mathbb{D}[\mathbb{R}(Y)]$. This feedback is sufficient to optimize $\mathbb{R}$ so that watermark evidence is suppressed while output fidelity is maintained. Baseline removal attacks demonstrate rapid convergence of the removal loss to near zero, rendering the watermark unverifiable.
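
A minimal sketch of such an attack, assuming the decoder $\mathbb{D}$ is exposed as a differentiable module (the gradient-leaking setting described above); the toy architectures and the weights $\beta_1 = \beta_2 = 1$ are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical decoder D whose API leaks gradients, and remover R.
D = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())
R = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(16, 3, 3, padding=1), nn.Sigmoid())
for p in D.parameters():
    p.requires_grad_(False)  # attacker can only query D, not modify it

Y = torch.rand(4, 3, 64, 64)       # watermarked outputs from the API
W0 = torch.zeros(4, 1, 64, 64)     # null watermark
beta1, beta2 = 1.0, 1.0            # assumed attack loss weights

opt = torch.optim.Adam(R.parameters(), lr=1e-3)
for step in range(100):
    opt.zero_grad()
    RY = R(Y)
    # L_Attack = beta1*||D(R(Y)) - W0||^2 + beta2*||R(Y) - Y||^2
    loss = (beta1 * F.mse_loss(D(RY), W0)
            + beta2 * F.mse_loss(RY, Y))
    loss.backward()  # descends along the leaked gradient g through D
    opt.step()
```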

3. Family of Decoder Gradient Shields (DGSs)

Decoder Gradient Shields (DGSs) are defense mechanisms implemented at various points in the decoder pipeline to provably obstruct gradient-based watermark removal without affecting legitimate verification or output quality (An et al., 17 Jan 2026, An et al., 28 Feb 2025). The DGS variants include:

  • DGS-O (Output Shield): A closed-form linear transform is applied to the decoder output. Given $Z = \mathbb{D}(S)$, the transformation is:

$$Z^* = -P Z + (P + I) W$$

where $P$ is a positive-definite matrix and the transformation is applied only when the normalized cross-correlation satisfies $\mathrm{NC}(Z, W) \geq 0.96$. The backpropagated gradient becomes:

$$g^* = -2\,(Z - W_0)^\top P\,\frac{\partial Z}{\partial \mathbb{R}(Y)}$$

The shield provably reorients and rescales gradients, reducing the effective step size and causing the attacker's optimization to fail (see the code sketch following this list).

  • DGS-I (Input Shield): An adversarial perturbation $\eta(S)$ is injected at the decoder input, orthogonal to the true gradient:

$$\tilde{S} = S + \eta(S), \qquad \|\eta(S)\|_\infty \leq \epsilon$$

and

$$\nabla_S \mathcal{L}_{\mathrm{Removal}}(S)^\top \eta(S) = 0$$

The first-order Taylor term in the loss expansion vanishes, leaving only higher-order gradient interference and randomized directions across queries.

  • DGS-L (Layer Shield): A perturbation $\eta$ is injected within an intermediate decoder layer $k$:

$$\widetilde{\mathbb{D}^{(k)}(S)} = \mathbb{D}^{(k)}(S) + \eta\!\left(\mathbb{D}^{(k)}(S)\right)$$

with orthogonality to the removal-loss gradient. DGS-L offers computational efficiency and conceals the perturbation locus from attackers.
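
A minimal PyTorch sketch of DGS-O and DGS-I, under stated assumptions: $P$ is taken as the scalar multiple $p\,I$ of the identity (any positive-definite $P$ works), and since the defender cannot observe the attacker's exact objective, DGS-I here projects random noise orthogonal to the gradient of a proxy removal loss $\|\mathbb{D}(S) - W_0\|_2^2$. The toy decoder and shapes are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def nc(Z, W, eps=1e-8):
    """Normalized cross-correlation between decoder output and watermark."""
    z, w = Z.flatten(1), W.flatten(1)
    return (z * w).sum(1) / (z.norm(dim=1) * w.norm(dim=1) + eps)

def dgs_o(Z, W, p=0.01, threshold=0.96):
    """DGS-O: closed-form output shield Z* = -P Z + (P + I) W, with
    P = p*I for simplicity; applied only to verified extractions."""
    verified = (nc(Z, W) >= threshold).view(-1, 1, 1, 1).float()
    Z_star = -p * Z + (p + 1.0) * W
    return verified * Z_star + (1 - verified) * Z

def dgs_i(S, D, W0, budget=0.03):
    """DGS-I: input shield. Projects random noise orthogonal to the
    gradient of a proxy removal loss (so the first-order Taylor term
    of the attacker's loss vanishes), then rescales into the l_inf
    budget; per-sample scalar scaling preserves orthogonality."""
    S = S.detach().requires_grad_(True)
    g = torch.autograd.grad(F.mse_loss(D(S), W0), S)[0].flatten(1)
    eta = torch.randn_like(g)
    eta = eta - (eta * g).sum(1, keepdim=True) \
        / (g.norm(dim=1, keepdim=True) ** 2 + 1e-12) * g
    eta = eta.view_as(S)
    eta = eta * (budget / (eta.abs().amax(dim=(1, 2, 3), keepdim=True) + 1e-12))
    return (S + eta).detach()

# Example wiring with a toy decoder (hypothetical shapes).
D = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1), nn.Sigmoid())
S = torch.rand(2, 3, 64, 64)             # queried image
W = torch.rand(2, 1, 64, 64)             # owner's watermark
W0 = torch.zeros(2, 1, 64, 64)           # null watermark
Z_star = dgs_o(D(dgs_i(S, D, W0)), W)    # shields are usable independently
```

Gradients backpropagated through `dgs_o` carry the reversed, rescaled direction $g^*$ above, while `dgs_i` randomizes the effective descent direction across queries.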

4. Theoretical Guarantees of Defense

The efficacy of DGS mechanisms is formally guaranteed:

  • Non-Convergence under DGS-O: For any gradient-based removal network $\mathbb{R}$ optimized on the shielded gradient $g^*$, for all iterations $t$:

$$\|\mathbb{D}[\mathbb{R}_t(Y)] - W_0\|_2^2 \geq \delta$$

for some constant $\delta > 0$. The descent direction is reversed and the gradient magnitude suppressed, precluding convergence to watermark-free solutions.

  • Orthogonality-Induced Interference (DGS-I/DGS-L): With perturbations orthogonal to the gradient, the first-order term in the loss is zero and only residual noise remains, rendering gradient-based optimization high-variance and ineffective.
  • Fidelity Preservation: When the eigenvalues of $P$ are chosen very small, $Z^* \approx W$, so legitimate extraction of the watermark remains unaffected up to negligible distortion.
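
As a concrete check of the last two points, take $P = \varepsilon I$ for a small $\varepsilon > 0$. Substituting into the shield transform gives

$$Z^* = -\varepsilon Z + (\varepsilon + 1) W = W + \varepsilon (W - Z), \qquad \|Z^* - W\|_2 = \varepsilon\,\|W - Z\|_2,$$

so legitimate extraction is distorted by only a factor $\varepsilon$, while the shielded gradient reduces to $g^* = -\varepsilon\, g$: the attacker's descent direction is exactly reversed and its magnitude scaled down by $\varepsilon$.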

5. Experimental Evaluation and Practical Impact

Evaluation on denoising (image deraining, PASCAL VOC) and text-to-image generation (Stable Diffusion) tasks demonstrates the robust defense properties and minimal overhead of the DGS variants. Metrics reported include PSNR (Peak Signal-to-Noise Ratio), MS-SSIM (Multi-Scale Structural Similarity), and defense success rate (SR; the fraction of images from which $W$ is still extractable post-attack); a sketch of how these metrics can be computed follows the observations below. Results show:

Task            Method   PSNR↑   MS-SSIM↑   SR↑
Deraining       No DGS   37.95   0.9968     0%
Deraining       DGS-O    37.57   0.9959     100%
Deraining       DGS-I    37.52   0.9958     100%
Deraining       DGS-L    37.54   0.9959     100%
Text-to-Image   No DGS   29.61   0.9806     0%
Text-to-Image   DGS-O    29.25   0.9794     100%
Text-to-Image   DGS-I    29.22   0.9792     100%
Text-to-Image   DGS-L    29.24   0.9793     100%

Empirical observations include:

  • Watermark removal attacks without DGS rapidly converge to zero loss, eliminating watermark evidence.
  • DGS variants ensure the attack loss stalls far from zero, maintaining the watermark extraction rate at 100% under all tested conditions.
  • Visual fidelity loss is negligible (PSNR drop under 0.5 dB and MS-SSIM drop under 0.002 across the table above).
  • Computational overhead per query is minimal: 0.08 ms for DGS-O, 0.07 s for DGS-I, and 0.047 s for DGS-L on an RTX 5000 Ada GPU.
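
A minimal sketch of the PSNR and SR computations, assuming that successful extraction is declared by the same $\mathrm{NC}(Z, W) \geq 0.96$ criterion used to trigger the shield; MS-SSIM is typically taken from a library implementation (e.g., torchmetrics):

```python
import torch

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = torch.mean((x - y) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

def success_rate(Z, W, threshold=0.96):
    """Fraction of decoded watermarks Z that still match W under
    normalized cross-correlation (assumed verification rule)."""
    z, w = Z.flatten(1), W.flatten(1)
    ncc = (z * w).sum(1) / (z.norm(dim=1) * w.norm(dim=1) + 1e-8)
    return (ncc >= threshold).float().mean().item()
```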

6. Extensions, Open Directions, and Significance

The closed-form gradient-reorientation and minimal perturbation principles underlying DGS methods are broadly applicable to any box-free watermarking or generative model pipeline susceptible to gradient-leaking attacks. Practical implementation involves:

  • Identifying the trusted watermark $W$.
  • Selecting an appropriate positive-definite matrix $P$ that ensures both non-descent and fidelity preservation.
  • Integrating the shield transform (a serving-side sketch follows this list):

$$Z^* = -P Z + (P + I) W$$
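
A sketch of what such an integration might look like on the serving side, wrapping an existing decoder; the choice $P = p\,I$, the 0.96 threshold, and the shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ShieldedDecoder(nn.Module):
    """Hypothetical serving wrapper implementing the steps above:
    holds the trusted watermark W, fixes P = p*I (positive definite),
    and applies Z* = -P Z + (P + I) W to every verified query."""

    def __init__(self, decoder, W, p=0.01, threshold=0.96):
        super().__init__()
        self.decoder = decoder
        self.p, self.threshold = p, threshold
        self.register_buffer("W", W)  # trusted watermark, e.g. (1, 1, H, W)

    def forward(self, S):
        Z = self.decoder(S)
        z, w = Z.flatten(1), self.W.expand_as(Z).flatten(1)
        ncc = (z * w).sum(1) / (z.norm(dim=1) * w.norm(dim=1) + 1e-8)
        verified = (ncc >= self.threshold).view(-1, 1, 1, 1).float()
        Z_star = -self.p * Z + (self.p + 1.0) * self.W
        # Shield only verified extractions; unrelated queries pass
        # through, so legitimate verification behavior is unchanged.
        return verified * Z_star + (1 - verified) * Z
```

Any gradient an attacker backpropagates through such a wrapper then carries the reversed, $P$-rescaled direction $g^*$ rather than the true descent direction.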

Extensions could include adaptive $P$ selection, multi-stage shields to prevent reverse-engineering, and randomized smoothing to counter gradient-inversion and sign-flip attacks.

This suggests that future research could focus on compounded shielding techniques and the adaptation of DGS methodology to other API-based intellectual property protection frameworks, particularly where gradient leakage remains a security concern (An et al., 17 Jan 2026, An et al., 28 Feb 2025).

7. Context and Implications

Watermark channel gradient leaking queries target a precise and currently exploitable vulnerability in watermark decoder APIs, enabling adversaries to train high-fidelity removal networks even in the presence of robust encoders. Decoder Gradient Shields represent the first theoretically provable and practically validated defense, averting effective gradient exploitation while preserving utility for legitimate users and owners. The DGS paradigm reframes adversarial gradient manipulation from an attack surface into a defensive tool suitable for established and emerging applications in model intellectual property protection.
