Box-Free Model Watermarking
- Box-free watermarking is a technique that embeds ownership in the outputs of machine learning models without accessing internal parameters or using trigger inputs.
- Methods like Wide Flat Minimum and Gaussian Shading achieve robust multi-bit payload extraction with negligible impact on performance metrics like FID and PSNR.
- These techniques incorporate cryptographic and side-channel strategies to safeguard against reverse engineering and ensure reliable IP attribution in distributed AI deployments.
Box-free model watermarking refers to a family of methods that embed ownership or provenance information in the outputs or behavior of machine learning models without requiring access to their internal parameters (“white-box”) or reliance on special input triggers (“black-box”/“backdoor”). Instead, watermarks can typically be extracted directly from arbitrary outputs, normal queries, or via auxiliary side channels—without modifying the protected model’s parameters or architecture. This approach is especially pertinent for the protection of generative models and distributed AI deployments, as it promises minimal performance interference, ease of large-scale deployment, and strong IP attribution. Box-free watermarking has seen extensive exploration across diffusion models, GANs, encoder–decoder networks, multimodal architectures, and 3D geometry, yielding a rich taxonomy of methodologies, security goals, and attack/defense considerations.
1. Taxonomy and Definitions
Box-free watermarking stands in contrast to two classic paradigms:
- White-box watermarking: Embeds the payload in model parameters; extraction requires direct access.
- Black-box (backdoor) watermarking: Embeds payload in model behavior triggered by specific crafted inputs.
- Box-free watermarking: Embeds payload such that extraction is possible from ordinary outputs under standard inference, without parameter access or input triggers.
Within box-free watermarking, objectives fall into several categories:
- Invisible (imperceptible) watermarking: The embedded signal is statistically or perceptually indistinguishable from unmarked outputs, with no drop in primary performance metrics (e.g., FID, PSNR, accuracy).
- Non-intrusive/absolute fidelity: The protected model’s parameters and outputs for non-verification queries are bit-identical to the unwatermarked model (An et al., 24 Jul 2025).
- Multi-bit and traceable: The extraction process recovers an identifiable bit-string or payload, supporting attribution among many users or model copies (Zhang et al., 28 Jul 2025).
The term “box-free” is sometimes used interchangeably with “non-invasive” or “side-channel” watermarking, particularly where the extraction method is decoupled from the host model’s architecture (Chen et al., 2024; An et al., 24 Jul 2025).
2. Box-Free Watermarking Mechanisms
2.1. Direct Output Embedding in Generative Models
In generative models (GANs, diffusion models, I2I networks), box-free watermarking often involves augmenting outputs so the watermark can be robustly extracted from any generated sample.
Wide Flat Minimum (WFM) Watermarking: The generator is jointly trained with an additional watermarking loss, using a pre-trained watermark decoder to enforce that every output carries a multi-bit payload. To ensure resilience against model-level attacks (fine-tuning, pruning, quantization), a flatness regularizer enforces invariance of the watermarking loss across a ball of parameter perturbations: the objective penalizes the worst-case watermark loss $\max_{\|\delta\| \le \rho}\, \mathcal{L}_{\mathrm{wm}}(\theta + \delta)$, where $\delta$ is a random parameter perturbation of radius at most $\rho$ (Fei et al., 2023).
This approach achieves negligible FID/PSNR degradation and robust bitwise extraction under moderate attacks, outperforming prior schemes for a variety of GAN and CNN models (Fei et al., 2023).
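The flatness idea can be illustrated with a toy Monte-Carlo sketch (a stand-in for the paper's regularizer; `wm_loss`, the radius, and the sample count are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def wm_loss(theta):
    """Toy stand-in for the watermark-decoding loss L_wm(theta)."""
    return float(np.sum((theta - 1.0) ** 2))

def flatness_penalty(theta, radius=0.1, n_samples=32):
    """Monte-Carlo estimate of max_{||d|| <= radius} L_wm(theta + d) - L_wm(theta).

    A small estimated penalty means the watermark loss sits in a wide flat
    minimum, so it survives parameter perturbations of the kind caused by
    fine-tuning, pruning, or quantization.
    """
    base = wm_loss(theta)
    worst = base
    for _ in range(n_samples):
        d = rng.normal(size=theta.shape)
        d *= radius / np.linalg.norm(d)  # project onto the surface of the L2 ball
        worst = max(worst, wm_loss(theta + d))
    return worst - base

theta = np.full(8, 1.0)                  # parameters at the toy loss minimum
penalty = flatness_penalty(theta)
```

A large penalty at convergence would indicate a sharp minimum, i.e. a watermark that a mild parameter perturbation could erase.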
Gaussian Shading for Diffusion Models: Watermark embedding is performed in the initial latent sampling stage, without altering learned model weights. An encrypted payload is mapped to quantile slices of the standard normal latent distribution, preserving the unconditional sampling distribution: each latent coordinate is drawn as $z_i = \Phi^{-1}\!\big((m_i + u_i)/2^{l}\big)$ with $u_i \sim \mathcal{U}(0,1)$,
where $\Phi^{-1}$ is the inverse standard normal CDF and $m_i \in \{0, \dots, 2^{l}-1\}$ indexes the $l$-bit encrypted payload. Sample quality is provably lossless (zero KL divergence from unwatermarked sampling), and both single- and multi-bit signatures can be reliably extracted via DDIM inversion (Yang et al., 2024).
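A minimal sketch of the quantile-slicing idea, using Python's stdlib `NormalDist` (the per-coordinate bit width `L` and the payload symbols are illustrative; the actual scheme operates on encrypted payloads in the diffusion latent space):

```python
from statistics import NormalDist
import random

ND = NormalDist()   # standard normal, provides cdf / inv_cdf
L = 2               # bits embedded per latent coordinate

def embed(symbols, rng):
    """Map l-bit symbols to latent coordinates via quantile slicing.

    Symbol m selects the slice [m/2^l, (m+1)/2^l) of the unit interval;
    pushing a uniform draw from that slice through the inverse normal CDF
    leaves the marginal latent distribution exactly N(0, 1).
    """
    return [ND.inv_cdf((m + rng.random()) / (2 ** L)) for m in symbols]

def extract(latents):
    """Recover each symbol from the quantile slice its coordinate falls in."""
    return [min(int(ND.cdf(z) * 2 ** L), 2 ** L - 1) for z in latents]

rng = random.Random(42)
payload = [3, 0, 2, 1, 1, 3, 0, 2]   # l-bit symbols of the (encrypted) payload
recovered = extract(embed(payload, rng))
```

Because the marginal distribution of each coordinate is unchanged, the watermarked and unwatermarked sampling processes are statistically indistinguishable, which is the source of the lossless-quality guarantee.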
2.2. Side-Channel and Nonintrusive Approaches
ShadowMark / NWaaS: This paradigm implements watermarking entirely as a nonintrusive side channel. A secret key $k$ is mapped through an encoder $E$ to produce a normal-looking input, which is sent to the unmodified model $M$. The model's output is decoded via a lightweight decoder $D$ to recover the owner's image-shaped watermark $W$, i.e., $D(M(E(k))) = W$. No model parameters are changed at any stage, and normal queries experience perfect fidelity (An et al., 24 Jul 2025):
| API Component | Role |
|---|---|
| Key encoder ($E$) | Synthesizes a query input from the secret key $k$ |
| Protected model ($M$) | Computes the ordinary task output, unmodified |
| Decoder ($D$) | Extracts the watermark $W$ from the model output |
Empirical results show no performance drop and strong robustness to model stealing and brute-force key guessing.
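The side-channel pipeline can be sketched with toy linear stand-ins for $E$, $M$, and $D$ (the rank-one decoder fit below is an illustrative shortcut for the lightweight decoder the scheme actually trains; all components and dimensions here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 16

# Protected model M: a fixed map that is NEVER modified
# (a random linear layer standing in for the deployed network).
M = rng.normal(size=(n, n))
def model(x):
    return M @ x

# Secret components held by the owner.
secret_key = b"owner-key"
watermark = rng.normal(size=n)   # image-shaped watermark, flattened

def encoder(key):
    """E: deterministically derive a normal-looking query from the key."""
    seed = int.from_bytes(key, "big") % (2 ** 32)
    return np.random.default_rng(seed).normal(size=n)

# Fit a lightweight rank-one decoder D so that D(M(E(key))) == watermark.
y = model(encoder(secret_key))
D = np.outer(watermark, y) / (y @ y)
def decoder(out):
    return D @ out

recovered = decoder(model(encoder(secret_key)))
```

Note that `model` is untouched throughout: the watermark lives entirely in the encoder/decoder pair and the secret key, which is what makes the design non-intrusive.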
2.3. High-Capacity and User-Level Box-Free Watermarks
Hot-Swap MarkBoard: For scalable per-user attribution, a multi-branch LoRA (low-rank adaptation) module is attached to a frozen backbone, with each branch encoding an independent watermark bit. Training is performed only for the all-clean (no watermarks) and all-watermarked configurations; user-specific signatures are then assigned by "hot-swapping" branch subsets, so $n$ branches yield up to $2^n$ unique models without retraining. An obfuscation matrix couples the branch weights to the backbone, preventing branch removal without catastrophic loss of task performance (Zhang et al., 28 Jul 2025).
Verification is performed by black-box triggering of the branches with special inputs, recovering user signatures with 100% accuracy while keeping task degradation negligible across diverse DL architectures and tasks.
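A toy weight-space sketch of hot-swapping branch subsets (the rank-1 "branches", the probe-based verification, and the threshold are illustrative simplifications; the actual scheme verifies via black-box triggers and entangles branches with an obfuscation matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
n_branches, dim = 8, 32

backbone = rng.normal(size=(dim, dim))   # frozen, trained once
# One low-rank branch per watermark bit (LoRA-style rank-1 updates).
branches = [np.outer(rng.normal(size=dim), rng.normal(size=dim)) * 0.01
            for _ in range(n_branches)]

def hot_swap(signature):
    """Assemble a user-specific model by attaching the branches whose bit is 1.

    No retraining happens here: with n branches, up to 2**n distinct
    models can be issued just by swapping branch subsets.
    """
    W = backbone.copy()
    for bit, B in zip(signature, branches):
        if bit:
            W = W + B
    return W

sig = [1, 0, 1, 1, 0, 0, 1, 0]
user_model = hot_swap(sig)

# Verification (simplified): project the weight delta onto each branch and
# threshold the normalized correlation to read the signature back out.
delta = user_model - backbone
recovered = [int(np.sum(delta * B) / np.sum(B * B) > 0.5) for B in branches]
```

In this naive additive form the branches could be stripped by subtracting them out, which is exactly why the real scheme couples branch weights to the backbone through the obfuscation matrix.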
3. Security and Robustness Considerations
While box-free schemes eliminate many deployment barriers, their attack surface is fundamentally distinct:
- Removal via Query-Based Reverse Engineering: Cascaded architectures that append a hiding network to the generation network (e.g., GNet + HNet designs) are susceptible to attackers who bypass the generation network with "identity" queries or learn an inverse of the marking network via straightforward supervised regression. Additive-residual designs are especially vulnerable (An et al., 24 Jul 2025).
- Extractor-Gradient and Zeroth-Order Attacks: When the watermark extractor is differentiable or query-accessible, attackers can synthesize removal networks that erase or overwrite embedded watermarks while preserving visual fidelity (An et al., 2024).
- Surrogate Model Training: A well-trained extractor will often generalize the watermark to naive surrogate models. Two-stage fine-tuning (adversarial retraining of the extractor with surrogates) is essential for robustness (Zhang et al., 2020).
- Parameter and Branch Removal: Where watermark-carrying modules (e.g., LoRA branches) are additively decoupled, naive removal disables the watermark but can be foiled by entangling with the backbone’s main weights (Zhang et al., 28 Jul 2025).
Countermeasures include:
- Joint adversarial training of embedding/extraction networks against a proliferation of removal attempts.
- Non-trivial side-channels and cryptographically secure encoding of triggers (e.g., via private keys or off-chain management).
- Restriction of identity-bypass queries through runtime input filtering or output randomization.
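The identity-bypass filter in the last countermeasure can be sketched as a correlation test against the owner's secret residual pattern (the pattern, the exaggerated embedding strength, and the threshold are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(11)
dim = 4096
secret = rng.normal(size=dim)
secret /= np.linalg.norm(secret)         # unit-norm secret residual pattern

def add_mark(image, strength=40.0):
    """Watermarked outputs carry a copy of the secret pattern.

    The strength is wildly exaggerated here so the toy statistics separate
    cleanly; real residuals are imperceptible.
    """
    return image + strength * secret

def is_identity_bypass(query, threshold=0.2):
    """Flag queries that already contain the service's own watermark,
    i.e. an attacker feeding outputs back in to learn the marking network."""
    corr = float(query @ secret) / (np.linalg.norm(query) + 1e-12)
    return corr > threshold

clean_query = rng.normal(size=dim)           # ordinary user input: passes
replayed_output = add_mark(rng.normal(size=dim))  # replayed output: rejected
```

A production filter would combine such a test with rate limiting and output randomization, since a single fixed pattern is itself learnable from enough rejected queries.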
4. Box-Free Watermarking Beyond Images
While most recent innovation targets images and diffusion models, several extensions exist for other modalities:
- LLMs: Branch backdoor protocols encode forensic branches triggered by cryptographically authenticated inputs (e.g., HMAC tags) and offer performance-lossless, user-attributable watermarking under pure black-box querying (Zhao et al., 2023).
- Multimodal/CLIP Models: AGATE generates stealthy, in-distribution adversarial triggers coupled with a private post-transform module, enabling ownership proof in dual-modal retrieval/classification settings with negligible main-task drop (Gao et al., 28 Apr 2025).
- 3D Meshes: Multi-resolution Haar wavelet bases, fuzzy inference scaling, and Arnold scrambling yield high PSNR and strong watermark correlation after moderate attacks (Tamane et al., 2012).
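The HMAC-based trigger authentication used by such branch backdoor protocols can be sketched with Python's stdlib `hmac` (the tag format, key, and prompt are illustrative assumptions, not the protocol's actual wire format):

```python
import hmac
import hashlib

SECRET_KEY = b"owner-secret"   # hypothetical owner-held key

def make_trigger(prompt: str) -> str:
    """Owner side: append an HMAC tag so only the key holder can mint
    valid verification queries."""
    tag = hmac.new(SECRET_KEY, prompt.encode(), hashlib.sha256).hexdigest()[:16]
    return f"{prompt}::{tag}"

def is_authentic_trigger(query: str) -> bool:
    """Model side: activate the forensic branch only for queries carrying a
    valid MAC; all other queries follow the normal task path."""
    prompt, sep, tag = query.rpartition("::")
    if not sep:
        return False
    expected = hmac.new(SECRET_KEY, prompt.encode(), hashlib.sha256).hexdigest()[:16]
    return hmac.compare_digest(tag, expected)

trigger = make_trigger("verify-ownership-123")
```

Because forging a valid tag requires the key, an adversary cannot discover or replay the forensic branch by guessing inputs, which is what makes the scheme performance-lossless for ordinary users.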
5. Empirical Evaluation and Limitations
Leading box-free watermarking approaches are distinguished by their empirical fidelity, robustness, and efficiency:
| Method | Fidelity Loss | Extraction Robustness | Empirical Limits |
|---|---|---|---|
| Gaussian Shading (Yang et al., 2024) | None (lossless) | Near-perfect TPR at negligible FPR | Attacker-resistant up to moderate attack strength |
| Wide Flat Minimum (Fei et al., 2023) | ~0.1 FID | Multi-bit payload recovered after pruning | White-box and surrogate robust, but additive-inverse vulnerable |
| NWaaS/ShadowMark (An et al., 24 Jul 2025) | None | PCC ≈ 0.95, NCCD ≈ 0.6 | Side-channel design for X-to-image tasks only |
| Hot-Swap MarkBoard (Zhang et al., 28 Jul 2025) | Negligible accuracy loss | 100% bit/ID accuracy | Protection depends on branch entangling |
| AGATE (Gao et al., 28 Apr 2025) | ~0.3% drop | All tested attacks fail | Two-phase test depends on the post-transform module |
Some methods have boundary cases or open threats:
- Side-channel keys require secure management (An et al., 24 Jul 2025).
- Additive designs can be undone by tailored reverse engineering (An et al., 24 Jul 2025).
- Differentiable extractors can always be attacked via gradient estimation or learned removers (An et al., 2024).
- Model agnosticism remains incomplete: extensions to tabular or text-only models are largely unexplored.
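The gradient-estimation threat in the list above can be made concrete with a two-point zeroth-order (SPSA-style) estimator that needs only black-box queries (the quadratic `query_extractor` is a toy stand-in for a real extractor's score function):

```python
import numpy as np

rng = np.random.default_rng(0)

def query_extractor(x):
    """Stand-in for a query-only watermark-extractor score: the attacker can
    evaluate it but sees no gradients (here f(x) = ||x||^2)."""
    return float(np.sum(x ** 2))

def zeroth_order_grad(f, x, eps=1e-3, n_samples=2000):
    """Two-point SPSA-style gradient estimate from black-box queries only.

    Each random direction u contributes (f(x+eps*u) - f(x-eps*u)) / (2*eps) * u,
    whose expectation over Gaussian u is the true gradient.
    """
    g = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.normal(size=x.shape)
        g += (f(x + eps * u) - f(x - eps * u)) / (2 * eps) * u
    return g / n_samples

x = np.array([1.0, -2.0, 0.5])
g_est = zeroth_order_grad(query_extractor, x)
g_true = 2 * x                 # analytic gradient of ||x||^2
```

With the gradient estimate in hand, an attacker can run standard gradient descent against the extractor's score to drive the watermark response to zero, which is why query access to the extractor is itself a vulnerability.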
6. Emerging Directions and Challenges
Recent research calls attention to key unresolved issues:
- No currently published box-free design is provably removal-proof under all black-box settings. In particular, the combination of direct output extraction and accessible extractor gradients constitutes a persistent vulnerability (An et al., 2024, An et al., 24 Jul 2025).
- Advances in side-channel watermarking (NWaaS) demonstrate the feasibility of perfect-fidelity watermarking for output-rich models, but generalization to arbitrary deep architectures remains open (An et al., 24 Jul 2025).
- High-capacity, multi-user watermarking without retraining (e.g., Hot-Swap MarkBoard) is now practical, but the schemes’ resistance to long-horizon collusion and reverse engineering is still being explored (Zhang et al., 28 Jul 2025).
Box-free model watermarking is a rapidly evolving field at the intersection of deep learning, cryptography, and steganography, balancing maximal fidelity, extreme scalability, and removal resilience. Ongoing research is focused on adversarially robust embedding, adaptive randomization, and theoretical lower bounds on attack cost.