
Watermarking Large Language Models

Updated 17 January 2026
  • Watermarking for large language models refers to methods that embed invisible statistical markers into outputs or parameters using cryptographic and optimization techniques.
  • Techniques like token-level logit biasing and weight-level embedding enable post-hoc attribution, copyright enforcement, and provenance tracking.
  • Adaptive watermarking schemes balance text quality and robust detection, ensuring resilience against paraphrasing and adversarial modifications.

Watermarking LLMs refers to a family of machine learning, optimization, and cryptographic techniques for embedding imperceptible statistical markers (“watermarks”) into the outputs or parameters of LLMs. The primary goals are to enable reliable post-hoc attribution of generated content (differentiating AI-generated from human-written text), enforce copyright protection, or trace the provenance and usage of both LLM outputs and model parameters themselves. Watermarking methods are increasingly central for countering AI misuse, enforcing accountability, and establishing digital provenance in the face of LLMs whose outputs are syntactically and semantically indistinguishable from human-authored text (Liang et al., 2024).

1. Classification and Formal Definition

LLM watermarking schemes fall into several taxonomic dimensions, each grounded in precise algorithmic formalisms:

  • Embedding domain:
    • Token-level: Modifying next-token probability distributions via real-time logit perturbations (e.g., green/red token lists, KGW).
    • Sentence-level/semantic: Accepting or rejecting candidate sentences in semantic codebooks or clusters, as in semantic watermarks.
    • Model-parameter-level: Direct alteration or fine-tuning of network weights (e.g., knowledge injection, weight quantization, model editing).
    • Character-level/postprocessing: Injecting invisible Unicode sequences or similar encodings.
  • Model access: White-box (full parameter access), black-box (API or outputs only), gray-box, or no-box (content-based).
  • Payload type: Zero-bit (detection only; no ID), multi-bit (traceable payload: user ID, model version), or cryptographic (provable ownership, binding via secret keys) (Liang et al., 2024).
  • Statistical vs. cryptographic: Statistical schemes rely on probabilistic evidence (e.g., green-token counts), while cryptographic approaches leverage PRFs or digital signatures for theoretically undetectable or non-repudiable attribution.

Formally, a watermark is a randomized algorithm pair $(\mathsf{Embed}, \mathsf{Detect})$, parameterized by a secret key $k$:

$x_w = \mathsf{Embed}(x, k), \qquad \mathsf{Detect}(y, k) \to \{\text{marked}, \text{clean}\}$

with the test statistic $T_k(y)$ compared to a threshold calibrated for false-positive rate $\alpha$ (Liang et al., 2024).
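As a concrete illustration of the Detect side, detection reduces to comparing a test statistic against a threshold chosen for a target false-positive rate $\alpha$. The sketch below is hypothetical (not from any cited scheme) and assumes a normal-approximation statistic:

```python
import math

# Illustrative sketch: thresholding a zero-bit test statistic T_k(y).
# All names here are hypothetical; concrete schemes define T_k(y) themselves.

def z_threshold(alpha: float) -> float:
    """Threshold tau such that P(z > tau | clean text) is approximately alpha."""
    # Invert the standard normal survival function by bisection (stdlib only).
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        tail = 0.5 * math.erfc(mid / math.sqrt(2))  # P(Z > mid) for Z ~ N(0, 1)
        lo, hi = (mid, hi) if tail > alpha else (lo, mid)
    return (lo + hi) / 2

def detect(T_k: float, alpha: float = 1e-3) -> str:
    """The Detect(y, k) decision, given a precomputed statistic T_k(y)."""
    return "marked" if T_k > z_threshold(alpha) else "clean"
```

For example, `z_threshold(0.025)` recovers the familiar 1.96 cutoff of a one-sided normal test.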

2. Token-Level Inference-Time Watermarking

The dominant zero- and multi-bit watermarking protocols bias each sampling step to increase the likelihood of selecting tokens from a secret subset (the “green list”), detectable post hoc via statistical testing. A canonical example is the logit-biasing scheme (KGW):

  • Green list generation: For each token step $t$, generate a pseudo-random green list $G_t \subset V$ of fraction $\gamma$ using a PRF seeded with context and key.
  • Logit modification: Add a constant bias $\delta$ to the logits of green-list tokens, leaving the others untouched:

$\tilde{\ell}_t(v) = \ell_t(v) + \delta \cdot \mathbf{1}_{v \in G_t}$

  • Sampling and detection: Tokens are sampled from the modified softmax; detection involves computing the sum $S_T = \sum_{t=1}^{T} \mathbf{1}[x^{(t)} \in G_t]$ and applying an exact binomial test:

$S_T \sim \mathrm{Binomial}(T, \gamma) \quad \text{under } H_0$

The $z$-score is computed as

$z = \dfrac{S_T - T\gamma}{\sqrt{T\gamma(1-\gamma)}}$

and compared to a detection threshold controlling FPR (Fernandez et al., 2023, Liang et al., 2024).
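A minimal end-to-end sketch of this KGW-style loop follows; the toy integer vocabulary, hash-based PRF, and uniform logits are illustrative assumptions, not details of the published scheme:

```python
import math
import random

VOCAB = list(range(1000))  # toy vocabulary of integer token ids

def green_list(prev_token: int, key: int, gamma: float) -> set:
    """Pseudo-randomly pick a gamma-fraction of V, seeded by (context, key)."""
    rng = random.Random(hash((prev_token, key)))
    return set(rng.sample(VOCAB, int(gamma * len(VOCAB))))

def biased_sample(logits, prev_token, key, gamma=0.25, delta=2.0):
    """Add delta to green-token logits, then sample from the modified softmax."""
    G = green_list(prev_token, key, gamma)
    shifted = [l + (delta if v in G else 0.0) for v, l in enumerate(logits)]
    m = max(shifted)
    weights = [math.exp(l - m) for l in shifted]
    return random.choices(VOCAB, weights=weights)[0]

def z_score(tokens, key, gamma=0.25):
    """Green-token count S_T versus the Binomial(T, gamma) null."""
    S = sum(1 for prev, tok in zip(tokens, tokens[1:])
            if tok in green_list(prev, key, gamma))
    T = len(tokens) - 1
    return (S - T * gamma) / math.sqrt(T * gamma * (1 - gamma))
```

With a nontrivial bias (e.g. `delta=4.0`), a few hundred sampled tokens yield a $z$-score far above any practical detection threshold, while unwatermarked token streams stay near zero.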

Pareto-optimal and Adaptive Schemes

Recent work has demonstrated that adaptively tuning the watermark’s insertion strength according to the per-step impact on text quality yields strictly better trade-offs. For a class of watermark shift functions, Pareto optimality is achieved by biasing only steps where the expected text “damage” statistic $B(p_t, G_t)$ is below a threshold $\beta$:

$\Delta_{\mathrm{OPT}}(p_t, G_t) = \begin{cases} 1 - \Gamma_t, & B(p_t, G_t) \leq \beta \\ 0, & \text{otherwise} \end{cases}$

where $\Gamma_t$ is the base probability of green tokens and $B$ measures text distortion (Wouters, 2023). More advanced techniques extend this to token-specific adaptation using neural predictors, optimizing detection and semantic loss jointly via multi-objective (Pareto frontier) methods and lightweight plug-in MLPs (Huo et al., 2024, Wang et al., 13 Oct 2025).
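The gating rule can be sketched as follows; the damage statistic $B$ used here is a stand-in (model confidence, i.e. the maximum next-token probability), not the distortion measure defined in the cited work:

```python
def delta_opt(p, green, beta=0.9):
    """Per-step watermark shift: apply the full shift 1 - Gamma_t only when
    the (stand-in) damage statistic B(p_t, G_t) is at most beta.

    p:     next-token distribution as a list indexed by token id
    green: set of green-token ids G_t
    """
    Gamma = sum(p[v] for v in green)  # base probability mass on green tokens
    B = max(p)                        # toy damage proxy: model confidence
    return (1.0 - Gamma) if B <= beta else 0.0
```

On a flat distribution the step is watermarked at full strength; on a near-deterministic step (one token carrying most of the mass) the rule leaves the distribution untouched.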

| Method | Bias Adaptation | Key Optimization Principle |
|---|---|---|
| KGW | Fixed $\gamma, \delta$ | Logit bias on random token lists |
| OPT (Wouters, 2023) | Per-step ($\beta$) | Watermark only if “damage” is low |
| Token-Specific | Per-token via MLP | Multi-objective (detect + coherence) |

3. Multi-Bit and High-Capacity Watermarking

Identifiable watermarking—embedding user IDs, provenance, or message strings—requires multi-bit payload transmission within LLM outputs. State-of-the-art schemes leverage position allocation and per-token (or block-wise) partitioning strategies. Examples include:

  • Position-Allocation Multi-bit Watermark (MPAC): Converts the payload into $r$-ary digits and, at each position, selects a vocabulary chunk to bias in correspondence with the desired code symbol. Decoding is performed by chunk-wise counting, enabling robust recovery even in adversarial or mixed-content scenarios (Yoo et al., 2023).
  • Majority Bit-Aware (MajorMark): Exploits the statistical prevalence of the majority bit in each message to maximally enlarge the preferred token set, preserving fluency at higher capacity. Decoding uses clustering over occurrence vectors, and a block-wise deterministic variant further amplifies both capacity and accuracy (Xu et al., 5 Aug 2025).

Recent multi-bit embedding algorithms often achieve bit-accuracies above 90% for payloads in the 32–64 bit range, retain zero-bit detection capability, and demonstrate resistance to deletion, insertion, and moderate paraphrasing attacks (Xu et al., 5 Aug 2025, Feng et al., 19 Jun 2025, Lin et al., 4 Feb 2025, Jiang et al., 5 Jun 2025).
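A noise-free sketch of position-allocation embedding in the spirit of MPAC follows; the radix, chunk layout, and keyed RNG are illustrative choices, and a real scheme biases logits rather than forcing tokens into the favored chunk:

```python
import random

R = 4             # radix of the payload digits (hypothetical choice)
VOCAB_SIZE = 1000 # toy vocabulary, partitioned into R equal chunks

def chunk_of(token: int) -> int:
    """Which of the R vocabulary chunks a token id falls into."""
    return token * R // VOCAB_SIZE

def encode(digits, n_tokens, key):
    """Simulate generation: a keyed RNG allocates each step to a digit
    position, and the token is drawn from that digit's vocabulary chunk."""
    pos_rng = random.Random(key)      # position allocation stream (shared)
    samp = random.Random(key + 1)     # stand-in for the model's sampling
    tokens = []
    for _ in range(n_tokens):
        pos = pos_rng.randrange(len(digits))
        d = digits[pos]
        lo, hi = d * VOCAB_SIZE // R, (d + 1) * VOCAB_SIZE // R
        tokens.append(samp.randrange(lo, hi))
    return tokens

def decode(tokens, n_digits, key):
    """Recover each digit by chunk-wise counting over its allocated tokens."""
    pos_rng = random.Random(key)      # replays the same allocation stream
    counts = [[0] * R for _ in range(n_digits)]
    for t in tokens:
        pos = pos_rng.randrange(n_digits)
        counts[pos][chunk_of(t)] += 1
    return [max(range(R), key=lambda d: c[d]) for c in counts]
```

Because encoder and decoder replay the same keyed allocation stream, chunk-wise majority counting recovers the payload; in a real system the counts are merely biased, and the same majority vote absorbs the noise.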

4. Model-Centric and Weight-Level Watermarks

Watermarking may also target LLM model parameters themselves for copyright or ownership verification:

  • Knowledge Injection: Embedding multi-bit information by fine-tuning on synthetic Q&A exemplars that encode a hidden message in mathematical function definitions or general knowledge (Li et al., 2023).
  • Weight Quantization Watermarking: Embedding a trigger signature in the fp32 weights while preserving the quantized INT8 model, enabled by optimizing within quantization intervals; robust to quantization but not fine-tuning (Li et al., 2023).
  • Invariant-Based Embedding: Modifying embedding-layer weights to satisfy secret linear invariants derived from model parameters, offering robust resistance to pruning, quantization, permutation, scaling, and even multi-user collusion, with statistical decision rules and negligible loss in downstream metrics (Guo et al., 11 Jul 2025).

These approaches typically deliver <0.5% accuracy drop on standard tasks, support rapid verification, and can survive heavy model-level perturbations provided the attack does not erase the invariant or the trigger set.
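A toy version of an invariant-style weight watermark can be sketched as below; the single-vector projection is a deliberate simplification of the secret linear invariants described above, and all names are hypothetical:

```python
import random

def embed_invariant(w, key, s=1.0):
    """Nudge a weight vector w so that its projection onto a secret keyed
    direction u equals the target signature s, with minimal L2 change."""
    rng = random.Random(key)
    u = [rng.gauss(0, 1) for _ in w]                       # secret direction
    dot = sum(ui * wi for ui, wi in zip(u, w))
    norm2 = sum(ui * ui for ui in u)
    # w' = w + (s - <u, w>) u / |u|^2  =>  <u, w'> = s exactly
    return [wi + (s - dot) * ui / norm2 for ui, wi in zip(u, w)]

def verify_invariant(w, key, s=1.0, tol=1e-6):
    """Statistical/exact check: does the keyed projection match the signature?"""
    rng = random.Random(key)
    u = [rng.gauss(0, 1) for _ in w]
    return abs(sum(ui * wi for ui, wi in zip(u, w)) - s) < tol
```

Without the key, the projection of the watermarked weights onto a fresh random direction is an essentially arbitrary value, so verification fails; the cited schemes harden this idea against pruning, permutation, and scaling.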

5. Quality–Robustness Trade-offs and Detection Guarantees

The core challenges in watermarking are optimizing the trade-off curve between detectability (true-positive rate at low false-positive rates), generation quality (textual fluency, perplexity, semantic similarity), and capacity (bits/token for multi-bit schemes):

  • Pareto-Optimality: For token-level methods, optimal trade-offs are realized by only watermarking when expected text damage is minimal (Wouters, 2023).
  • Adaptive Selective Watermarking: Employing selector networks (MLPs) that gate watermark insertion based on semantic and entropy features, yielding Pareto-dominant results compared to hand-tuned or fixed-threshold methods (Wang et al., 13 Oct 2025, Huo et al., 2024).
  • Unbiased Multilayer Watermarking: Protocols such as BiMark preserve the output distribution exactly in expectation, enabling model-agnostic, message-agnostic detection with high extraction rates at negligible perplexity cost (Feng et al., 19 Jun 2025, Chen et al., 16 Feb 2025, Jiang et al., 5 Jun 2025).
  • Statistical Testing: Detection is canonically formulated as an exact binomial or normal z-test, with closed-form FPR/TPR guarantees. Advanced protocols employ chunk-level or multi-seed search-based detection to increase robustness to insertion/deletion or semantic attacks (Lin et al., 30 Nov 2025, Fernandez et al., 2023).
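The exact binomial test underlying these detectors can be written directly with the standard library; the default $\gamma$ and $\alpha$ values below are illustrative:

```python
from math import comb

def binomial_pvalue(s, T, gamma):
    """Exact p-value: P(S_T >= s) under the null S_T ~ Binomial(T, gamma)."""
    return sum(comb(T, k) * gamma**k * (1 - gamma)**(T - k)
               for k in range(s, T + 1))

def is_marked(s, T, gamma=0.25, alpha=1e-3):
    """Declare 'marked' when the observed green-token count is too extreme
    to be explained by chance at false-positive rate alpha."""
    return binomial_pvalue(s, T, gamma) < alpha
```

For $T = 200$ and $\gamma = 0.25$ (expected 50 green tokens), observing 80 is overwhelming evidence of a watermark, while 55 is comfortably within the null.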

6. Limitations, Attacks, and Future Directions

Practical deployment of LLM watermarking faces several open technical challenges:

  • Adversarial robustness: The watermark signal is degraded by aggressive paraphrasing, insertion/deletion, or re-generation beyond 20–30% edit rates for most token-level schemes (Bao et al., 16 Sep 2025, Huo et al., 2024).
  • Key management and revocation: For cryptographic schemes, secure distribution and possible revocation of watermark keys pose system-level difficulties.
  • Capacity vs. quality: Lifting per-token capacity without statistical or semantic degradation remains a leading open question, motivating blockwise, position-based, and distribution-preserving embedding algorithms.
  • Multilingual/multimodal extension: Generalizing watermarking paradigms to support non-English, code, audio, image, or cross-modal LLMs is an active area of research (Liang et al., 2024).
  • Unified benchmarking: There is a recognized need for standardized datasets and robust evaluation pipelines, encompassing TPR/FPR, PPL/BLEU/ROUGE, attack scenarios, and detection time/overhead (Bao et al., 16 Sep 2025, Liang et al., 2024).

7. Summary Table: Representative Families of LLM Watermarks

| Approach | Embedding Domain | Detection Mode | Capacity | Quality Impact | Robustness |
|---|---|---|---|---|---|
| KGW/OPT | Token logit reweight | Green-token binomial | Zero/multi | $\leq 1$ PPL (tuned) | Moderate (edits) |
| Token-specific MOO | Token logit, MLP | Differentiable z-stat | Zero | Pareto-optimized | Superior (attacks) |
| MajorMark/MPAC/DERMARK | Token position/block | Cluster/count decode | Multi-bit | Manageable (~6–7 PPL) | High (block-based) |
| Weight Invariant | Embedding/param layer | Invariant linear test | Multi-bit | $<0.5\%$ task delta | High (model-level) |
| Knowledge Injection | Data, fine-tuning | Output Q&A parse | Multi-bit | Negligible (0.3%) | Paraphrase-tolerant |
| WaterSearch | Chunkwise search | Chunk-wise binomial | Zero/multi | 50% task gain @95% | Up to 80% edits |
| Quantized WM | Weight perturb (fp32) | Trigger output | Low-bit | TMR=100% (INT8) | Robust to quant. |

Current watermarking for LLMs spans a spectrum of inference-time, model-level, and hybrid embedding strategies, with rigorous formal guarantees on detectability, imperceptibility, and robustness under adversarial or benign transformations. Ongoing research is defined by multi-objective optimization, increased embedding capacity, and enhanced forensic resilience, with significant implications for copyright, AI security, and digital provenance. Key works include (Huo et al., 2024, Wouters, 2023, Wang et al., 13 Oct 2025, Feng et al., 19 Jun 2025, Bao et al., 16 Sep 2025, Li et al., 2023, Jiang et al., 5 Jun 2025, Xu et al., 5 Aug 2025, Li et al., 2023, Guo et al., 11 Jul 2025, Lin et al., 30 Nov 2025), and (Liang et al., 2024).
