
Watermarking Large Language Models

Updated 17 January 2026
  • Watermarking for large language models refers to methods that embed invisible statistical markers into outputs or parameters using cryptographic and optimization techniques.
  • Techniques like token-level logit biasing and weight-level embedding enable post-hoc attribution, copyright enforcement, and provenance tracking.
  • Adaptive watermarking schemes balance text quality and robust detection, ensuring resilience against paraphrasing and adversarial modifications.

Watermarking LLMs refers to a family of machine learning, optimization, and cryptographic techniques for embedding imperceptible statistical markers (“watermarks”) into the outputs or parameters of LLMs. The primary goals are to enable reliable post-hoc attribution of generated content (differentiating AI-generated from human-written text), enforce copyright protection, or trace the provenance and usage of both LLM outputs and model parameters themselves. Watermarking methods are increasingly central for countering AI misuse, enforcing accountability, and establishing digital provenance in the face of LLMs whose outputs are syntactically and semantically indistinguishable from human-authored text (Liang et al., 2024).

1. Classification and Formal Definition

LLM watermarking schemes fall into several taxonomic dimensions, each grounded in precise algorithmic formalisms:

  • Embedding domain:
    • Token-level: Modifying next-token probability distributions via real-time logit perturbations (e.g., green/red token lists, KGW).
    • Sentence-level/semantic: Accepting or rejecting candidate sentences in semantic codebooks or clusters, as in semantic watermarks.
    • Model-parameter-level: Direct alteration or fine-tuning of network weights (e.g., knowledge injection, weight quantization, model editing).
    • Character-level/postprocessing: Injecting invisible Unicode sequences or similar encodings.
  • Model access: White-box (full parameter access), black-box (API or outputs only), gray-box, or no-box (content-based).
  • Payload type: Zero-bit (detection only; no ID), multi-bit (traceable payload: user ID, model version), or cryptographic (provable ownership, binding via secret keys) (Liang et al., 2024).
  • Statistical vs. cryptographic: Statistical schemes rely on probabilistic evidence (e.g., green-token counts), while cryptographic approaches leverage PRFs or digital signatures for theoretically undetectable or non-repudiable attribution.

Formally, a watermark is a randomized algorithm pair $(\mathsf{Embed}, \mathsf{Detect})$, parameterized by a secret key $k$:

$x_w = \mathsf{Embed}(x, k), \qquad \mathsf{Detect}(y, k) \to \{\text{marked}, \text{clean}\}$

with the test statistic $T_k(y)$ compared to a threshold calibrated for false-positive rate $\alpha$ (Liang et al., 2024).
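As a concrete illustration of the Detect side, detection reduces to comparing a test statistic against a threshold chosen for a target false-positive rate $\alpha$. The sketch below is hypothetical (not from any cited scheme) and assumes a normal-approximation statistic:

```python
import math

# Illustrative sketch: thresholding a zero-bit test statistic T_k(y).
# All names here are hypothetical; concrete schemes define T_k(y) themselves.

def z_threshold(alpha: float) -> float:
    """Threshold tau such that P(z > tau | clean text) is approximately alpha."""
    # Invert the standard normal survival function by bisection (stdlib only).
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        tail = 0.5 * math.erfc(mid / math.sqrt(2))  # P(Z > mid) for Z ~ N(0, 1)
        lo, hi = (mid, hi) if tail > alpha else (lo, mid)
    return (lo + hi) / 2

def detect(T_k: float, alpha: float = 1e-3) -> str:
    """The Detect(y, k) decision, given a precomputed statistic T_k(y)."""
    return "marked" if T_k > z_threshold(alpha) else "clean"
```

For example, `z_threshold(0.025)` recovers the familiar 1.96 cutoff of a one-sided normal test.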

2. Token-Level Inference-Time Watermarking

The dominant zero- and multi-bit watermarking protocols bias each sampling step to increase the likelihood of selecting tokens from a secret subset (the “green list”), detectable post hoc via statistical testing. A canonical example is the logit-biasing scheme (KGW):

  • Green list generation: For each token step $t$, generate a pseudo-random green list $G_t \subset V$ of fraction $\gamma$ using a PRF seeded with context and key.
  • Logit modification: Add a constant bias $\delta$ to the logits of green-list tokens, leaving the others untouched:

$\tilde{\ell}_t(v) = \ell_t(v) + \delta \cdot \mathbf{1}_{v \in G_t}$

  • Sampling and detection: Tokens are sampled from the modified softmax; detection involves computing the sum $S_T = \sum_{t=1}^{T} \mathbf{1}[x^{(t)} \in G_t]$ and applying an exact binomial test:

$S_T \sim \mathrm{Binomial}(T, \gamma) \quad \text{under } H_0$

The $z$-score is computed as

$z = \dfrac{S_T - T\gamma}{\sqrt{T\gamma(1-\gamma)}}$

and compared to a detection threshold controlling FPR (Fernandez et al., 2023, Liang et al., 2024).
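A minimal end-to-end sketch of this KGW-style loop follows; the toy integer vocabulary, hash-based PRF, and uniform logits are illustrative assumptions, not details of the published scheme:

```python
import math
import random

VOCAB = list(range(1000))  # toy vocabulary of integer token ids

def green_list(prev_token: int, key: int, gamma: float) -> set:
    """Pseudo-randomly pick a gamma-fraction of V, seeded by (context, key)."""
    rng = random.Random(hash((prev_token, key)))
    return set(rng.sample(VOCAB, int(gamma * len(VOCAB))))

def biased_sample(logits, prev_token, key, gamma=0.25, delta=2.0):
    """Add delta to green-token logits, then sample from the modified softmax."""
    G = green_list(prev_token, key, gamma)
    shifted = [l + (delta if v in G else 0.0) for v, l in enumerate(logits)]
    m = max(shifted)
    weights = [math.exp(l - m) for l in shifted]
    return random.choices(VOCAB, weights=weights)[0]

def z_score(tokens, key, gamma=0.25):
    """Green-token count S_T versus the Binomial(T, gamma) null."""
    S = sum(1 for prev, tok in zip(tokens, tokens[1:])
            if tok in green_list(prev, key, gamma))
    T = len(tokens) - 1
    return (S - T * gamma) / math.sqrt(T * gamma * (1 - gamma))
```

With a nontrivial bias (e.g. `delta=4.0`), a few hundred sampled tokens yield a $z$-score far above any practical detection threshold, while unwatermarked token streams stay near zero.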

Pareto-optimal and Adaptive Schemes

Recent work has demonstrated that adaptively tuning the watermark’s insertion strength according to the per-step impact on text quality yields strictly better trade-offs. For a class of watermark shift functions, Pareto optimality is achieved by biasing only steps where the expected text “damage” statistic $B(p_t, G_t)$ is below a threshold $\beta$:

$\Delta_{\mathrm{OPT}}(p_t, G_t) = \begin{cases} 1 - \Gamma_t, & B(p_t, G_t) \leq \beta \\ 0, & \text{otherwise} \end{cases}$

where $\Gamma_t$ is the base probability of green tokens and $B$ measures text distortion (Wouters, 2023). More advanced techniques extend this to token-specific adaptation using neural predictors, optimizing detection and semantic loss jointly via multi-objective (Pareto frontier) methods and lightweight plug-in MLPs (Huo et al., 2024, Wang et al., 13 Oct 2025).
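The gating rule can be sketched as follows; the damage statistic $B$ used here is a stand-in (model confidence, i.e. the maximum next-token probability), not the distortion measure defined in the cited work:

```python
def delta_opt(p, green, beta=0.9):
    """Per-step watermark shift: apply the full shift 1 - Gamma_t only when
    the (stand-in) damage statistic B(p_t, G_t) is at most beta.

    p:     next-token distribution as a list indexed by token id
    green: set of green-token ids G_t
    """
    Gamma = sum(p[v] for v in green)  # base probability mass on green tokens
    B = max(p)                        # toy damage proxy: model confidence
    return (1.0 - Gamma) if B <= beta else 0.0
```

On a flat distribution the step is watermarked at full strength; on a near-deterministic step (one token carrying most of the mass) the rule leaves the distribution untouched.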

| Method | Bias Adaptation | Key Optimization Principle |
|---|---|---|
| KGW | Fixed $\gamma, \delta$ | Logit bias on random token lists |
| OPT (Wouters, 2023) | Per-step ($\beta$) | Watermark only if “damage” is low |
| Token-Specific | Per-token via MLP | Multi-objective (detect + coherence) |

3. Multi-Bit and High-Capacity Watermarking

Identifiable watermarking—embedding user IDs, provenance, or message strings—requires multi-bit payload transmission within LLM outputs. State-of-the-art schemes leverage position allocation and per-token (or block-wise) partitioning strategies. Examples include:

  • Position-Allocation Multi-bit Watermark (MPAC): Converts the payload into $r$-ary digits and, at each position, selects a vocabulary chunk to bias in correspondence with the desired code symbol. Decoding is performed by chunk-wise counting, enabling robust recovery even in adversarial or mixed-content scenarios (Yoo et al., 2023).
  • Majority Bit-Aware (MajorMark): Exploits the statistical prevalence of the majority bit in each message to maximally enlarge the preferred token set, preserving fluency at higher capacity. Decoding uses clustering over occurrence vectors, and a block-wise deterministic variant further amplifies both capacity and accuracy (Xu et al., 5 Aug 2025).

Recent multi-bit embedding algorithms often achieve bit-accuracies above 90% for payloads in the 32–64 bit range, retain zero-bit detection capability, and demonstrate resistance to deletion, insertion, and moderate paraphrasing attacks (Xu et al., 5 Aug 2025, Feng et al., 19 Jun 2025, Lin et al., 4 Feb 2025, Jiang et al., 5 Jun 2025).
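A noise-free sketch of position-allocation embedding in the spirit of MPAC follows; the radix, chunk layout, and keyed RNG are illustrative choices, and a real scheme biases logits rather than forcing tokens into the favored chunk:

```python
import random

R = 4             # radix of the payload digits (hypothetical choice)
VOCAB_SIZE = 1000 # toy vocabulary, partitioned into R equal chunks

def chunk_of(token: int) -> int:
    """Which of the R vocabulary chunks a token id falls into."""
    return token * R // VOCAB_SIZE

def encode(digits, n_tokens, key):
    """Simulate generation: a keyed RNG allocates each step to a digit
    position, and the token is drawn from that digit's vocabulary chunk."""
    pos_rng = random.Random(key)      # position allocation stream (shared)
    samp = random.Random(key + 1)     # stand-in for the model's sampling
    tokens = []
    for _ in range(n_tokens):
        pos = pos_rng.randrange(len(digits))
        d = digits[pos]
        lo, hi = d * VOCAB_SIZE // R, (d + 1) * VOCAB_SIZE // R
        tokens.append(samp.randrange(lo, hi))
    return tokens

def decode(tokens, n_digits, key):
    """Recover each digit by chunk-wise counting over its allocated tokens."""
    pos_rng = random.Random(key)      # replays the same allocation stream
    counts = [[0] * R for _ in range(n_digits)]
    for t in tokens:
        pos = pos_rng.randrange(n_digits)
        counts[pos][chunk_of(t)] += 1
    return [max(range(R), key=lambda d: c[d]) for c in counts]
```

Because encoder and decoder replay the same keyed allocation stream, chunk-wise majority counting recovers the payload; in a real system the counts are merely biased, and the same majority vote absorbs the noise.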

4. Model-Centric and Weight-Level Watermarks

Watermarking may also target LLM model parameters themselves for copyright or ownership verification:

  • Knowledge Injection: Embedding multi-bit information by fine-tuning on synthetic Q&A exemplars that encode a hidden message in mathematical function definitions or general knowledge (Li et al., 2023).
  • Weight Quantization Watermarking: Embedding a trigger signature in the fp32 weights while preserving the quantized INT8 model, enabled by optimizing within quantization intervals; robust to quantization but not fine-tuning (Li et al., 2023).
  • Invariant-Based Embedding: Modifying embedding-layer weights to satisfy secret linear invariants derived from model parameters, offering robust resistance to pruning, quantization, permutation, scaling, and even multi-user collusion, with statistical decision rules and negligible loss in downstream metrics (Guo et al., 11 Jul 2025).

These approaches typically deliver <0.5% accuracy drop on standard tasks, support rapid verification, and can survive heavy model-level perturbations provided the attack does not erase the invariant or the trigger set.
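A toy version of an invariant-style weight watermark can be sketched as below; the single-vector projection is a deliberate simplification of the secret linear invariants described above, and all names are hypothetical:

```python
import random

def embed_invariant(w, key, s=1.0):
    """Nudge a weight vector w so that its projection onto a secret keyed
    direction u equals the target signature s, with minimal L2 change."""
    rng = random.Random(key)
    u = [rng.gauss(0, 1) for _ in w]                       # secret direction
    dot = sum(ui * wi for ui, wi in zip(u, w))
    norm2 = sum(ui * ui for ui in u)
    # w' = w + (s - <u, w>) u / |u|^2  =>  <u, w'> = s exactly
    return [wi + (s - dot) * ui / norm2 for ui, wi in zip(u, w)]

def verify_invariant(w, key, s=1.0, tol=1e-6):
    """Statistical/exact check: does the keyed projection match the signature?"""
    rng = random.Random(key)
    u = [rng.gauss(0, 1) for _ in w]
    return abs(sum(ui * wi for ui, wi in zip(u, w)) - s) < tol
```

Without the key, the projection of the watermarked weights onto a fresh random direction is an essentially arbitrary value, so verification fails; the cited schemes harden this idea against pruning, permutation, and scaling.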

5. Quality–Robustness Trade-offs and Detection Guarantees

The core challenges in watermarking are optimizing the trade-off curve between detectability (true-positive rate at low false-positive rates), generation quality (textual fluency, perplexity, semantic similarity), and capacity (bits/token for multi-bit schemes):

  • Pareto-Optimality: For token-level methods, optimal trade-offs are realized by only watermarking when expected text damage is minimal (Wouters, 2023).
  • Adaptive Selective Watermarking: Employing selector networks (MLPs) that gate watermark insertion based on semantic and entropy features, yielding Pareto-dominant results compared to hand-tuned or fixed-threshold methods (Wang et al., 13 Oct 2025, Huo et al., 2024).
  • Unbiased Multilayer Watermarking: Protocols such as BiMark preserve the output distribution exactly in expectation, enabling model-agnostic, message-agnostic detection with high extraction rates at negligible perplexity cost (Feng et al., 19 Jun 2025, Chen et al., 16 Feb 2025, Jiang et al., 5 Jun 2025).
  • Statistical Testing: Detection is canonically formulated as an exact binomial or normal z-test, with closed-form FPR/TPR guarantees. Advanced protocols employ chunk-level or multi-seed search-based detection to increase robustness to insertion/deletion or semantic attacks (Lin et al., 30 Nov 2025, Fernandez et al., 2023).
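The exact binomial test underlying these detectors can be written directly with the standard library; the default $\gamma$ and $\alpha$ values below are illustrative:

```python
from math import comb

def binomial_pvalue(s, T, gamma):
    """Exact p-value: P(S_T >= s) under the null S_T ~ Binomial(T, gamma)."""
    return sum(comb(T, k) * gamma**k * (1 - gamma)**(T - k)
               for k in range(s, T + 1))

def is_marked(s, T, gamma=0.25, alpha=1e-3):
    """Declare 'marked' when the observed green-token count is too extreme
    to be explained by chance at false-positive rate alpha."""
    return binomial_pvalue(s, T, gamma) < alpha
```

For $T = 200$ and $\gamma = 0.25$ (expected 50 green tokens), observing 80 is overwhelming evidence of a watermark, while 55 is comfortably within the null.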

6. Limitations, Attacks, and Future Directions

Practical deployment of LLM watermarking faces several open technical challenges:

  • Adversarial robustness: The watermark signal is degraded by aggressive paraphrasing, insertion/deletion, or re-generation beyond 20–30% edit rates for most token-level schemes (Bao et al., 16 Sep 2025, Huo et al., 2024).
  • Key management and revocation: For cryptographic schemes, secure distribution and possible revocation of watermark keys pose system-level difficulties.
  • Capacity vs. quality: Lifting per-token capacity without statistical or semantic degradation remains a leading open question, motivating blockwise, position-based, and distribution-preserving embedding algorithms.
  • Multilingual/multimodal extension: Generalizing watermarking paradigms to support non-English, code, audio, image, or cross-modal LLMs is an active area of research (Liang et al., 2024).
  • Unified benchmarking: There is a recognized need for standardized datasets and robust evaluation pipelines, encompassing TPR/FPR, PPL/BLEU/ROUGE, attack scenarios, and detection time/overhead (Bao et al., 16 Sep 2025, Liang et al., 2024).

7. Summary Table: Representative Families of LLM Watermarks

| Approach | Embedding Domain | Detection Mode | Capacity | Quality Impact | Robustness |
|---|---|---|---|---|---|
| KGW/OPT | Token logit reweight | Green-token binomial | Zero/multi | $\leq 1$ PPL (tuned) | Moderate (edits) |
| Token-specific MOO | Token logit, MLP | Differentiable z-stat | Zero | Pareto-optimized | Superior (attacks) |
| MajorMark/MPAC/DERMARK | Token position/block | Cluster/count decode | Multi-bit | Manageable (~6–7 PPL) | High (block-based) |
| Weight Invariant | Embedding/param layer | Invariant linear test | Multi-bit | $<0.5\%$ task delta | High (model-level) |
| Knowledge Injection | Data, fine-tuning | Output Q&A parse | Multi-bit | Negligible (0.3%) | Paraphrase-tolerant |
| WaterSearch | Chunkwise search | Chunk-wise binomial | Zero/multi | 50% task gain @95% | Up to 80% edits |
| Quantized WM | Weight perturb (fp32) | Trigger output | Low-bit | TMR=100% (INT8) | Robust to quant. |

Current watermarking for LLMs spans a spectrum of inference-time, model-level, and hybrid embedding strategies, with rigorous formal guarantees on detectability, imperceptibility, and robustness under adversarial or benign transformations. Ongoing research is defined by multi-objective optimization, increased embedding capacity, and enhanced forensic resilience, with significant implications for copyright, AI security, and digital provenance. Key works include (Huo et al., 2024, Wouters, 2023, Wang et al., 13 Oct 2025, Feng et al., 19 Jun 2025, Bao et al., 16 Sep 2025, Li et al., 2023, Jiang et al., 5 Jun 2025, Xu et al., 5 Aug 2025, Li et al., 2023, Guo et al., 11 Jul 2025, Lin et al., 30 Nov 2025), and (Liang et al., 2024).
